CMS VM tasks started to fail suddenly
Author | Message |
---|---|
Joined: 28 Sep 04 Posts: 760 Credit: 54,077,868 RAC: 41,326 |
So I am running CMS on a Windows 10 machine in a VM. Today I ran a few ATLAS tasks with no problems (still running them OK), but now new CMS tasks are failing. Snippet from stderr:
2024-08-12 16:57:44 (2368): Guest Log: [INFO] Mounting the shared directory
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Sourcing essential functions from /cvmfs/grid.cern.ch
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to cern.ch
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to VCCS
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to HTCondor
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to WMAgent
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to EOSCMS
2024-08-12 16:57:46 (2368): Guest Log: [INFO] Testing connection to CMS-Factory
2024-08-12 16:58:01 (2368): Guest Log: [DEBUG] Status run 1 of up to 3: 1
2024-08-12 16:58:24 (2368): Guest Log: [DEBUG] Status run 2 of up to 3: 1
2024-08-12 16:58:53 (2368): Guest Log: [DEBUG] Status run 3 of up to 3: 1
2024-08-12 16:58:53 (2368): Guest Log: [DEBUG] run 1
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Connection to 137.138.55.253 failed: Connection timed out.
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Trying next address...
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Network is unreachable.
2024-08-12 16:58:53 (2368): Guest Log: run 2
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Connection to 137.138.55.253 failed: Connection timed out.
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Trying next address...
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Network is unreachable.
2024-08-12 16:58:53 (2368): Guest Log: run 3
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-08-12 16:58:53 (2368): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-08-12 16:58:53 (2368): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.55.253:80 (IOD #1) EID 8
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT TIMEOUT for EID 8 [137.138.55.253:80]
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Connection to 137.138.55.253 failed: Connection timed out.
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Trying next address...
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 2001:1458:d00:17::13:80 (IOD #1) EID 16
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Network is unreachable (101)] for EID 16 [2001:1458:d00:17::13:80]
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Network is unreachable.
2024-08-12 16:58:53 (2368): Guest Log: [ERROR] Could not connect to vocms0205.cern.ch on port 80
2024-08-12 16:58:53 (2368): Guest Log: [INFO] Testing connection to CMS-Frontier
2024-08-12 16:58:54 (2368): Guest Log: [INFO] Testing connection to Frontier
2024-08-12 16:58:54 (2368): Guest Log: [DEBUG] Check your firewall and your network load
2024-08-12 16:58:54 (2368): Guest Log: [ERROR] Could not connect to all required network services
2024-08-12 16:58:54 (2368): Guest Log: [DEBUG] Volunteer: Harri Liljeroos (2739)
2024-08-12 16:58:54 (2368): Guest Log: [INFO] Shutting Down.
It is still downloading new CMS tasks without a problem, so my network should be OK. |
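For anyone who wants to repeat this kind of check from the host side, here is a minimal Python sketch of the test the guest log shows: up to three TCP connection attempts to a host and port. The function name `can_connect` and the timeout value are illustrative assumptions and are not taken from the CMS bootstrap script. A check from the host does not go through the VM's network path, so a success here does not prove the guest can connect.

```python
import socket


def can_connect(host, port, attempts=3, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `attempts` tries.

    socket.create_connection resolves the name and tries each returned
    address (IPv4 and IPv6) in turn, similar to ncat's "Trying next address".
    """
    for attempt in range(1, attempts + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:  # covers timeouts, refusals and DNS failures
            print(f"run {attempt}: connection to {host}:{port} failed: {exc}")
    return False


# Example call (performs real network traffic, so it is left commented out):
# print(can_connect("vocms0205.cern.ch", 80))
```

The host name and port in the commented example are the ones named in the `[ERROR]` line of the log above.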
Joined: 18 Dec 15 Posts: 1863 Credit: 132,034,900 RAC: 111,486 |
Same here, see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=413454653 Unfortunately, I only noticed this problem after some time, so there are tons of failed tasks :-( What's the problem? |
Joined: 15 Jun 08 Posts: 2629 Credit: 268,650,265 RAC: 133,962 |
2024-08-12 16:58:53 (2368): Guest Log: [ERROR] Could not connect to vocms0205.cern.ch on port 80
Looks like that (essential) system is down. Just sent a mail to CERN to make them aware. |
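To spot the failing service quickly in a long stderr dump, the `[ERROR] Could not connect to ... on port ...` lines can be pulled out with a regular expression. This is a rough sketch that assumes the message wording is exactly as in the log quoted in this thread; the function name `unreachable_services` is made up for illustration.

```python
import re

# Matches lines such as:
#   [ERROR] Could not connect to vocms0205.cern.ch on port 80
# The wording is assumed from the log in this thread and may differ elsewhere.
ERROR_RE = re.compile(
    r"\[ERROR\] Could not connect to (?P<host>\S+) on port (?P<port>\d+)"
)


def unreachable_services(stderr_text):
    """Return the sorted, unique (host, port) pairs named in connection errors."""
    found = {(m["host"], int(m["port"])) for m in ERROR_RE.finditer(stderr_text)}
    return sorted(found)
```

The summary line "Could not connect to all required network services" names no host or port, so the pattern skips it.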
Joined: 15 Jun 08 Posts: 2629 Credit: 268,650,265 RAC: 133,962 |
Looks like vocms0205.cern.ch is back. |
Joined: 28 Sep 04 Posts: 760 Credit: 54,077,868 RAC: 41,326 |
I am suddenly having this same problem again. Failing to connect to these same services. |
Joined: 3 Nov 12 Posts: 69 Credit: 156,312,027 RAC: 118,313 |
"I am suddenly having this same problem again. Failing to connect to these same services."
Same here: "Connection to 137.138.55.253 failed: Connection timed out", again and again since yesterday. |
Joined: 15 Jun 08 Posts: 2629 Credit: 268,650,265 RAC: 133,962 |
Sent a mail to CERN this morning and asked them to investigate. Got this answer:
"I have notified the problem again to the CMS Submission and Infrastructure team. The problem is they are doing the update of the factory (and other services) to el9 and we need new glidein wrappers to use the new one."
Notes:
Upgrades to el9 are a must since older Linux versions used by CERN are no longer supported. See: https://linux.web.cern.ch/
"factory" means vocms0205.cern.ch, which is the system currently not responding.
I don't know why the BOINC service still creates tasks, but from experience I would guess they let it run to test whether all changes finally lead to a fully operational service chain. |
Joined: 29 Aug 05 Posts: 1082 Credit: 8,889,210 RAC: 12,892 |
"Sent a mail to CERN this morning and asked to investigate."
Sorry I've been quiet -- it was a long weekend here so I was out of the loop. As noted above, we are waiting for some upgrades from Submission and Infrastructure. There are jobs in the condor pool, so the BOINC server is creating tasks -- which fail due to the above problem. By now I expect most people will be in my situation: no new tasks because of all the failures (in fact there are no volunteer hosts asking for condor jobs). It's probably best to set No New Tasks for the interim. We'll let you know when things change. |
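For hosts managed from the command line, No New Tasks can also be set with `boinccmd`, whose `--project` option accepts the `nomorework` and `allowmorework` operations. The sketch below only builds the argument list; the project URL is an assumption based on the links in this thread and should match the URL shown by your own BOINC client. Note that the setting applies to the whole project, not only to CMS.

```python
import shlex

# Assumed project URL; check the URL your BOINC client lists for LHC@home.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def no_new_tasks_command(project_url=PROJECT_URL, allow=False):
    """Build the boinccmd argument list that stops (or re-allows) new work."""
    operation = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, operation]


# Print the command lines instead of running them, so nothing changes by accident.
print(shlex.join(no_new_tasks_command()))
print(shlex.join(no_new_tasks_command(allow=True)))
```

To actually apply it, the list can be passed to `subprocess.run` on a machine where `boinccmd` is on the PATH and has access to the client's GUI RPC password.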
Joined: 29 Aug 05 Posts: 1082 Credit: 8,889,210 RAC: 12,892 |
We now have the new glide-in updates, just need to get them installed and checked out. |
Joined: 29 Aug 05 Posts: 1082 Credit: 8,889,210 RAC: 12,892 |
Unfortunately, we might still have a firewall problem. My first task failed because it couldn't establish a connection to the CMS-Factory. I'm trying to run on another machine (my other PCs are at quota limit because of yesterday's failures). |
Joined: 28 Sep 04 Posts: 760 Credit: 54,077,868 RAC: 41,326 |
Two tasks running here at the moment. They are at 2 hours so far. |
Joined: 29 Aug 05 Posts: 1082 Credit: 8,889,210 RAC: 12,892 |
|
Joined: 18 Dec 15 Posts: 1863 Credit: 132,034,900 RAC: 111,486 |
I started several tasks 3 hours ago and they have been running fine. |
Joined: 28 Sep 04 Posts: 760 Credit: 54,077,868 RAC: 41,326 |
For the past hour and a half all new tasks have been failing with this same error. During the weekend I had these errors every once in a while but also some successful tasks. |
Joined: 18 Dec 15 Posts: 1863 Credit: 132,034,900 RAC: 111,486 |
"For the past hour and a half all new tasks are failing with this same error. During the weekend I've had these errors every once in a while but also some successful ones."
Hm, that's strange. Here, all tasks on several hosts are running okay. |
Joined: 28 Sep 04 Posts: 760 Credit: 54,077,868 RAC: 41,326 |
I allowed ATLAS and CMS tasks for my hosts, so it is now up to the server to decide what it sends me. So far only ATLAS has been downloaded. |
©2025 CERN