Message boards : CMS Application : CMS VM tasks started to fail suddenly
Harri Liljeroos
Joined: 28 Sep 04
Posts: 760
Credit: 54,077,868
RAC: 41,326
Message 50546 - Posted: 12 Aug 2024, 14:09:25 UTC

I am running CMS in a VM on a Windows 10 machine. Today I ran a few ATLAS tasks with no problems (they are still running OK), but new CMS tasks are now failing. Snippet from stderr:
2024-08-12 16:57:44 (2368): Guest Log: [INFO] Mounting the shared directory
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Sourcing essential functions from /cvmfs/grid.cern.ch
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to cern.ch
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to VCCS
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to HTCondor
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to WMAgent
2024-08-12 16:57:45 (2368): Guest Log: [INFO] Testing connection to EOSCMS
2024-08-12 16:57:46 (2368): Guest Log: [INFO] Testing connection to CMS-Factory
2024-08-12 16:58:01 (2368): Guest Log: [DEBUG] Status run 1 of up to 3: 1
2024-08-12 16:58:24 (2368): Guest Log: [DEBUG] Status run 2 of up to 3: 1
2024-08-12 16:58:53 (2368): Guest Log: [DEBUG] Status run 3 of up to 3: 1
2024-08-12 16:58:53 (2368): Guest Log: [DEBUG] run 1
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Connection to 137.138.55.253 failed: Connection timed out.
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Trying next address...
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Network is unreachable.
2024-08-12 16:58:53 (2368): Guest Log: run 2
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Connection to 137.138.55.253 failed: Connection timed out.
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Trying next address...
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Network is unreachable.
2024-08-12 16:58:53 (2368): Guest Log: run 3
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-08-12 16:58:53 (2368): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-08-12 16:58:53 (2368): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.55.253:80 (IOD #1) EID 8
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT TIMEOUT for EID 8 [137.138.55.253:80]
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Connection to 137.138.55.253 failed: Connection timed out.
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Trying next address...
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 2001:1458:d00:17::13:80 (IOD #1) EID 16
2024-08-12 16:58:53 (2368): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Network is unreachable (101)] for EID 16 [2001:1458:d00:17::13:80]
2024-08-12 16:58:53 (2368): Guest Log: Ncat: Network is unreachable.
2024-08-12 16:58:53 (2368): Guest Log: [ERROR] Could not connect to vocms0205.cern.ch on port 80
2024-08-12 16:58:53 (2368): Guest Log: [INFO] Testing connection to CMS-Frontier
2024-08-12 16:58:54 (2368): Guest Log: [INFO] Testing connection to Frontier
2024-08-12 16:58:54 (2368): Guest Log: [DEBUG] Check your firewall and your network load
2024-08-12 16:58:54 (2368): Guest Log: [ERROR] Could not connect to all required network services
2024-08-12 16:58:54 (2368): Guest Log: [DEBUG] Volunteer: Harri Liljeroos (2739)
2024-08-12 16:58:54 (2368): Guest Log: [INFO] Shutting Down.

It is still downloading new CMS tasks without a problem, so my network should be OK.
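
For anyone who wants to check quickly which service a failed task complained about, here is a minimal Python sketch that scans a saved stderr text for the `[ERROR] Could not connect to <host> on port <port>` lines shown above. The function name `failed_services` and the regular expression are illustrative assumptions based only on this snippet, not part of BOINC or the CMS VM scripts.

```python
import re

# Assumption: the error line always has the form seen in the snippet above.
ERROR_RE = re.compile(
    r"\[ERROR\] Could not connect to (?P<host>\S+) on port (?P<port>\d+)"
)


def failed_services(stderr_text):
    """Return a sorted list of unique (host, port) pairs reported as unreachable."""
    found = set()
    for line in stderr_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            found.add((match.group("host"), int(match.group("port"))))
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "2024-08-12 16:58:53 (2368): Guest Log: [ERROR] Could not connect "
        "to vocms0205.cern.ch on port 80\n"
        "2024-08-12 16:58:54 (2368): Guest Log: [ERROR] Could not connect "
        "to all required network services\n"
    )
    print(failed_services(sample))
```

The summary line "Could not connect to all required network services" is skipped on purpose, because it names no host or port.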
Erich56

Joined: 18 Dec 15
Posts: 1863
Credit: 132,034,900
RAC: 111,486
Message 50547 - Posted: 12 Aug 2024, 14:33:51 UTC - in response to Message 50546.  

same here, see:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=413454653

Unfortunately, I noticed this problem only after some time. So tons of failed tasks :-(

What's the problem?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2629
Credit: 268,650,265
RAC: 133,962
Message 50548 - Posted: 12 Aug 2024, 15:52:40 UTC - in response to Message 50546.  

2024-08-12 16:58:53 (2368): Guest Log: [ERROR] Could not connect to vocms0205.cern.ch on port 80

Looks like that (essential) system is down.
Just sent a mail to CERN to make them aware.
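
To see from your own side whether the factory answers, a rough equivalent of the three Ncat runs in the log can be sketched in Python. This is only an assumption-laden illustration: `can_connect` and its parameters are hypothetical names, it runs on the host rather than inside the VM, and it only tests whether a plain TCP connection to the port can be opened.

```python
import socket
import time


def can_connect(host, port, attempts=3, timeout=10.0, pause=1.0):
    """Try a plain TCP connect up to `attempts` times; return True on first success."""
    for attempt in range(1, attempts + 1):
        try:
            # create_connection tries each resolved address (IPv4/IPv6) in turn.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:  # covers timeouts, refusals and DNS failures
            print(f"run {attempt}: {host}:{port} failed: {exc}")
            if attempt < attempts:
                time.sleep(pause)
    return False


if __name__ == "__main__":
    # Host and port taken from the error line quoted above.
    print(can_connect("vocms0205.cern.ch", 80))
```

If this returns `False` while other CERN services are reachable, the problem is more likely on the server side than in your local firewall.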
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2629
Credit: 268,650,265
RAC: 133,962
Message 50549 - Posted: 12 Aug 2024, 17:39:37 UTC - in response to Message 50548.  

Looks like vocms0205.cern.ch is back.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 760
Credit: 54,077,868
RAC: 41,326
Message 50576 - Posted: 25 Aug 2024, 21:04:33 UTC

I am suddenly having this same problem again: tasks are failing to connect to the same services.
Saturn911

Joined: 3 Nov 12
Posts: 69
Credit: 156,312,027
RAC: 118,313
Message 50577 - Posted: 26 Aug 2024, 14:16:17 UTC - in response to Message 50576.  

same here:
Connection to 137.138.55.253 failed: Connection timed out
again and again since yesterday
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2629
Credit: 268,650,265
RAC: 133,962
Message 50578 - Posted: 26 Aug 2024, 14:54:21 UTC - in response to Message 50577.  

Sent a mail to CERN this morning and asked to investigate.
Got this answer:

"I have notified the problem again to the CMS Submission and Infrastructure team.
The problem is they are doing the update of the factory (and other services) to el9 and we need new glidein wrappers to use the new one."

Notes:
Upgrades to el9 are a must since the older Linux versions used by CERN are no longer supported.
See: https://linux.web.cern.ch/
"factory" means vocms0205.cern.ch, which is the system currently not responding.


I don't know why the BOINC service still creates tasks, but from experience I would guess they let it run to test whether all the changes finally lead to a fully operational service chain.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1082
Credit: 8,889,210
RAC: 12,892
Message 50582 - Posted: 27 Aug 2024, 11:05:45 UTC - in response to Message 50578.  

Sorry I've been quiet -- it was a long weekend here so I was out of the loop.
As noted above, we are waiting for some upgrades from Submission and Infrastructure. There are jobs in the condor pool, so the BOINC server is creating tasks -- which fail due to the above problem. By now I expect most people will be in my situation, no new tasks because of all the failures (in fact there are no volunteer hosts asking for condor jobs). It's probably best to set No New Tasks for the interim.
We'll let you know when things change.
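
If you prefer the command line to the BOINC Manager button, the sketch below wraps `boinccmd --project <URL> nomorework` in Python. Treat it as a sketch under assumptions: the project URL must match the one your client is attached with (the value here is a guess based on the result link earlier in this thread), `boinccmd` must be on your PATH, and the setting applies to the whole project, not just CMS.

```python
import subprocess

# Assumption: replace with the exact project URL shown by your BOINC client.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def no_new_tasks_command(project_url, allow=False):
    """Build the boinccmd argument list to stop (or re-allow) fetching new work."""
    operation = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, operation]


def set_no_new_tasks(project_url, allow=False):
    """Run boinccmd and return its exit code; requires a running BOINC client."""
    result = subprocess.run(no_new_tasks_command(project_url, allow), check=False)
    return result.returncode


if __name__ == "__main__":
    # Print the command only; call set_no_new_tasks(PROJECT_URL) to execute it.
    print(" ".join(no_new_tasks_command(PROJECT_URL)))
```

Calling `set_no_new_tasks(PROJECT_URL, allow=True)` later should re-enable work fetch once the service chain is back.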
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1082
Credit: 8,889,210
RAC: 12,892
Message 50583 - Posted: 27 Aug 2024, 13:55:14 UTC - in response to Message 50582.  


We now have the new glide-in updates, just need to get them installed and checked out.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1082
Credit: 8,889,210
RAC: 12,892
Message 50584 - Posted: 27 Aug 2024, 15:39:56 UTC - in response to Message 50583.  


Unfortunately, we might still have a firewall problem. My first task failed because it couldn't establish a connection to the CMS-Factory. I'm trying to run on another machine (my other PCs are at quota limit because of yesterday's failures).
Harri Liljeroos
Joined: 28 Sep 04
Posts: 760
Credit: 54,077,868
RAC: 41,326
Message 50586 - Posted: 27 Aug 2024, 17:47:10 UTC

Two tasks running here at the moment. They are at 2 hours so far.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1082
Credit: 8,889,210
RAC: 12,892
Message 50587 - Posted: 28 Aug 2024, 2:05:14 UTC - in response to Message 50586.  

Yes, I've had another -dev task fail, and a standard one, but another standard one is still ticking along. The running-job graph continues to rise pleasingly so a lot of people are having success.
Erich56

Joined: 18 Dec 15
Posts: 1863
Credit: 132,034,900
RAC: 111,486
Message 50589 - Posted: 28 Aug 2024, 13:34:15 UTC - in response to Message 50587.  

I started several tasks 3 hours ago; they have been running fine so far.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 760
Credit: 54,077,868
RAC: 41,326
Message 50602 - Posted: 1 Sep 2024, 10:33:27 UTC

For the past hour and a half, all new tasks have been failing with this same error. Over the weekend I had these errors every once in a while, but also some successful tasks.
Erich56

Joined: 18 Dec 15
Posts: 1863
Credit: 132,034,900
RAC: 111,486
Message 50603 - Posted: 1 Sep 2024, 13:11:47 UTC - in response to Message 50602.  

Hm, that's strange. Here, all tasks on several hosts are running okay.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 760
Credit: 54,077,868
RAC: 41,326
Message 50604 - Posted: 1 Sep 2024, 17:44:17 UTC

I have allowed both ATLAS and CMS tasks for my hosts, so it is now up to the server to decide what it sends me. So far only ATLAS tasks have been downloaded.
