Message boards : CMS Application : CMS tasks failing
| Author | Message |
|---|---|
|
Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,579,777 RAC: 14,842 |
Hi Fede; The messages I see are: 2025-12-22 15:01:44 (2296): Guest Log: [INFO] Environment HTTP proxy: not set 2025-12-22 15:01:44 (2296): Guest Log: [INFO] Reading volunteer information 2025-12-22 15:01:49 (2296): Guest Log: [INFO] Requesting an X509 credential from LHC@home 2025-12-22 15:01:50 (2296): Guest Log: [INFO] Requesting an idtoken from LHC@home 2025-12-22 15:01:51 (2296): Guest Log: [INFO] CMS application starting. Check log files. 2025-12-22 15:28:18 (2296): Guest Log: [INFO] glidein exited with return value 0. 2025-12-22 15:28:18 (2296): Guest Log: [INFO] Shutting Down. 2025-12-22 15:28:18 (2296): VM Completion File Detected. 2025-12-22 15:28:18 (2296): VM Completion Message: glidein exited with return value 0. So the problem seems to be that the job doesn't start running in the task VM, and times out. They aren't showing up as failures in Grafana, nor in WMStats as far as I can see. |
Magic Quantum MechanicSend message Joined: 24 Oct 04 Posts: 1261 Credit: 92,106,969 RAC: 109,679 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431151623 Several hundred of these starting around 21 Dec 2025, 18:21:32 UTC |
|
Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,018,218 RAC: 107,484 |
Ivan - any hope that CMS will be back before the holiday season starts? |
|
Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,579,777 RAC: 14,842 |
"Ivan - any hope that CMS will be back before the holiday season starts?" I doubt it; most people are on holiday. Neither Federica nor I have access to the condor servers at the moment to pore through logs, and efforts to contact those who might be able to help have so far been fruitless. Sorry 'bout that! |
|
Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,018,218 RAC: 107,484 |
Ivan, thank you for your continued efforts anyway. What would make sense though is to stop distribution of tasks - are you able to do this? |
|
Send message Joined: 7 May 08 Posts: 266 Credit: 2,118,147 RAC: 2,032 |
"Well, thanks everyone for your patience this year. It's been frustrating for me, being forced into retirement and gradually losing my accounts and access while other things crumble as well. I'm not particularly looking forward to next year, but I'll try to soldier on for a while." Thank you for all the time and the effort for this project!! |
|
Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,018,218 RAC: 107,484 |
"Thank you for all the time and the effort for this project!!" +1 |
|
Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,579,777 RAC: 14,842 |
"Ivan - any hope that CMS will be back before the holiday season starts?" Well, one of the chaps from the CMS Submissions Infrastructure group managed to manually run one of our jobs yesterday (you can see it on the jobs graph if you adjust the time range appropriately :-), so I'm back to my original hypothesis that there has been a change in the permissions chain that we haven't yet picked up. |
Magic Quantum MechanicSend message Joined: 24 Oct 04 Posts: 1261 Credit: 92,106,969 RAC: 109,679 |
|
|
Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,579,777 RAC: 14,842 |
Wake up! I was right, an IDtoken had expired, but we had to wait for people to get back from holidays to identify this and ask for a new one to be generated. The old one had a lifetime of 1 year, so memories were dim... I'll set an alarm in my Google calendar so that at least I'll be reminded of the need to renew it next New Year! |
|
Send message Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,216 RAC: 970 |
Thanks Ivan and Happy New Year. 4 cmsExternalGenerators extGen870+ running! Grafana shows 445 jobs running. |
Magic Quantum MechanicSend message Joined: 24 Oct 04 Posts: 1261 Credit: 92,106,969 RAC: 109,679 |
Thanks Ivan, and good thing my computers work in their sleep. Mine started doing actual work again at 5 Jan 2026, 18:25:45 UTC. Around 120 Valids since that time. |
Magic Quantum MechanicSend message Joined: 24 Oct 04 Posts: 1261 Credit: 92,106,969 RAC: 109,679 |
I just checked over at -dev and those CMS tasks are still not doing actual work. |
Magic Quantum MechanicSend message Joined: 24 Oct 04 Posts: 1261 Credit: 92,106,969 RAC: 109,679 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914 https://lhcathome.cern.ch/lhcathome/result.php?resultid=431752361 I see hundreds crashing over and over now, while some on a few other hosts are not. It looks like the failed ones are all on Windows hosts. |
Magic Quantum MechanicSend message Joined: 24 Oct 04 Posts: 1261 Credit: 92,106,969 RAC: 109,679 |
<message> The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified. (0xd0) - exit code 208 (0xd0)</message> Unbelievable. Are you trying to promote Linux again? |
|
Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,018,218 RAC: 107,484 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914 I had several tasks failing last night on some of my Windows hosts, but a few hours later everything was back to normal. |
|
Send message Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,216 RAC: 970 |
"I had several tasks failing last night, on some of my Windows hosts, but a few hours later everything was back to normal." Almost normal. For me, the first job only started after an init phase of more than 30 minutes. |
Magic Quantum MechanicSend message Joined: 24 Oct 04 Posts: 1261 Credit: 92,106,969 RAC: 109,679 |
I'm still getting these: <message> The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified. (0xd0) - exit code 208 (0xd0)</message> Back to Theory. Volunteer Mad Scientist For Life. Unbelievable. Are you trying to promote Linux again? |
|
Send message Joined: 18 Dec 15 Posts: 1941 Credit: 156,018,218 RAC: 107,484 |
Here, after many such tasks last night, today I had just one - downloaded at 19 Jan 2026, 12:42:39 UTC, started about half an hour later, and crashed after some 3 minutes. |
|
Send message Joined: 29 Aug 05 Posts: 1134 Credit: 11,579,777 RAC: 14,842 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914 The most prominent failure I see at the moment is this: Fatal Exception (Exit Code: 8002) An exception of category 'StdException' occurred while [0] Constructing the EventProcessor [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag' Exception Message: A std::exception was thrown. Can not get data (Additional Information: [frontier.c:1170]: No more servers/proxies. Last error was: Request 223 on chan 1 failed at Tue Jan 20 15:16:12 2026: -9 [fn-socket.c:107]: network error on connect to [2606:4700:3034::ac43:9e0b]:8080: Network is unreachable) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" ) These come mainly from just a few volunteers. We're not sure, but we suspect network congestion: the affected users tend to have a large number of multi-core machines but no caching proxy assigned. We have a list of four frontier servers that hold the "conditions database", i.e. the geometrical description of the detectors, etc., in the experiment. Each of them has both IPv4 and IPv6 addresses. What the message means is that the job has cycled through all eight possible connection attempts without securing a connection, so it gives up. There are other failure modes too that we've not got a handle on, where the job doesn't secure a "token" to allow it to access the DataBridge storage. Again, we suspect network congestion, but it's harder to get a handle on these modes. They are usually less than 5% of jobs. |
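
For anyone wondering how "four servers, each with IPv4 and IPv6" turns into "eight connection attempts, then give up", here is a rough Python sketch of that kind of failover loop. It is only an illustration, not the actual frontier client code: the function names, the `NoMoreServers` exception, the order in which addresses are tried, and the placeholder addresses (taken from the reserved documentation ranges) are all assumptions made for the example.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Endpoint = Tuple[str, int]


class NoMoreServers(Exception):
    """Hypothetical error raised once every endpoint has been tried and failed."""


def build_endpoints(servers: Iterable[Tuple[str, str]], port: int) -> List[Endpoint]:
    """Flatten (ipv4, ipv6) address pairs into one ordered list of (address, port).

    The IPv4-then-IPv6 ordering is an assumption made for this sketch.
    """
    endpoints: List[Endpoint] = []
    for ipv4, ipv6 in servers:
        endpoints.append((ipv4, port))
        endpoints.append((ipv6, port))
    return endpoints


def tcp_connect(endpoint: Endpoint, timeout: float = 5.0) -> socket.socket:
    """Open a plain TCP connection; raises OSError on any network failure."""
    return socket.create_connection(endpoint, timeout=timeout)


def connect_with_failover(
    endpoints: List[Endpoint],
    connect: Callable[[Endpoint], object] = tcp_connect,
):
    """Try each endpoint once, in order, and return the first connection that works.

    If every attempt fails, raise NoMoreServers carrying the last error seen.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return connect(endpoint)
        except OSError as exc:
            errors.append((endpoint, exc))
    last = errors[-1][1] if errors else None
    raise NoMoreServers(
        f"No more servers/proxies after {len(errors)} attempts. Last error was: {last}"
    )


if __name__ == "__main__":
    # Placeholder addresses from the documentation ranges, not real frontier servers.
    servers = [
        ("192.0.2.1", "2001:db8::1"),
        ("192.0.2.2", "2001:db8::2"),
        ("192.0.2.3", "2001:db8::3"),
        ("192.0.2.4", "2001:db8::4"),
    ]

    def unreachable(endpoint: Endpoint):
        # Simulates a congested or broken network without touching the real one.
        raise OSError(f"network error on connect to {endpoint}: Network is unreachable")

    try:
        connect_with_failover(build_endpoints(servers, 8080), connect=unreachable)
    except NoMoreServers as exc:
        print(exc)
```

The `connect` argument is injectable so that the give-up behaviour can be exercised without a network. With a caching proxy in front, most requests would presumably never reach this loop at all, which fits the observation that the failures cluster on hosts with many cores and no proxy assigned.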
©2026 CERN