Message boards : CMS Application : CMS tasks failing

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1134
Credit: 11,580,762
RAC: 14,248
Message 52779 - Posted: 22 Dec 2025, 15:36:47 UTC - in response to Message 52778.  

Hi Fede;
The messages I see are:
2025-12-22 15:01:44 (2296): Guest Log: [INFO] Environment HTTP proxy: not set
2025-12-22 15:01:44 (2296): Guest Log: [INFO] Reading volunteer information
2025-12-22 15:01:49 (2296): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2025-12-22 15:01:50 (2296): Guest Log: [INFO] Requesting an idtoken from LHC@home
2025-12-22 15:01:51 (2296): Guest Log: [INFO] CMS application starting. Check log files.
2025-12-22 15:28:18 (2296): Guest Log: [INFO] glidein exited with return value 0.
2025-12-22 15:28:18 (2296): Guest Log: [INFO] Shutting Down.
2025-12-22 15:28:18 (2296): VM Completion File Detected.
2025-12-22 15:28:18 (2296): VM Completion Message: glidein exited with return value 0.
So the problem seems to be that the job never starts running in the task VM, and the task times out. These aren't showing up as failures in Grafana, nor in WMStats, as far as I can see.
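To see how long the VM waited before giving up, the two timestamps in the log above can be subtracted. This is a minimal Python sketch, assuming the log lines keep the format quoted above; the function names are made up for illustration.

```python
from datetime import datetime

# Two lines copied from the guest log quoted above.
LOG = """\
2025-12-22 15:01:51 (2296): Guest Log: [INFO] CMS application starting. Check log files.
2025-12-22 15:28:18 (2296): Guest Log: [INFO] glidein exited with return value 0.
"""


def timestamp_of(log_text, marker):
    """Return the timestamp of the first log line containing `marker`."""
    for line in log_text.splitlines():
        if marker in line:
            # The first 19 characters are 'YYYY-MM-DD HH:MM:SS'.
            return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    raise ValueError(f"marker not found: {marker!r}")


def glidein_lifetime(log_text):
    """Time between the application start and the glidein exit."""
    start = timestamp_of(log_text, "CMS application starting")
    end = timestamp_of(log_text, "glidein exited")
    return end - start


print(glidein_lifetime(LOG))  # 0:26:27
```

A task that ends after roughly this long with "return value 0" and no job output would fit the "never started, then timed out" pattern described above.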
ID: 52779
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1261
Credit: 92,121,184
RAC: 109,808
Message 52780 - Posted: 22 Dec 2025, 16:03:13 UTC - in response to Message 52778.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=431151623
Several hundred of these, starting around 21 Dec 2025, 18:21:32 UTC.
ID: 52780
Erich56

Joined: 18 Dec 15
Posts: 1941
Credit: 156,028,503
RAC: 107,246
Message 52785 - Posted: 23 Dec 2025, 13:38:33 UTC

Ivan - any hope that CMS will be back before the holiday season starts?
ID: 52785
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1134
Credit: 11,580,762
RAC: 14,248
Message 52786 - Posted: 23 Dec 2025, 14:42:44 UTC - in response to Message 52785.  

Ivan - any hope that CMS will be back before the holiday season starts?

I doubt it; most people are on holiday. Neither Federica nor I have access to the condor servers at the moment to pore through logs, and efforts to contact those who might be able to help have so far been fruitless.
Sorry 'bout that!
ID: 52786
Erich56

Joined: 18 Dec 15
Posts: 1941
Credit: 156,028,503
RAC: 107,246
Message 52787 - Posted: 23 Dec 2025, 17:48:41 UTC - in response to Message 52786.  

Ivan, thank you for your continued efforts anyway. What would make sense though is to stop distribution of tasks - are you able to do this?
ID: 52787
[VENETO] boboviz
Joined: 7 May 08
Posts: 266
Credit: 2,118,209
RAC: 2,007
Message 52788 - Posted: 24 Dec 2025, 8:17:59 UTC - in response to Message 52777.  

Well, thanks everyone for your patience this year. It's been frustrating for me, being forced into retirement and gradually losing my accounts and access while other things crumble as well. I'm not particularly looking forward to next year, but I'll try to soldier on for a while.


Thank you for all the time and the effort for this project!!
ID: 52788
Erich56

Joined: 18 Dec 15
Posts: 1941
Credit: 156,028,503
RAC: 107,246
Message 52789 - Posted: 24 Dec 2025, 8:38:52 UTC - in response to Message 52788.  

Thank you for all the time and the effort for this project!!
+1
ID: 52789
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1134
Credit: 11,580,762
RAC: 14,248
Message 52798 - Posted: 25 Dec 2025, 16:44:50 UTC - in response to Message 52786.  

Ivan - any hope that CMS will be back before the holiday season starts?

I doubt it; most people are on holiday. Neither Federica nor I have access to the condor servers at the moment to pore through logs, and efforts to contact those who might be able to help have so far been fruitless.
Sorry 'bout that!

Well, one of the chaps from the CMS Submissions Infrastructure group managed to manually run one of our jobs yesterday (you can see it on the jobs graph if you adjust the time range appropriately :-), so I'm back to my original hypothesis that there has been a change in the permissions chain that we haven't yet picked up.
ID: 52798
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1261
Credit: 92,121,184
RAC: 109,808
Message 52799 - Posted: 25 Dec 2025, 22:17:27 UTC

ID: 52799
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1134
Credit: 11,580,762
RAC: 14,248
Message 52818 - Posted: 5 Jan 2026, 16:39:17 UTC - in response to Message 52799.  


Wake up!

I was right: an IDtoken had expired, but we had to wait for people to get back from holidays to identify this and ask for a new one to be generated. The old one had a lifetime of 1 year, so memories were dim... I'll set an alarm in my Google calendar so that at least I'll be reminded of the need to renew it next New Year!
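A calendar alarm could be backed up by a small script. The sketch below assumes the IDtoken is a standard JSON Web Token carrying an `exp` claim, which this thread does not confirm; the 30-day warning window, the function names, and the demo token are all made up for illustration.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def token_expiry(token):
    """Decode a JWT payload (without verifying the signature) and return its expiry."""
    payload_b64 = token.split(".")[1]
    # base64url payloads omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def needs_renewal(token, now, warn_before=timedelta(days=30)):
    """True once the token is within `warn_before` of expiring (or already expired)."""
    return token_expiry(token) - now <= warn_before


def make_demo_token(exp):
    """Build an unsigned, made-up token for demonstration only."""
    def b64(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return ".".join([b64({"alg": "none"}), b64({"exp": int(exp.timestamp())}), ""])


# Hypothetical expiry date, chosen only to exercise the check.
demo = make_demo_token(datetime(2026, 12, 21, tzinfo=timezone.utc))
print(needs_renewal(demo, now=datetime(2026, 6, 1, tzinfo=timezone.utc)))   # False
print(needs_renewal(demo, now=datetime(2026, 12, 1, tzinfo=timezone.utc)))  # True
```

Run from a daily cron job, a check like this would raise the flag a month ahead, rather than relying on someone remembering over the holidays.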
ID: 52818
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1491
Credit: 9,985,249
RAC: 968
Message 52821 - Posted: 5 Jan 2026, 17:31:30 UTC - in response to Message 52818.  
Last modified: 5 Jan 2026, 17:38:42 UTC


Wake up!

I was right: an IDtoken had expired, but we had to wait for people to get back from holidays to identify this and ask for a new one to be generated. The old one had a lifetime of 1 year, so memories were dim... I'll set an alarm in my Google calendar so that at least I'll be reminded of the need to renew it next New Year!

Thanks Ivan and Happy New Year.
4 cmsExternalGenerators extGen870+ running!
Grafana shows 445 jobs running.
ID: 52821
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1261
Credit: 92,121,184
RAC: 109,808
Message 52823 - Posted: 6 Jan 2026, 20:23:11 UTC

Thanks Ivan, and it's a good thing my computers work in their sleep.
Mine started doing actual work again at 5 Jan 2026, 18:25:45 UTC.

Around 120 Valids since that time.
ID: 52823
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1261
Credit: 92,121,184
RAC: 109,808
Message 52824 - Posted: 7 Jan 2026, 0:12:39 UTC

I just checked over at -dev, and the CMS tasks there are still not doing actual work.
ID: 52824
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1261
Credit: 92,121,184
RAC: 109,808
Message 52852 - Posted: 18 Jan 2026, 23:52:00 UTC
Last modified: 18 Jan 2026, 23:55:53 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431752361

I see hundreds crashing over and over now, though some on a few other hosts are not.
It looks like the failed ones are all on Windows hosts.
ID: 52852
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1261
Credit: 92,121,184
RAC: 109,808
Message 52853 - Posted: 19 Jan 2026, 4:54:26 UTC

<message>
The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified.
(0xd0) - exit code 208 (0xd0)</message>
Unbelievable. Are you trying to promote Linux again?
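When sorting through many failed results, it can help to pull the numeric exit code out of the `<message>` text and group by it. This is a minimal sketch, assuming the message keeps the wording quoted above; the function name is made up. The descriptive sentence is the text Windows gives for system error 208, so it may simply be a generic rendering of the number rather than a description of what went wrong inside the VM; that reading is an assumption, not something confirmed in this thread.

```python
import re

# Message text copied from the failed result quoted above.
MESSAGE = """\
The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified.
(0xd0) - exit code 208 (0xd0)"""


def exit_code(message):
    """Extract the decimal exit code and check it against the hex form."""
    match = re.search(r"exit code (\d+) \((0x[0-9a-fA-F]+)\)", message)
    if match is None:
        raise ValueError("no exit code found")
    decimal = int(match.group(1))
    hexadecimal = int(match.group(2), 16)
    if decimal != hexadecimal:
        raise ValueError("decimal and hex exit codes disagree")
    return decimal


print(exit_code(MESSAGE))  # 208
```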
ID: 52853
Erich56

Joined: 18 Dec 15
Posts: 1941
Credit: 156,028,503
RAC: 107,246
Message 52854 - Posted: 19 Jan 2026, 8:06:46 UTC - in response to Message 52852.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431752361

I see hundreds crashing over and over now, though some on a few other hosts are not.
It looks like the failed ones are all on Windows hosts.
I had several tasks failing last night, on some of my Windows hosts, but a few hours later everything was back to normal.
ID: 52854
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1491
Credit: 9,985,249
RAC: 968
Message 52856 - Posted: 19 Jan 2026, 13:43:55 UTC - in response to Message 52854.  

I had several tasks failing last night, on some of my Windows hosts, but a few hours later everything was back to normal.
Almost normal. For me, the first job only started after an init phase of more than 30 minutes.
ID: 52856
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1261
Credit: 92,121,184
RAC: 109,808
Message 52858 - Posted: 19 Jan 2026, 19:06:09 UTC

I'm still getting these:
<message>
The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified.
(0xd0) - exit code 208 (0xd0)</message>

Back to Theory
Volunteer Mad Scientist For Life

Unbelievable. Are you trying to promote Linux again?
ID: 52858
Erich56

Joined: 18 Dec 15
Posts: 1941
Credit: 156,028,503
RAC: 107,246
Message 52859 - Posted: 19 Jan 2026, 19:19:52 UTC - in response to Message 52858.  

Here, after many such tasks last night, today I had just one - downloaded at 19 Jan 2026, 12:42:39 UTC, started about half an hour later, and crashed after some 3 minutes.
ID: 52859
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1134
Credit: 11,580,762
RAC: 14,248
Message 52860 - Posted: 20 Jan 2026, 14:58:54 UTC - in response to Message 52852.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431752361

I see hundreds crashing over and over now, though some on a few other hosts are not.
It looks like the failed ones are all on Windows hosts.


The most prominent failure mode I see at the moment is this:
Fatal Exception (Exit Code: 8002)
An exception of category 'StdException' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:1170]: No more servers/proxies. Last error was: Request 223 on chan 1 failed at Tue Jan 20 15:16:12 2026: -9 [fn-socket.c:107]: network error on connect to [2606:4700:3034::ac43:9e0b]:8080: Network is unreachable) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )

These come mainly from just a few volunteers. We're not sure, but we suspect network congestion: the users affected tend to have a large number of multi-core machines but no caching proxy assigned. We have a list of four frontier servers that hold the "conditions database", i.e. the geometrical description of the detectors, etc., in the experiment. Each of them has both an IPv4 and an IPv6 address. What the message means is that the job has cycled through all eight possible connection attempts (four servers, two addresses each) without securing a connection, so it gives up.
There are other failure modes too that we've not got a handle on, where the job doesn't secure a "token" to allow it to access the DataBridge storage. Again, we suspect network congestion, but these modes are harder to pin down. They usually account for less than 5% of jobs.
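The frontier failover described above (four servers, each reachable over IPv4 and IPv6, eight attempts in total) can be sketched roughly as follows. This is only an illustration in Python, not the actual frontier client code: the server names are placeholders, the order in which addresses are tried is an assumption, and `connect` stands in for whatever really opens the connection.

```python
from itertools import product

# Placeholder names; the real frontier server list is not given in this thread.
SERVERS = [
    "frontier1.example.org",
    "frontier2.example.org",
    "frontier3.example.org",
    "frontier4.example.org",
]
FAMILIES = ["IPv4", "IPv6"]


class NoMoreServers(Exception):
    """Raised when every server/address combination has failed."""


def query_with_failover(connect, servers=SERVERS, families=FAMILIES):
    """Try each (server, address family) pair in turn; return the first success."""
    errors = []
    for server, family in product(servers, families):
        try:
            return connect(server, family)
        except OSError as exc:
            # Remember the failure and move on to the next candidate.
            errors.append((server, family, exc))
    last = errors[-1][2] if errors else None
    raise NoMoreServers(
        f"No more servers/proxies after {len(errors)} attempts. "
        f"Last error was: {last}"
    )


def unreachable(server, family):
    """Stand-in connection function that always fails, as on a congested link."""
    raise OSError("Network is unreachable")


try:
    query_with_failover(unreachable)
except NoMoreServers as exc:
    print(exc)
```

With a local caching proxy in front, most requests would be answered before this loop ever has to leave the volunteer's network, which is why hosts without one are the suspected source of these failures.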
ID: 52860