Thread 'CMS tasks failing'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 52779 - Posted: 22 Dec 2025, 15:36:47 UTC - in response to Message 52778.
Hi Fede;
The messages I see are:
2025-12-22 15:01:44 (2296): Guest Log: [INFO] Environment HTTP proxy: not set
2025-12-22 15:01:44 (2296): Guest Log: [INFO] Reading volunteer information
2025-12-22 15:01:49 (2296): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2025-12-22 15:01:50 (2296): Guest Log: [INFO] Requesting an idtoken from LHC@home
2025-12-22 15:01:51 (2296): Guest Log: [INFO] CMS application starting. Check log files.
2025-12-22 15:28:18 (2296): Guest Log: [INFO] glidein exited with return value 0.
2025-12-22 15:28:18 (2296): Guest Log: [INFO] Shutting Down.
2025-12-22 15:28:18 (2296): VM Completion File Detected.
2025-12-22 15:28:18 (2296): VM Completion Message: glidein exited with return value 0.
so the problem seems to be that the job doesn't start running in the task VM, and times out. They aren't showing up as failures in grafana, nor in WMStats as far as I can see.
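
A minimal sketch, assuming the log format shown in the post above, of how the idle period between "CMS application starting" and "glidein exited" could be measured from such a log. The function names `parse_ts` and `idle_time` are illustrative only and are not part of BOINC or the CMS application.

```python
from datetime import datetime

# Two lines copied from the log quoted in the post above.
LOG = """\
2025-12-22 15:01:51 (2296): Guest Log: [INFO] CMS application starting. Check log files.
2025-12-22 15:28:18 (2296): Guest Log: [INFO] glidein exited with return value 0.
"""


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def idle_time(log_text):
    """Return the time between application start and glidein exit, or None."""
    start = end = None
    for line in log_text.splitlines():
        if "CMS application starting" in line:
            start = parse_ts(line)
        elif "glidein exited" in line:
            end = parse_ts(line)
    if start is None or end is None:
        return None
    return end - start


elapsed = idle_time(LOG)
print(elapsed)  # 0:26:27
```

For the log above this gives a little under 27 minutes between start and exit, which fits the description of a VM that waits for a job and then gives up.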

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1291 Credit: 95,276,708 RAC: 34,055	Message 52780 - Posted: 22 Dec 2025, 16:03:13 UTC - in response to Message 52778.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431151623
Several hundred of these starting around 21 Dec 2025, 18:21:32 UTC

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,360,666 RAC: 45,407	Message 52785 - Posted: 23 Dec 2025, 13:38:33 UTC
Ivan - any hope that CMS will be back before the holiday season starts?

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 52786 - Posted: 23 Dec 2025, 14:42:44 UTC - in response to Message 52785.
Quote (Message 52785): "Ivan - any hope that CMS will be back before the holiday season starts?"
I doubt it, most people are on holiday. Neither Federica nor I have access to the condor servers at the moment, to pore through logs, and efforts to contact those who might be able to help have so far been fruitless. Sorry 'bout that!

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,360,666 RAC: 45,407	Message 52787 - Posted: 23 Dec 2025, 17:48:41 UTC - in response to Message 52786.
Ivan, thank you for your continued efforts anyway. What would make sense though is to stop distribution of tasks - are you able to do this?

[VENETO] boboviz Joined: 7 May 08 Posts: 273 Credit: 2,131,245 RAC: 252	Message 52788 - Posted: 24 Dec 2025, 8:17:59 UTC - in response to Message 52777.
Quote (Message 52777): "Well, thanks everyone for your patience this year. It's been frustrating for me, being forced into retirement and gradually losing my accounts and access while other things crumble as well. I'm not particularly looking forward to next year, but I'll try to soldier on for a while."
Thank you for all the time and the effort for this project!!

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,360,666 RAC: 45,407	Message 52789 - Posted: 24 Dec 2025, 8:38:52 UTC - in response to Message 52788.
Quote (Message 52788): "Thank you for all the time and the effort for this project!!"
+1

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 52798 - Posted: 25 Dec 2025, 16:44:50 UTC - in response to Message 52786.
Quote (Message 52785): "Ivan - any hope that CMS will be back before the holiday season starts?"
Quote (Message 52786): "I doubt it, most people are on holiday. Neither Federica nor I have access to the condor servers at the moment, to pore through logs, and efforts to contact those who might be able to help have so far been fruitless. Sorry 'bout that!"
Well, one of the chaps from the CMS Submissions Infrastructure group managed to manually run one of our jobs yesterday (you can see it on the jobs graph if you adjust the time range appropriately :-), so I'm back to my original hypothesis that there has been a change in the permissions chain that we haven't yet picked up.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1291 Credit: 95,276,708 RAC: 34,055	Message 52799 - Posted: 25 Dec 2025, 22:17:27 UTC

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 52818 - Posted: 5 Jan 2026, 16:39:17 UTC - in response to Message 52799.
Wake up! I was right, an IDtoken had expired, but we had to wait for people to get back from holidays to identify this and ask for a new one to be generated. The old one had a lifetime of 1 year, so memories were dim... I'll set an alarm in my Google calendar so that at least I'll be reminded of the need to renew it next New Year!
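
A minimal sketch of an automated alternative to the calendar reminder, assuming the IDtoken is a JWT-style token whose payload carries a standard `exp` (expiry) claim. The helper names, the 30-day warning window, and the demo token are all hypothetical; the real token's contents and expiry date are not given in this thread. The signature is not verified here, only the expiry is read.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def _b64url_decode(segment):
    """Decode a base64url segment, restoring any stripped '=' padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def token_expiry(token):
    """Read the 'exp' claim from the payload (second dot-separated part)."""
    payload = json.loads(_b64url_decode(token.split(".")[1]))
    return datetime.fromtimestamp(payload["exp"], tz=timezone.utc)


def needs_renewal(token, now, warn_days=30):
    """True if the token has expired or expires within warn_days of now."""
    return token_expiry(token) - now <= timedelta(days=warn_days)


def make_demo_token(exp):
    """Build an unsigned demo token (hypothetical, for illustration only)."""
    def enc(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return ".".join([enc({"alg": "none"}), enc({"exp": int(exp.timestamp())}), ""])


# Hypothetical expiry date, chosen only to exercise the check.
demo = make_demo_token(datetime(2026, 1, 1, tzinfo=timezone.utc))
print(needs_renewal(demo, datetime(2025, 6, 1, tzinfo=timezone.utc)))
print(needs_renewal(demo, datetime(2025, 12, 15, tzinfo=timezone.utc)))
```

Run periodically, a check like this would flag the token in the weeks before it lapses rather than relying on someone remembering a one-year lifetime.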

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1533 Credit: 10,042,485 RAC: 1,074	Message 52821 - Posted: 5 Jan 2026, 17:31:30 UTC - in response to Message 52818. Last modified: 5 Jan 2026, 17:38:42 UTC
Quote (Message 52818): "Wake up! I was right, an IDtoken had expired, but we had to wait for people to get back from holidays to identify this and ask for a new one to be generated. The old one had a lifetime of 1 year, so memories were dim... I'll set an alarm in my Google calendar so that at least I'll be reminded of the need to renew it next New Year!"
Thanks Ivan and Happy New Year.
4 cmsExternalGenerators extGen870+ running!
Grafana shows 445 jobs running.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1291 Credit: 95,276,708 RAC: 34,055	Message 52823 - Posted: 6 Jan 2026, 20:23:11 UTC
Thanks Ivan, and good thing my computers work in their sleep.
Mine started doing actual work again at 5 Jan 2026, 18:25:45 UTC.
Around 120 Valids since that time.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1291 Credit: 95,276,708 RAC: 34,055	Message 52824 - Posted: 7 Jan 2026, 0:12:39 UTC
I just checked over at -dev and those CMS are still not doing actual work

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1291 Credit: 95,276,708 RAC: 34,055	Message 52852 - Posted: 18 Jan 2026, 23:52:00 UTC Last modified: 18 Jan 2026, 23:55:53 UTC
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914
https://lhcathome.cern.ch/lhcathome/result.php?resultid=431752361
I see hundreds crashing over and over now, while some on a few other hosts are not.
It looks like the failed ones are all on Windows hosts.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1291 Credit: 95,276,708 RAC: 34,055	Message 52853 - Posted: 19 Jan 2026, 4:54:26 UTC
<message> The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified. (0xd0) - exit code 208 (0xd0)</message>
Unbelievable, are you trying to promote Linux again?
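
A small sketch of how the two numbers in that message relate: 208 decimal and 0xd0 hexadecimal are the same value, and the message text matches the Windows system error 208, named ERROR_META_EXPANSION_TOO_LONG in the Win32 headers. The lookup table below is illustrative and holds only that one code; the function name is hypothetical, and the sketch says nothing about what actually triggered the error inside the task.

```python
# Only the one code quoted in the thread is listed; this is not a full table.
KNOWN_WIN32_ERRORS = {
    208: "ERROR_META_EXPANSION_TOO_LONG",
}


def describe_exit_code(code):
    """Format an exit code in decimal and hex, with a name when known."""
    name = KNOWN_WIN32_ERRORS.get(code, "unknown")
    return f"exit code {code} ({code:#x}): {name}"


print(describe_exit_code(208))
print(int("0xd0", 16))
```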

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,360,666 RAC: 45,407	Message 52854 - Posted: 19 Jan 2026, 8:06:46 UTC - in response to Message 52852.
Quote (Message 52852): "https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914 https://lhcathome.cern.ch/lhcathome/result.php?resultid=431752361 I see hundreds crashing over and over now, while some on a few other hosts are not. It looks like the failed ones are all on Windows hosts."
I had several tasks failing last night, on some of my Windows hosts, but a few hours later everything was back to normal.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1533 Credit: 10,042,485 RAC: 1,074	Message 52856 - Posted: 19 Jan 2026, 13:43:55 UTC - in response to Message 52854.
Quote (Message 52854): "I had several tasks failing last night, on some of my Windows hosts, but a few hours later everything was back to normal."
Almost normal. For me the first job started only after an init phase of more than 30 minutes.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1291 Credit: 95,276,708 RAC: 34,055	Message 52858 - Posted: 19 Jan 2026, 19:06:09 UTC
I'm still getting these:
<message> The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified. (0xd0) - exit code 208 (0xd0)</message>
Back to Theory.
Volunteer Mad Scientist For Life
Unbelievable, are you trying to promote Linux again?

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,360,666 RAC: 45,407	Message 52859 - Posted: 19 Jan 2026, 19:19:52 UTC - in response to Message 52858.
Here, after many such tasks last night, today I had just one - downloaded at 19 Jan 2026, 12:42:39 UTC, started about half an hour later, and crashed after some 3 minutes.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 657	Message 52860 - Posted: 20 Jan 2026, 14:58:54 UTC - in response to Message 52852.
Quote (Message 52852): "https://lhcathome.cern.ch/lhcathome/result.php?resultid=431749914 https://lhcathome.cern.ch/lhcathome/result.php?resultid=431752361 I see hundreds crashing over and over now, while some on a few other hosts are not. It looks like the failed ones are all on Windows hosts."
The most prominent failure mode I see at the moment is this:
Fatal Exception (Exit Code: 8002)
An exception of category 'StdException' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:1170]: No more servers/proxies. Last error was: Request 223 on chan 1 failed at Tue Jan 20 15:16:12 2026: -9 [fn-socket.c:107]: network error on connect to [2606:4700:3034::ac43:9e0b]:8080: Network is unreachable) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )
These come mainly from just a few volunteers. We're not sure, but we suspect network congestion: the users tend to have a large number of multi-core machines but no caching proxy assigned.
We have a list of four frontier servers that hold the "conditions database", i.e. the geometrical description of the detectors, etc., in the experiment. Each of them has both IPv4 and IPv6 addresses. What the message means is that the job has cycled through all eight possible connection attempts without securing a connection, so it gives up.
There are other failure modes too, which we've not got a handle on, where the job doesn't secure a "token" to allow it to access the DataBridge storage. Again, we suspect network congestion, but these modes are harder to pin down. They are usually less than 5% of jobs.
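
A minimal sketch of the failover behaviour described in the post above: four servers, each tried over IPv4 and IPv6, giving eight attempts before the job gives up. This is not the frontier client's actual code. The server names, the order of attempts, and the `connect` callback are assumptions made for illustration.

```python
from itertools import product

# Hypothetical host names standing in for the four frontier servers.
SERVERS = [
    "frontier1.example.org",
    "frontier2.example.org",
    "frontier3.example.org",
    "frontier4.example.org",
]
FAMILIES = ["IPv4", "IPv6"]


def first_reachable(servers, families, connect):
    """Try every (server, family) pair in turn and return the first success.

    `connect` is any callable that returns a connection or raises OSError.
    If every attempt fails, raise ConnectionError quoting the last error.
    """
    errors = []
    for server, family in product(servers, families):
        try:
            return connect(server, family)
        except OSError as exc:
            errors.append((server, family, str(exc)))
    last = errors[-1][2] if errors else "no candidates"
    raise ConnectionError(
        f"No more servers/proxies after {len(errors)} attempts. "
        f"Last error was: {last}"
    )


def always_down(server, family):
    """Stand-in for a congested network: every attempt fails."""
    raise OSError("Network is unreachable")


try:
    first_reachable(SERVERS, FAMILIES, always_down)
except ConnectionError as exc:
    print(exc)
```

With a real connection function in place of `always_down`, the loop would stop at the first server that answers, so a single reachable address out of the eight is enough for the job to proceed.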