Message boards : CMS Application : CMS tasks failing

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,377,047
RAC: 4,747
Message 52483 - Posted: 7 Oct 2025, 14:48:08 UTC - in response to Message 52482.  

For the past few hours, all CMS tasks on all of my hosts have been failing after about half an hour.

Stderr says:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=427929038

Thanks for reporting that. I've noticed that the running jobs were falling off but don't yet see any reason why -- the WMAgent seems to be in good shape so I'm surmising a network error somewhere.
ID: 52483
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1234
Credit: 79,645,603
RAC: 98,738
Message 52486 - Posted: 7 Oct 2025, 16:48:34 UTC

I got about 100 of these, since it happened during the couple of hours when I was asleep:
<core_client_version>8.2.4</core_client_version>
<![CDATA[
<message>
The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified.
(0xd0) - exit code 208 (0xd0)</message>
<stderr_txt>
ID: 52486
Garrulus glandarius

Joined: 5 Apr 25
Posts: 51
Credit: 937,989
RAC: 22,851
Message 52488 - Posted: 7 Oct 2025, 20:33:54 UTC
Last modified: 7 Oct 2025, 20:35:01 UTC

I had 2 tasks fail; luckily I saw the earlier post and blocked further CMS tasks. Both look like:

Exit status 208 (0x000000D0) EXIT_SUB_TASK_FAILURE

and

2025-10-07 21:51:09 (2841802): Guest Log: [INFO] CMS application starting. Check log files.
2025-10-07 22:11:46 (2841802): Guest Log: [ERROR] glidein exited with return value 1.
2025-10-07 22:11:46 (2841802): Guest Log: [DEBUG] Volunteer: Garrulus glandarius (2359357)
2025-10-07 22:11:46 (2841802): Guest Log: [INFO] Shutting Down.
2025-10-07 22:12:15 (2841802): VM Completion File Detected.
2025-10-07 22:12:15 (2841802): VM Completion Message: glidein exited with return value 1.
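For anyone wanting to check their own stderr output for the same pattern, here is a minimal sketch in Python. It assumes the log lines follow the layout shown above (timestamp, process id in parentheses, then the message); the function name `summarize` and the regular expressions are illustrative, not part of BOINC or the CMS app.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
#   YYYY-MM-DD HH:MM:SS (pid): message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): (?P<msg>.*)$"
)
GLIDEIN_RE = re.compile(r"glidein exited with return value (\d+)")


def summarize(stderr_text):
    """Return (seconds from CMS start to glidein exit, glidein return value).

    Returns None if the log does not contain both events.
    """
    started = None
    for raw in stderr_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        msg = m["msg"]
        if "CMS application starting" in msg:
            started = ts
        hit = GLIDEIN_RE.search(msg)
        if hit and started is not None:
            return int((ts - started).total_seconds()), int(hit.group(1))
    return None


SAMPLE = """\
2025-10-07 21:51:09 (2841802): Guest Log: [INFO] CMS application starting. Check log files.
2025-10-07 22:11:46 (2841802): Guest Log: [ERROR] glidein exited with return value 1.
"""

print(summarize(SAMPLE))  # (1237, 1), i.e. a bit over 20 minutes
```

On the two sample lines this gives 1237 seconds between the start message and the glidein exit, which fits the "about 20 minutes" others report in this thread.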

ID: 52488
Aaron

Joined: 5 May 10
Posts: 8
Credit: 5,736,920
RAC: 36,635
Message 52489 - Posted: 8 Oct 2025, 1:59:40 UTC

I just noticed this also. Have over 300 tasks in a row that failed the same way.

Just suspended all tasks that haven't started yet until they fix it.
ID: 52489
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,528,429
RAC: 76,299
Message 52491 - Posted: 8 Oct 2025, 6:21:45 UTC - in response to Message 52483.  

Ivan wrote:
Thanks for reporting that. I've noticed that the running jobs were falling off but don't yet see any reason why -- the WMAgent seems to be in good shape so I'm surmising a network error somewhere.
Good morning, Ivan. Wouldn't it make sense to stop task distribution until the problem is solved?
ID: 52491
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,377,047
RAC: 4,747
Message 52495 - Posted: 8 Oct 2025, 9:59:52 UTC - in response to Message 52491.  

Ivan wrote:
Thanks for reporting that. I've noticed that the running jobs were falling off but don't yet see any reason why -- the WMAgent seems to be in good shape so I'm surmising a network error somewhere.
Good morning, Ivan. Wouldn't it make sense to stop task distribution until the problem is solved?

Hmm, the problem there is that we can't debug if there are no tasks asking for jobs... I've alerted our HTCondor specialist to the problem but haven't heard back today; last night she couldn't see anything obvious. As far as I can tell we must be having some mismatch between the requirements specified by the tasks (i.e. the VM that wants jobs to run) and the requirements of the jobs that condor has available for distribution, but I'm no expert at querying the condor server. I'm now wondering if something has changed in the job submission infrastructure that we haven't been told about.
I'll ask Laurence if he can limit the size of the task queue in the BOINC server until we have an answer.
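To illustrate the kind of mismatch suspected above, here is a toy model in Python of two-sided matchmaking: a machine is only handed a job when the job's requirements accept the machine and the machine's requirements accept the job. This is a sketch under assumptions, not HTCondor code; every attribute name and value (`cpus`, `memory_mb`, `site`, `example-site`) is hypothetical and not taken from the real CMS configuration.

```python
def matches(machine, job):
    """Two-sided match: each side's requirements must accept the other side."""
    return bool(job["requirements"](machine) and machine["requirements"](job))


# A volunteer VM advertising what it offers (hypothetical attributes).
vm = {
    "cpus": 1,
    "memory_mb": 2048,
    "site": "example-site",
    # The VM only accepts jobs that fit into its memory.
    "requirements": lambda job: job["request_memory_mb"] <= 2048,
}

# A job whose requirements the VM satisfies.
job_ok = {
    "request_memory_mb": 2000,
    "requirements": lambda m: m["site"] == "example-site" and m["cpus"] >= 1,
}

# The same job after a hypothetical, unannounced change on the submission side.
job_changed = {
    "request_memory_mb": 2000,
    "requirements": lambda m: m["site"] == "example-site" and m["cpus"] >= 4,
}

print(matches(vm, job_ok))       # True
print(matches(vm, job_changed))  # False: the VM idles although jobs are queued
```

The point of the sketch is only that a small change on either side is enough to leave machines idle while jobs sit in the queue, with neither side reporting an error of its own.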
ID: 52495
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,377,047
RAC: 4,747
Message 52496 - Posted: 8 Oct 2025, 12:21:32 UTC - in response to Message 52495.  

The task queue has been cut back from 200 to 25, but it will take some time for failures to whittle it down.
ID: 52496
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,377,047
RAC: 4,747
Message 52498 - Posted: 9 Oct 2025, 12:17:39 UTC

The current problem seems to be that "new" VMs (i.e. tasks from a volunteer's point of view) are unable to join the HTCondor "pool" of available machines and thus they don't acquire jobs to run. The reason is still unclear...
ID: 52498
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,528,429
RAC: 76,299
Message 52499 - Posted: 9 Oct 2025, 12:21:42 UTC - in response to Message 52498.  
Last modified: 9 Oct 2025, 12:22:59 UTC

The current problem seems to be that "new" VMs (i.e. tasks from a volunteer's point of view) are unable to join the HTCondor "pool" of available machines and thus they don't acquire jobs to run. The reason is still unclear...
Ivan, thanks for the information. So let's keep our fingers crossed that the problem will be solved soon.

P.S. Tasks are still available for download; I think it would make sense to stop this.
ID: 52499
[VENETO] boboviz
Joined: 7 May 08
Posts: 248
Credit: 1,878,216
RAC: 11,578
Message 52504 - Posted: 10 Oct 2025, 12:05:29 UTC

Strange new message after 20-25 minutes of calculation:

<message>
The file name wildcard characters * or ? were entered incorrectly or too many were specified.
(0xd0) - exit code 208 (0xd0)</message>
<stderr_txt>
ID: 52504
rilian
Joined: 12 Jul 08
Posts: 21
Credit: 647,249
RAC: 10,046
Message 52510 - Posted: 10 Oct 2025, 20:45:28 UTC - in response to Message 52496.  

The task queue has been cut back from 200 to 25, but it will take some time for failures to whittle it down.


I'm still getting
[ERROR] glidein exited with return value 1.

Should we keep CMS enabled to help process this queue, or pause it for now? A task runs for about 20 minutes before it fails on my computer.
I crunch for Ukraine
ID: 52510
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,528,429
RAC: 76,299
Message 52519 - Posted: 13 Oct 2025, 11:52:37 UTC

Ivan, any idea when CMS will be working again?
ID: 52519
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2679
Credit: 286,802,489
RAC: 74,001
Message 52520 - Posted: 13 Oct 2025, 13:20:26 UTC - in response to Message 52504.  

Strange new message after 20-25 minutes of calculation:

<message>
The file name wildcard characters * or ? were entered incorrectly or too many were specified.
(0xd0) - exit code 208 (0xd0)</message>
<stderr_txt>

This error text is caused by a mismatch between Windows and BOINC.
BOINC reports #208 as "EXIT_SUB_TASK_FAILURE" while Windows expands it to the "wildcard" text.

Not sure if Windows or BOINC was first to use that error number.
The text can safely be ignored since the error number in connection with BOINC's final status clearly shows what happened.
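The double meaning of the number can be shown with a small lookup in Python. This is only a sketch: the two tables hold just the single entry discussed here, the BOINC name is the one quoted earlier in this thread, and the Windows constant name `ERROR_META_EXPANSION_TOO_LONG` is, to my knowledge, the system error that carries the "wildcard" text. The helper name `describe` is made up for this example.

```python
# One number, two unrelated meanings (single-entry tables for illustration).
BOINC_EXIT_CODES = {208: "EXIT_SUB_TASK_FAILURE"}
WINDOWS_SYSTEM_ERRORS = {208: "ERROR_META_EXPANSION_TOO_LONG"}


def describe(code):
    """Show how the same exit code reads in BOINC and in the Windows error table."""
    boinc = BOINC_EXIT_CODES.get(code, "unknown")
    windows = WINDOWS_SYSTEM_ERRORS.get(code, "unknown")
    return f"exit code {code} (0x{code:02x}): BOINC={boinc}, Windows={windows}"


print(describe(208))
```

The client hands the numeric exit code to the operating system for a human-readable message, so on Windows hosts the unrelated system text is printed next to what is really a BOINC status.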
ID: 52520
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,377,047
RAC: 4,747
Message 52528 - Posted: 14 Oct 2025, 13:31:20 UTC - in response to Message 52519.  

Ivan, any idea when CMS will be working again?

Not sure, might know better after a meeting with Federica and Laurence just now. Principal problem has been corrected and tasks are starting, but they are not being assigned jobs in the condor server. More news soon, I hope.
ID: 52528
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1110
Credit: 9,377,047
RAC: 4,747
Message 52538 - Posted: 15 Oct 2025, 13:20:45 UTC - in response to Message 52528.  

Ivan, any idea when CMS will be working again?

Not sure, might know better after a meeting with Federica and Laurence just now. Principal problem has been corrected and tasks are starting, but they are not being assigned jobs in the condor server. More news soon, I hope.

Well, we fixed one problem (someone decided to clean out a file-store that was filling up, and deleted some files that we still used), so now tasks are joining the pool again, but for some reason condor isn't matching them to available jobs. Debugging continues...
ID: 52538
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,528,429
RAC: 76,299
Message 52539 - Posted: 17 Oct 2025, 6:25:21 UTC - in response to Message 52538.  

... so now tasks are joining the pool again, but for some reason condor isn't matching them to available jobs. Debugging continues...
Good morning, Ivan. What's the status? Debugging still not successful so far?
ID: 52539
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1461
Credit: 9,852,678
RAC: 3,020
Message 52541 - Posted: 17 Oct 2025, 12:49:34 UTC - in response to Message 52539.  

... so now tasks are joining the pool again, but for some reason condor isn't matching them to available jobs. Debugging continues...
Good morning, Ivan. What's the status? Debugging still not successful so far?

VM-tasks are not failing, but do not process real internal jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=429041036
ID: 52541
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,528,429
RAC: 76,299
Message 52542 - Posted: 17 Oct 2025, 13:31:00 UTC - in response to Message 52541.  

VM-tasks are not failing, but do not process real internal jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=429041036
Yes, that's the problem, and I'm afraid it won't be solved until next week :-(
ID: 52542


©2025 CERN