Message boards : CMS Application : CMS tasks failing
Joined: 29 Aug 05 Posts: 1110 Credit: 9,381,594 RAC: 5,065
"for the past few hours, all CMS tasks on all of my hosts are failing after about 1/2 hour."
Thanks for reporting that. I've noticed that the running jobs were falling off but don't yet see any reason why -- the WMAgent seems to be in good shape, so I'm surmising a network error somewhere.
Joined: 24 Oct 04 Posts: 1234 Credit: 79,650,549 RAC: 96,588
I got about 100 of these, since it was during the couple of hours when I was asleep:
<core_client_version>8.2.4</core_client_version>
<![CDATA[
<message>
The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified.
 (0xd0) - exit code 208 (0xd0)</message>
<stderr_txt>
Joined: 5 Apr 25 Posts: 51 Credit: 945,151 RAC: 22,944
I had 2 tasks fail; luckily I saw the earlier post and blocked further CMS tasks. Both look like:
Exit status 208 (0x000000D0) EXIT_SUB_TASK_FAILURE
and
2025-10-07 21:51:09 (2841802): Guest Log: [INFO] CMS application starting. Check log files.
Joined: 5 May 10 Posts: 8 Credit: 5,738,284 RAC: 36,646
I just noticed this also. Have over 300 tasks in a row that failed the same way. Just suspended all tasks that haven't started yet until they fix it.
Joined: 18 Dec 15 Posts: 1908 Credit: 144,550,824 RAC: 76,453
Ivan wrote: Thanks for reporting that. I've noticed that the running jobs were falling off but don't yet see any reason why -- the WMAgent seems to be in good shape so I'm surmising a network error somewhere.
Good morning, Ivan - wouldn't it make sense to stop task distribution until the problem gets solved?
Joined: 29 Aug 05 Posts: 1110 Credit: 9,381,594 RAC: 5,065
Hmm, the problem there is that we can't debug if there are no tasks asking for jobs... I've alerted our HTCondor specialist to the problem but haven't heard back today; last night she couldn't see anything obvious. As far as I can tell we must be having some mismatch between the requirements specified by the tasks (i.e. the VM that wants jobs to run) and the requirements of the jobs that condor has available for distribution, but I'm no expert at querying the condor server. I'm now wondering if something has changed in the job submission infrastructure that we haven't been told about. I'll ask Laurence if he can limit the size of the task queue in the BOINC server until we have an answer.
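As a purely illustrative sketch of what querying for such a requirements mismatch could look like, the HTCondor Python bindings can list the ClassAds on both sides. This assumes recent htcondor bindings and a machine that can reach the pool's collector and schedd via its condor_config; it is not the project's actual debugging procedure.

```python
# Sketch only: dump the Requirements of the advertised slots (the volunteer
# VMs / glideins) and of the idle jobs, so a human can spot an obvious mismatch.
import htcondor

collector = htcondor.Collector()   # pool collector from the local condor_config
schedd = htcondor.Schedd()         # schedd holding the CMS jobs (assumed local)

# Slots advertised by the volunteer VMs (startd ClassAds)
slots = collector.query(
    htcondor.AdTypes.Startd,
    projection=["Name", "Cpus", "Memory", "Requirements"],
)

# Jobs still waiting for a match (JobStatus == 1 means Idle)
idle_jobs = schedd.query(
    constraint="JobStatus == 1",
    projection=["ClusterId", "ProcId", "RequestCpus", "RequestMemory", "Requirements"],
)

for ad in slots[:5]:
    print(f"SLOT {ad.get('Name')} | {ad.get('Requirements')}")
for ad in idle_jobs[:5]:
    print(f"JOB  {ad.get('ClusterId')}.{ad.get('ProcId')} | {ad.get('Requirements')}")
```

On the command line, `condor_q -better-analyze <cluster>.<proc>` similarly asks the schedd to explain why an idle job does or does not match the advertised slots.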
Joined: 29 Aug 05 Posts: 1110 Credit: 9,381,594 RAC: 5,065
The current problem seems to be that "new" VMs (i.e. tasks from a volunteer's point of view) are unable to join the HTCondor "pool" of available machines and thus they don't acquire jobs to run. The reason is still unclear...
Joined: 29 Aug 05 Posts: 1110 Credit: 9,381,594 RAC: 5,065
The task queue has been cut back to 25 instead of 200, but it will take some time for failures to whittle it down.
Joined: 18 Dec 15 Posts: 1908 Credit: 144,550,824 RAC: 76,453
Ivan wrote: The current problem seems to be that "new" VMs (i.e. tasks from a volunteer's point of view) are unable to join the HTCondor "pool" of available machines and thus they don't acquire jobs to run. The reason is still unclear...
Ivan, thanks for the information. So let's keep our fingers crossed that the problem will be solved soon.
P.S. Tasks are still available for download; I think it would make sense to stop this.
Joined: 7 May 08 Posts: 248 Credit: 1,879,226 RAC: 11,328
Strange new message after 20/25 minutes of calculation: <message> |
Joined: 12 Jul 08 Posts: 21 Credit: 647,310 RAC: 9,406
"The task queue has been cut back to 25 instead of 200, but it will take some time for failures to whittle it down."
I'm still getting [ERROR] glidein exited with return value 1. Should we keep CMS enabled to help process this queue, or pause it for now? It takes about 20 minutes before a task fails on my computer.
I crunch for Ukraine
Joined: 18 Dec 15 Posts: 1908 Credit: 144,550,824 RAC: 76,453
Ivan, any idea when CMS will be working again?
Joined: 15 Jun 08 Posts: 2679 Credit: 286,806,599 RAC: 72,396
"Strange new message after 20/25 minutes of calculation:"
This error text is caused by a mismatch between Windows and BOINC. BOINC reports exit code 208 as "EXIT_SUB_TASK_FAILURE", while Windows expands the same number into the "wildcard" text. Not sure whether Windows or BOINC was first to use that error number. The text can safely be ignored, since the error number together with BOINC's final status clearly shows what happened.
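A minimal sketch of that clash, assuming a Windows machine with Python; the BOINC constant value is the one shown in the failed-task logs above, and the Windows text is what FormatMessage returns for system error 208:

```python
# Sketch: the same number 208 carries two unrelated meanings.
# BOINC treats 208 as the exit status EXIT_SUB_TASK_FAILURE (as seen in the
# task logs above), while Windows expands system error 208 into the
# "global filename characters, * or ?" wildcard message.
import ctypes  # ctypes.FormatError is available on Windows only

EXIT_SUB_TASK_FAILURE = 208  # BOINC exit status reported for the failed tasks

if __name__ == "__main__":
    print(f"BOINC meaning of {EXIT_SUB_TASK_FAILURE}: EXIT_SUB_TASK_FAILURE")
    print(f"Windows meaning of 208: {ctypes.FormatError(208)}")
```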
Joined: 29 Aug 05 Posts: 1110 Credit: 9,381,594 RAC: 5,065
"Ivan, any idea when CMS will be working again?"
Well, we fixed one problem (someone decided to clean out a file-store that was filling up, and deleted some files that we still used), so now tasks are joining the pool again, but for some reason condor isn't matching them to available jobs. Debugging continues...
Joined: 18 Dec 15 Posts: 1908 Credit: 144,550,824 RAC: 76,453
"... so now tasks are joining the pool again, but for some reason condor isn't matching them to available jobs. Debugging continues..."
Good morning Ivan, what's the status? Is the debugging still unsuccessful?
Joined: 14 Jan 10 Posts: 1461 Credit: 9,852,993 RAC: 3,041
"... so now tasks are joining the pool again, but for some reason condor isn't matching them to available jobs. Debugging continues..."
"Good morning Ivan, what's the status? Is the debugging still unsuccessful?"
The VM tasks are not failing, but they do not process real internal jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=429041036
Joined: 18 Dec 15 Posts: 1908 Credit: 144,550,824 RAC: 76,453
"VM tasks are not failing, but they do not process real internal jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=429041036"
Yes - that's the problem; and I am afraid that it won't be solved until next week :-(