tasks now running unusual long time without CPU usage
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,550,002 RAC: 43,533 |
Has there been any change in the CMS tasks recently? The reason I'm asking: since this morning, I've noticed that the tasks on all of my machines are running considerably longer than usual (so far more than double the usual time, and still not finished), with almost no CPU usage. |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 7,896,399 RAC: 12,060 |
"Has there been any change in the CMS tasks recently?" We have some changes being implemented, but we're not sure yet whether they will have any effect. The number of running jobs has fallen over the last three hours, so we are currently looking to see if there is any obvious problem. |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,510,310 RAC: 2,545 |
"We have some changes being implemented, but we're not sure if they will have any affect yet. The number of running jobs has fallen over the last three hours, so we are currently looking to see if there is any obvious problem." Still running normally and uninterrupted here. Last job I got (on dev-system, but same pool, I suppose) 2024-02-06 11:20:57 UTC: fanzago_TC_SLC7_FF_CMS_Home_240126_155045_1909 |
Send message Joined: 2 May 07 Posts: 2245 Credit: 174,006,243 RAC: 8,727 |
I have one task from 8:30 UTC that is also running without interruption. |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,550,002 RAC: 43,533 |
just taking a closer look at one of my faster systems (CPU Intel i9-10900KF running at 4.6GHz): tasks which previously finished after 2 1/2 to 3 hours have now been running for more than 6 hours, with CPU usage between 0.3% and 0.7%, and no idea when they will finish. Seems strange ... |
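The symptom reported here (long wall-clock time with almost no CPU usage) can be checked mechanically. The following is a minimal Python sketch, assuming you already have each task's elapsed seconds and CPU seconds, for example read off the task properties in BOINC Manager. The function name and the thresholds (30 minutes, 5% CPU) are illustrative assumptions, not values from the project.

```python
def is_stalled(elapsed_s: float, cpu_s: float,
               min_elapsed_s: float = 1800.0,
               max_cpu_fraction: float = 0.05) -> bool:
    """Return True if a task has run long enough to judge and used almost no CPU.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if elapsed_s < min_elapsed_s:
        return False  # too early to tell
    return (cpu_s / elapsed_s) < max_cpu_fraction


# 6 hours elapsed at roughly 0.5% CPU, similar to the figures reported above
print(is_stalled(6 * 3600, 0.005 * 6 * 3600))
```

A task that spends most of its elapsed time on the CPU, as a working cmsRun job does, would not be flagged by this check.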
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,550,002 RAC: 43,533 |
just seeing several failed tasks on one of my machines; they failed after about 20 minutes - see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=405218808 excerpt from stderr (right on top): <![CDATA[ <message> Die Platzhalterzeichen für Dateinamen (* oder ?) wurden falsch eingegeben, oder es wurden zu viele Platzhalterzeichen angegeben. (0xd0) - exit code 208 (0xd0)</message> <stderr_txt> [English: The wildcard characters for file names (* or ?) were entered incorrectly, or too many wildcard characters were specified.] |
Send message Joined: 2 May 07 Posts: 2245 Credit: 174,006,243 RAC: 8,727 |
I saw this on Sunday and Monday as well. You can stop them, or wait and keep an eye on the CPU usage. They stopped after 20 minutes of run time and a few seconds of CPU time. |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,550,002 RAC: 43,533 |
"just seeing several failed tasks on one of my machines; they failed after about 20 minutes - see here:" looks exactly like when tasks receive no jobs - is there, in addition, a problem with job submission? |
Send message Joined: 2 May 07 Posts: 2245 Credit: 174,006,243 RAC: 8,727 |
This is in the hands of the CMS group; I think it's a combination of update, upgrade and... |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 7,896,399 RAC: 12,060 |
"This is in the Hand of the CMS-Group," Yes, thanks for your patience. Running for 20 minutes and then stopping is the behaviour I see on my machines too. We're trying to set up for multi-core jobs, although currently CMS@Home will only send 1-core VMs. We can request more cores on CMS@Home-dev, and the tests we did today show that the number of cores is being passed down the line, but Condor doesn't start running jobs. There's some head-scratching going on. |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 7,896,399 RAC: 12,060 |
|
Send message Joined: 2 May 07 Posts: 2245 Credit: 174,006,243 RAC: 8,727 |
We have also done it this way in the past. Our slogan was "make a coffee" and look again the next day to see what happens. |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,550,002 RAC: 43,533 |
"We've decided to roll back the changes from this morning, to try to get a handle on what didn't go the way we thought it would. Keep us informed if anything doesn't revert to 'normality'." Ivan, how about the tasks that have currently been running for many hours without finishing? Will they ever finish, or should they be aborted in order to allow newly downloaded tasks to get started? |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,550,002 RAC: 43,533 |
I started another task, and after 23 minutes it failed: https://lhcathome.cern.ch/lhcathome/result.php?resultid=405214350 So things are still not working the way they are supposed to :-( |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,510,310 RAC: 2,545 |
I did the same and got the same computation error: EXIT_SUB_TASK_FAILURE. Meanwhile the number of running CMS jobs (not BOINC tasks) was slowly decreasing - 126 was the lowest, but it has now increased to 141. Maybe a revival now? |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,550,002 RAC: 43,533 |
still, the question is what one should do with all the tasks that have been running for 10 hours and longer without finishing. Abort them, because they are most probably dead? |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,510,310 RAC: 2,545 |
". . . . . . 126 the lowest, but now increased to 141. Maybe a revival now?" It seems something is fixed. I've a CMS job running inside a BOINC VM now. |
Send message Joined: 18 Dec 15 Posts: 1827 Credit: 119,550,002 RAC: 43,533 |
"It seems something is fixed. I've a CMS job running inside a BOINC VM now." for how long has the task been running now? |
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,510,310 RAC: 2,545 |
"still the question is what one should do with all the tasks that have been running for 10 hours and longer without finishing." The easiest way is just to abort them. Otherwise they will keep running until they reach 18 hours. The option with more work: (1) Suspend the task without keeping the job in memory; the task should then be saved to disk. (2) Remove the saved state with VirtualBox Manager. (3) Start the VM outside of BOINC and wait until you have a CMS job running. (4) Stop the VM, saving its state to disk, and restart the task within BOINC. |
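The VirtualBox part of the manual recovery described above can also be driven from the command line instead of the VirtualBox Manager GUI. The following is a rough Python sketch that only builds the `VBoxManage` calls for those middle steps; suspending and resuming the task is still done in BOINC. The helper names and the VM name `boinc_example_vm` are placeholders made up for illustration, so look up the real VM name first (for example with `VBoxManage list vms`).

```python
import subprocess


def recovery_commands(vm_name: str) -> list[list[str]]:
    """Build the VBoxManage calls for the middle steps of the manual recovery."""
    return [
        # Remove the saved state written when the task was suspended
        ["VBoxManage", "discardstate", vm_name],
        # Boot the VM outside of BOINC (default front end, so the console is visible)
        ["VBoxManage", "startvm", vm_name],
        # Once a CMS job is running inside the VM, save its state to disk again
        ["VBoxManage", "controlvm", vm_name, "savestate"],
    ]


def run_step(cmd: list[str]) -> None:
    """Run one command and raise if VBoxManage reports an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "boinc_example_vm" is a placeholder, not a real VM name
    for cmd in recovery_commands("boinc_example_vm"):
        print(" ".join(cmd))
```

The commands are printed rather than executed on purpose: the third step should only be run after you have confirmed by hand that a CMS job is running inside the VM, so calling `run_step` on each command in one go would not reproduce the procedure.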
Send message Joined: 14 Jan 10 Posts: 1427 Credit: 9,510,310 RAC: 2,545 |
"It seems something is fixed. I've a CMS job running inside a BOINC VM now." "for how long has the task been running now?" The task has been running for 25 minutes, and cmsRun inside the VM for 13 minutes now (95% CPU). |
©2025 CERN