Message boards : CMS Application : tasks now running unusual long time without CPU usage

Erich56

Joined: 18 Dec 15
Posts: 1690
Credit: 104,113,784
RAC: 122,340
Message 49405 - Posted: 6 Feb 2024, 11:24:51 UTC

Has there been any change in the CMS tasks recently?

The reason why I'm asking is: since this morning, I notice that the tasks on all of my machines are running considerably longer than usual (so far more than double the usual time, and still not finished), with almost no CPU usage.
ID: 49405
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49406 - Posted: 6 Feb 2024, 11:46:03 UTC - in response to Message 49405.  

Has there been any change in the CMS tasks recently?

The reason why I'm asking is: since this morning, I notice that the tasks on all of my machines are running considerably longer than usual (so far more than double the usual time, and still not finished), with almost no CPU usage.

We have some changes being implemented, but we're not sure if they will have any effect yet. The number of running jobs has fallen over the last three hours, so we are currently looking to see if there is any obvious problem.
ID: 49406
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,016
Message 49407 - Posted: 6 Feb 2024, 12:13:41 UTC - in response to Message 49406.  
Last modified: 6 Feb 2024, 12:14:00 UTC

We have some changes being implemented, but we're not sure if they will have any effect yet. The number of running jobs has fallen over the last three hours, so we are currently looking to see if there is any obvious problem.
Still running normally and uninterrupted here.
Last job I got (on dev-system, but same pool, I suppose) 2024-02-06 11:20:57 UTC: fanzago_TC_SLC7_FF_CMS_Home_240126_155045_1909
ID: 49407
maeax
Joined: 2 May 07
Posts: 2108
Credit: 159,819,192
RAC: 107,232
Message 49408 - Posted: 6 Feb 2024, 12:23:01 UTC - in response to Message 49407.  

I have one task from 8:30 UTC that is also running without interruption.
ID: 49408
Erich56
Joined: 18 Dec 15
Posts: 1690
Credit: 104,113,784
RAC: 122,340
Message 49409 - Posted: 6 Feb 2024, 12:25:50 UTC - in response to Message 49407.  

Taking a closer look at one of my faster systems (Intel i9-10900KF CPU running at 4.6 GHz):
tasks which previously finished after between 2 1/2 and 3 hours have now been running for more than 6 hours, with CPU usage between 0.3% and 0.7%, and I have no idea when they will finish.
Seems strange ...
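
The symptom described here (a task well past its usual run time with almost no CPU usage) can be expressed as a simple check. This is only a sketch: the function names and both thresholds are assumptions chosen for illustration, not BOINC or CMS settings. The sample numbers are taken from the post (about 3 hours usual run time, 6 hours elapsed, roughly 0.5% CPU).

```python
def cpu_fraction(elapsed_s, cpu_s):
    """Return CPU time as a fraction of elapsed wall-clock time."""
    if elapsed_s <= 0:
        return 0.0
    return cpu_s / elapsed_s


def looks_stalled(elapsed_s, cpu_s, typical_runtime_s, min_cpu_fraction=0.05):
    """Flag a task that has outlived its usual run time while barely using the CPU.

    typical_runtime_s and min_cpu_fraction are illustrative thresholds only.
    """
    overdue = elapsed_s > typical_runtime_s
    idle = cpu_fraction(elapsed_s, cpu_s) < min_cpu_fraction
    return overdue and idle


if __name__ == "__main__":
    hour = 3600
    # Task from the post: 6 h elapsed, roughly 0.5% CPU, usual run time about 3 h.
    print(looks_stalled(6 * hour, 0.005 * 6 * hour, 3 * hour))
```

A task that is still inside its usual run time is never flagged, so a freshly started VM that has not yet picked up a job would not be reported by this check.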
ID: 49409
Erich56
Joined: 18 Dec 15
Posts: 1690
Credit: 104,113,784
RAC: 122,340
Message 49410 - Posted: 6 Feb 2024, 12:54:21 UTC

just seeing several failed tasks on one of my machines; they failed after about 20 minutes - see here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=405218808

excerpt from stderr (right on top):

<![CDATA[
<message>
The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified.
(0xd0) - exit code 208 (0xd0)</message>
<stderr_txt>
ID: 49410
maeax
Joined: 2 May 07
Posts: 2108
Credit: 159,819,192
RAC: 107,232
Message 49411 - Posted: 6 Feb 2024, 13:05:32 UTC - in response to Message 49410.  
Last modified: 6 Feb 2024, 13:07:09 UTC

I also saw this on Sunday and Monday.
You can stop them, or wait and keep an eye on the CPU usage.
They stopped after 20 minutes of run time and a few seconds of CPU time.
ID: 49411
Erich56
Joined: 18 Dec 15
Posts: 1690
Credit: 104,113,784
RAC: 122,340
Message 49412 - Posted: 6 Feb 2024, 13:08:31 UTC - in response to Message 49410.  

just seeing several failed tasks on one of my machines; they failed after about 20 minutes - see here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=405218808
This looks exactly like what happens when tasks receive no jobs. Is there, in addition, a problem with job submission?
ID: 49412
maeax
Joined: 2 May 07
Posts: 2108
Credit: 159,819,192
RAC: 107,232
Message 49413 - Posted: 6 Feb 2024, 14:17:49 UTC - in response to Message 49412.  

This is in the hands of the CMS group;
I think it's a combination of update, upgrade and...
ID: 49413
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49414 - Posted: 6 Feb 2024, 15:27:44 UTC - in response to Message 49413.  

This is in the hands of the CMS group;
I think it's a combination of update, upgrade and...

Yes, thanks for your patience. The running for 20 minutes and then stopping is the behaviour I see on my machines too.
We're trying to set up for multi-core jobs, although currently CMS@Home will only send 1-core VMs. We can request more cores on CMS@Home-dev and the tests we did today show that the number of cores is being passed down the line, but Condor doesn't start running jobs. There's some head-scratching going on.
ID: 49414
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49415 - Posted: 6 Feb 2024, 15:40:36 UTC - in response to Message 49414.  

We've decided to roll back the changes from this morning, to try to get a handle on what didn't go the way we thought it would. Keep us informed if anything doesn't revert to "normality".
ID: 49415
maeax
Joined: 2 May 07
Posts: 2108
Credit: 159,819,192
RAC: 107,232
Message 49416 - Posted: 6 Feb 2024, 15:54:43 UTC - in response to Message 49415.  

We have also done it this way in the past.
Our slogan was "make a coffee" and look again the next day to see what happens.
ID: 49416
Erich56
Joined: 18 Dec 15
Posts: 1690
Credit: 104,113,784
RAC: 122,340
Message 49417 - Posted: 6 Feb 2024, 16:26:04 UTC - in response to Message 49415.  

We've decided to roll back the changes from this morning, to try to get a handle on what didn't go the way we thought it would. Keep us informed if anything doesn't revert to "normality".
Ivan, how about the tasks that have been running for many hours without finishing? Will they ever finish, or should they be aborted in order to allow newly downloaded tasks to get started?
ID: 49417
Erich56
Joined: 18 Dec 15
Posts: 1690
Credit: 104,113,784
RAC: 122,340
Message 49418 - Posted: 6 Feb 2024, 17:00:00 UTC - in response to Message 49417.  

I started another task, and after 23 minutes it failed:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=405214350

So things are still not working the way they are supposed to :-(
ID: 49418
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,016
Message 49421 - Posted: 6 Feb 2024, 17:52:17 UTC - in response to Message 49418.  

I did the same and got the same computation error: EXIT_SUB_TASK_FAILURE
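
For reference, the 208 and 0xd0 in the stderr excerpt earlier in the thread are the same number in decimal and hexadecimal. The sketch below only shows that arithmetic and attaches the error name reported in this post. The lookup table is an assumption believed to match BOINC's `error_numbers.h`; check it against your BOINC version. `describe_exit_code` is a hypothetical helper, not a BOINC function.

```python
# Assumed mapping: believed to match BOINC's error_numbers.h and consistent
# with the error name reported in this thread. Verify before relying on it.
BOINC_EXIT_CODES = {
    208: "EXIT_SUB_TASK_FAILURE",
}


def describe_exit_code(code):
    """Format an exit code in decimal and hex, with its assumed BOINC name."""
    name = BOINC_EXIT_CODES.get(code, "unknown")
    return f"{code} (0x{code:x}) {name}"


if __name__ == "__main__":
    print(describe_exit_code(208))
```

On Windows, the message text shown next to the code appears to be the operating system's own description of system error 208, which is why it talks about filename wildcards rather than about the CMS job.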

Meanwhile, the number of running CMS jobs (not BOINC tasks) has been slowly decreasing - 126 the lowest, but now increased to 141. Maybe a revival now?
ID: 49421
Erich56
Joined: 18 Dec 15
Posts: 1690
Credit: 104,113,784
RAC: 122,340
Message 49422 - Posted: 6 Feb 2024, 18:07:43 UTC

The question remains what one should do with all the tasks that have been running for 10 hours and longer without finishing.
Abort them, because they are most probably dead?
ID: 49422
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,016
Message 49423 - Posted: 6 Feb 2024, 18:13:09 UTC - in response to Message 49421.  

. . . . . . 126 the lowest, but now increased to 141. Maybe a revival now?
It seems something is fixed. I've a CMS job running inside a BOINC VM now.
ID: 49423
Erich56
Joined: 18 Dec 15
Posts: 1690
Credit: 104,113,784
RAC: 122,340
Message 49424 - Posted: 6 Feb 2024, 18:14:49 UTC - in response to Message 49423.  

It seems something is fixed. I've a CMS job running inside a BOINC VM now.
for how long has the task been running now?
ID: 49424
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,016
Message 49425 - Posted: 6 Feb 2024, 18:18:51 UTC - in response to Message 49422.  

The question remains what one should do with all the tasks that have been running for 10 hours and longer without finishing.
Abort them, because they are most probably dead?

The easiest way is just to abort them. Otherwise, let them run until they reach 18 hours.

The option that takes more work: suspend the task without keeping it in memory, so that the task is saved to disk.
Remove the saved state with VirtualBox Manager. Start the VM outside of BOINC and let it run until you have a CMS job running. Stop the VM, saving its state to disk, and then resume the task within BOINC.
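
The post describes these steps with the VirtualBox Manager GUI. As a rough sketch, the same steps could be done from the command line with `VBoxManage`. The assumptions: the task is already suspended in BOINC, the VM name is a made-up placeholder (look up the real one with `VBoxManage list vms`), and the VM is visible to the user running the commands, which may not be the case if BOINC runs under a different account. The helper only builds the commands and prints them; it does not run anything.

```python
import shlex
from typing import List


def recovery_commands(vm_name: str) -> List[List[str]]:
    """Build (but do not run) VBoxManage calls for the manual recovery steps."""
    return [
        # Remove the saved state left behind when the task was suspended.
        ["VBoxManage", "discardstate", vm_name],
        # Start the VM outside of BOINC, then wait by hand until a CMS job runs.
        ["VBoxManage", "startvm", vm_name, "--type", "gui"],
        # Stop the VM and save its state to disk before resuming in BOINC.
        ["VBoxManage", "controlvm", vm_name, "savestate"],
    ]


if __name__ == "__main__":
    # "boinc_example_vm" is a hypothetical name used only for illustration.
    for argv in recovery_commands("boinc_example_vm"):
        print(shlex.join(argv))
```

Waiting until a CMS job is actually running inside the VM remains a manual check between the second and third command.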
ID: 49425
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,016
Message 49426 - Posted: 6 Feb 2024, 18:20:10 UTC - in response to Message 49424.  

It seems something is fixed. I've a CMS job running inside a BOINC VM now.
for how long has the task been running now?
The task has been running for 25 minutes, and cmsRun inside the VM for 13 minutes now (95% CPU).
ID: 49426


©2024 CERN