Message boards : Theory Application : Task fails few minutes after start but keeps running, running, running ...

Erich56

Joined: 18 Dec 15
Posts: 1451
Credit: 35,179,299
RAC: 40,270
Message 44080 - Posted: 10 Jan 2021, 9:17:29 UTC

I keep having cases where a Theory task fails a few minutes after it starts but, instead of shutting down and aborting, stays running without using any CPU.
The only ways to detect the problem are to notice in the Windows Task Manager that VBoxHeadless.exe is not using a CPU core, or to switch to the VM console, where it says
"[ERROR] 'cvmfs_config probe grid.cern.ch' failed"
and then to abort the task manually. That wastes a lot of time if one finds out only much later.

Here is the example of such a task:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=293971418

It started at 21:15, and at 21:22 it showed the entry cited above, "[ERROR] 'cvmfs_config probe grid.cern.ch' failed".

I only found out by coincidence one day later and aborted the task.

My question here is: why does the task not recognize the problem and abort itself? Behavior like this is a real waste.
What if I don't get a chance to look at the computer for several days? The task acts as if it were working, but it isn't, and so it does not make room for the next task in the queue.
The fact is that I have Theory running on several of my machines, and I simply can't check all these computers every other hour just to make sure that the Theory tasks are still running the way they are supposed to.

Any ideas how to get this problem solved?
maeax

Joined: 2 May 07
Posts: 1301
Credit: 39,583,040
RAC: 11,356
Message 44081 - Posted: 10 Jan 2021, 10:42:25 UTC

Checking once a day is useful.
There are so many possibilities for a disconnection or other interruptions.
Erich56

Message 44083 - Posted: 10 Jan 2021, 11:19:57 UTC - in response to Message 44081.  

"Checking once a day is useful.
There are so many possibilities for a disconnection or other interruptions."
This is true, of course. Whenever I can, I check the machines every few hours.
Still, and this is my point, it would be great if a task that fails for exactly this reason aborted itself instead of running uselessly forever. Just a matter of code, I guess.
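
Until something like that exists in the application itself, the manual Task Manager check described in this thread could be roughly automated on the host. The following is only a sketch under several assumptions, not anything provided by the project: it assumes an English-style `tasklist /V /FO CSV /NH` output in which the PID is the second column and the CPU time (H:MM:SS) is the eighth, and the function names and the 5-second threshold are made up for illustration. It only reports VBoxHeadless.exe processes whose CPU time barely grew between two samples; it does not know which BOINC task a process belongs to, and a VM that is idle for a legitimate reason (for example a suspended task) would be reported too, so aborting is still left to the user.

```python
import csv
import io
import subprocess
import time


def parse_cpu_time(text):
    """Convert a tasklist CPU time string such as '1:02:03' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_tasklist(csv_text):
    """Map PID -> CPU seconds from 'tasklist /V /FO CSV /NH' output.

    Assumption: column 2 is the PID and column 8 is the CPU time.
    """
    usage = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 8:  # skips blank lines and the "INFO: No tasks ..." message
            continue
        usage[int(row[1])] = parse_cpu_time(row[7])
    return usage


def stalled_pids(before, after, min_cpu_seconds=5):
    """Return PIDs seen in both samples whose CPU time grew by less than the threshold."""
    return sorted(
        pid for pid in before
        if pid in after and after[pid] - before[pid] < min_cpu_seconds
    )


def sample_vbox_cpu():
    """Take one CPU-time sample of all VBoxHeadless.exe processes (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/V", "/FO", "CSV", "/NH",
         "/FI", "IMAGENAME eq VBoxHeadless.exe"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist(result.stdout)


def find_stalled_vms(interval_s=300):
    """Sample twice, interval_s seconds apart, and return PIDs that look stalled."""
    before = sample_vbox_cpu()
    time.sleep(interval_s)
    return stalled_pids(before, sample_vbox_cpu())
```

Calling `find_stalled_vms()` from a scheduled job and sending a notification when the returned list is not empty would at least shorten the time until a stuck task is noticed.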

©2021 CERN