Task fails few minutes after start but keeps running, running, running ...
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1686 Credit: 100,422,554 RAC: 102,574 |
I keep having cases where a Theory task fails a few minutes after it starts but does not shut down and abort; instead it stays running without using any CPU. The problem can only be detected by noticing in the Windows Task Manager that VBoxHeadless.exe is not using a CPU core, or by switching to the VM console, where it says "[ERROR] 'cvmfs_config probe grid.cern.ch' failed"; the task then has to be aborted manually. That wastes a lot of time if one only finds out much later. Here is an example of such a task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=293971418 It started at 21:15, and at 21:22 it shows the entry cited above, "[ERROR] 'cvmfs_config probe grid.cern.ch' failed". I only found out by coincidence one day later and aborted the task. My question is: why does the task not recognize the problem and abort itself? Behaviour like this is a real waste. What if I don't get a chance to look at the computer for several days? The task acts as if it were working, but it isn't, and so it does not make room for the next task in the queue. I have Theory running on several of my machines, and I simply can't check all of these computers every other hour just to make sure the Theory tasks are still running the way they are supposed to. Any ideas on how to get this problem solved? |
Joined: 2 May 07 Posts: 2071 Credit: 156,167,079 RAC: 105,421 |
Checking once a day is useful. There are so many possibilities for a disconnection or other interruptions. |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,422,554 RAC: 102,574 |
"Checking once a day is useful." This is true, of course. Whenever I can, I take a look at the machines at intervals of several hours. Still, and this is my point, it would be great if a task that fails for exactly this reason aborted itself instead of running uselessly forever. Just a matter of code, I guess. |
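
For anyone who wants to automate the manual check described in this thread, here is a minimal sketch of the decision logic only. It is not part of the Theory application or of BOINC. The function names and the thresholds (`window_s`, `min_cpu_s`) are hypothetical and chosen for illustration; the error string is the one quoted in the first post.

```python
from typing import Sequence, Tuple

# Error text quoted in the thread. How the console text is obtained
# is deliberately left open here.
PROBE_ERROR = "'cvmfs_config probe grid.cern.ch' failed"


def has_probe_failure(console_text: str) -> bool:
    """Return True if the console output contains the CVMFS probe error."""
    return PROBE_ERROR in console_text


def is_stalled(samples: Sequence[Tuple[float, float]],
               window_s: float = 1800.0,   # hypothetical: 30 minutes
               min_cpu_s: float = 5.0) -> bool:  # hypothetical threshold
    """Decide whether a process looks idle.

    samples: (wall_clock_seconds, cumulative_cpu_seconds) pairs, oldest first.
    Returns True only if there is a sample at least window_s older than the
    latest one and CPU time grew by less than min_cpu_s since that sample.
    """
    if len(samples) < 2:
        return False
    latest_t, latest_cpu = samples[-1]
    # Walk backwards to the most recent sample that is old enough.
    for t, cpu in reversed(samples[:-1]):
        if latest_t - t >= window_s:
            return (latest_cpu - cpu) < min_cpu_s
    return False  # not enough history yet
```

The CPU samples could come from periodically reading the cumulative CPU time of the VBoxHeadless.exe process, and a positive result could trigger a notification or an abort of the task. Both of those steps depend on the local setup and are left out of the sketch.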