Thread 'Task fails few minutes after start but keeps running, running, running ...'

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,358,764 RAC: 45,827	Message 44080 - Posted: 10 Jan 2021, 9:17:29 UTC I keep having cases where a Theory task fails a few minutes after it starts but, instead of shutting down and aborting, stays running without using any CPU. The problem can only be detected by seeing in the Windows Task Manager that VBoxHeadless.exe is not using a CPU core, or by switching to the VM Console, where it says "[ERROR] 'cvmfs_config probe grid.cern.ch' failed"; the task then has to be aborted manually. That wastes a lot of time if one finds out only much later. Here is an example of such a task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=293971418 It started at 21:15, and at 21:22 it shows the entry cited above, "[ERROR] 'cvmfs_config probe grid.cern.ch' failed". I only found out by coincidence one day later and aborted the task. My question is: why does the task not recognize the problem and abort itself? Behaviour like this is a real waste. What if I don't get a chance to look at the computer for several days? The task acts as if it were working, but it isn't, and so it does not make room for the next task in the queue. I have Theory running on several of my machines, and I simply can't check all of these computers every couple of hours just to make sure that the Theory tasks are still running the way they are supposed to. Any ideas on how to get this problem solved?

maeax Joined: 2 May 07 Posts: 2286 Credit: 178,847,258 RAC: 1,689	Message 44081 - Posted: 10 Jan 2021, 10:42:25 UTC Checking once a day is useful. There are so many possibilities for a disconnection or other interruptions.

Erich56 Joined: 18 Dec 15 Posts: 1967 Credit: 159,358,764 RAC: 45,827	Message 44083 - Posted: 10 Jan 2021, 11:19:57 UTC - in response to Message 44081. Quote: "Checking once a day is useful. There are so many possibilities for a disconnection or other interruptions." This is true, of course. Whenever I can, I take a look at the machines every few hours. Still, and this is my point, it would be great if a task that fails for exactly this reason aborted itself instead of running uselessly forever. Just a matter of code, I guess.
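
Until the task does this itself, the two manual checks described in the first post (no CPU use, and the CVMFS error on the console) could in principle be automated on the volunteer's side. What follows is only a minimal sketch in Python, not anything the project provides. The names `Sample`, `is_stalled` and `shows_cvmfs_failure` and the thresholds are hypothetical, and the sketch deliberately leaves out how the CPU time and console text are collected on a given machine and how the task is then aborted.

```python
from dataclasses import dataclass
from typing import List

# Error text quoted in this thread; any other failure message is not covered.
CVMFS_ERROR = "'cvmfs_config probe grid.cern.ch' failed"


@dataclass
class Sample:
    """One observation of a task (hypothetical structure)."""
    wall_seconds: float  # wall-clock time at which the sample was taken
    cpu_seconds: float   # cumulative CPU time of the task at that moment


def is_stalled(samples: List[Sample],
               window_seconds: float = 1800.0,
               min_cpu_gain: float = 5.0) -> bool:
    """Return True if the task gained less than min_cpu_gain CPU seconds
    over at least window_seconds of wall time.

    Assumes samples are sorted by wall_seconds, oldest first.
    """
    if len(samples) < 2:
        return False
    latest = samples[-1]
    baseline = None
    # Pick the most recent sample that is at least window_seconds old.
    for sample in samples:
        if latest.wall_seconds - sample.wall_seconds >= window_seconds:
            baseline = sample
        else:
            break
    if baseline is None:
        return False  # not enough history yet to judge
    return (latest.cpu_seconds - baseline.cpu_seconds) < min_cpu_gain


def shows_cvmfs_failure(console_text: str) -> bool:
    """Return True if the console or log text contains the CVMFS probe error."""
    return CVMFS_ERROR in console_text
```

A task would be treated as a candidate for aborting only when `is_stalled` returns True, ideally confirmed by `shows_cvmfs_failure`, so that a task which is merely paused or waiting for a short time is not killed by mistake. The 30-minute window and 5-second gain are guesses and would need tuning per machine.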