Message boards : ATLAS application : Potentially failing vbox tasks today

David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 40941 - Posted: 13 Dec 2019, 14:54:19 UTC

Hi all,

A change last night in an upstream component that we use for ATLAS tasks means that vbox tasks sent between then and now are likely to fail. The tasks will succeed in producing the HITS result file but will fail to copy it to the shared directory for upload by the BOINC client. You may not notice the failures, since the tasks will still be validated. I have introduced a fix for this in the ATLAS bootstrap script, so tasks starting from around now should work properly. The native tasks are unaffected by this problem.
ID: 40941
Harri Liljeroos

Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 40942 - Posted: 13 Dec 2019, 15:02:47 UTC

So far I have 7 failed ATLAS tasks today that match your description.
ID: 40942
lazlo_vii

Joined: 20 Nov 19
Posts: 21
Credit: 1,074,330
RAC: 0
Message 40947 - Posted: 13 Dec 2019, 19:19:13 UTC - in response to Message 40942.  
Last modified: 13 Dec 2019, 19:22:59 UTC

I have 6 ATLAS (4 CPU) tasks running on two systems right now. In all 6, athena.py stopped running simultaneously. My systems went from a load average of 12+ to just 1+ for several minutes. I stopped boinc-client, autofs, and squid. I brought up squid, autofs, and boinc-client again and saw no change. According to boinc-manager all tasks were running, but they had not written checkpoints. After a few minutes of watching athena.py try to restart itself several times, everything returned to normal. I don't know whether these tasks will fail or turn out to be invalid.

I think The Grinch has invaded CERN.

EDIT: I should add that these are NOT vbox tasks; these are ATLAS native.
ID: 40947
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 40975 - Posted: 16 Dec 2019, 10:07:59 UTC - in response to Message 40947.  

lazlo_vii wrote: "I have 6 ATLAS (4 CPU) tasks running on two systems right now. In all 6, athena.py stopped running simultaneously. My systems went from a load average of 12+ to just 1+ for several minutes. I stopped boinc-client, autofs, and squid. I brought up squid, autofs, and boinc-client again and saw no change. According to boinc-manager all tasks were running, but they had not written checkpoints. After a few minutes of watching athena.py try to restart itself several times, everything returned to normal. I don't know whether these tasks will fail or turn out to be invalid."
How close together is "simultaneously"?
It sounds like the normal behaviour as the task reaches phase 4 prior to completion.
ID: 40975
Harri Liljeroos

Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 41008 - Posted: 18 Dec 2019, 16:03:34 UTC

I have leftover directories named xxxxx_ATLAS_hits from the 13th of December in the projects\lhcathome.cern.ch_lhcathome folder. I think these are left over from the failed tasks this thread is talking about. They contain the HITS.xxxx files, among other things. Can I delete them?
ID: 41008
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 41014 - Posted: 19 Dec 2019, 13:17:19 UTC - in response to Message 41008.  

Harri Liljeroos wrote: "I have leftover directories named xxxxx_ATLAS_hits from the 13th of December in the projects\lhcathome.cern.ch_lhcathome folder. I think these are left over from the failed tasks this thread is talking about. They contain the HITS.xxxx files, among other things. Can I delete them?"


Yes, you can delete them. Even though you have valid results from those tasks, it is not possible to use them.
ID: 41014
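
For anyone with many of these leftover directories, the cleanup described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the `*_ATLAS_hits` pattern is inferred from the directory names mentioned in this thread, and the script assumes that every matching directory in the project folder is a leftover rather than part of a task that is still running. It defaults to a dry run so you can review the list before anything is deleted.

```python
from pathlib import Path
import shutil


def find_leftover_hits_dirs(project_dir):
    """Return the directories in project_dir whose names end in _ATLAS_hits.

    The naming pattern is an assumption based on the names reported in
    this thread; plain files with a matching name are ignored.
    """
    root = Path(project_dir)
    return sorted(p for p in root.glob("*_ATLAS_hits") if p.is_dir())


def remove_leftover_hits_dirs(project_dir, dry_run=True):
    """Delete the matching directories and return their names.

    With dry_run=True (the default) nothing is deleted; the names that
    would be removed are only reported.
    """
    names = []
    for directory in find_leftover_hits_dirs(project_dir):
        if not dry_run:
            shutil.rmtree(directory)
        names.append(directory.name)
    return names
```

A cautious way to use it is to call `remove_leftover_hits_dirs(path)` first with the path to your own projects\lhcathome.cern.ch_lhcathome folder, check that the returned names are the ones dated 13 December, and only then repeat the call with `dry_run=False`, ideally while no ATLAS tasks are running.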
