pilotErrorDiag": "Payload failed: Interrupt failure code: 1201
Author | Message |
---|---|
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Native ATLAS tasks on my host https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10557960 are running to completion and validating, but they seem to be taking far longer than they should. True, the CPU has a weak floating-point unit, but maybe the long run times have something to do with the error in the subject line, which I see in the section below in some (but not all) of the most recently returned tasks: running cmd return value is 0 ***********************log_extracts.txt************************* (/var/lib/boinc-client/slots/2/Panda_Pilot_3038_1534222770/PandaJob/athena_stdout.txt does not exist) - Walltime - JobRetrival=0, StageIn=10, Execution=434, StageOut=0, CleanUp=0 ***********************pilot_error_report.json********************* { "4024931982": { "2": [ { "pilotErrorCode": 1201, "pilotErrorDiag": "Payload failed: Interrupt failure code: 1201" } ] } } |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Looking at this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=204793909 "This is trying to run the run_atlas wrapper for the 2nd time,but it is not an Event Service job,so will restart the job" Maybe the WU was suspended and resumed? The native app doesn't handle this very well unless you keep suspended jobs in memory, so it normally restarts the job when the WU is resumed. The error you see is from when the WU was suspended, but it doesn't affect the result after the restart: you can see the HITS file was correctly produced. We'll try to improve how the restart is handled. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I think I rebooted while those 3 were running, which would mean the saved tasks would not have been in memory when they resumed. Actually, I don't just reboot while tasks are running; I shut down the BOINC client first and wait for CVMFS and the wrapper to wind down. OK, that explains the error. Thanks, all is well. |
Joined: 7 Jan 07 Posts: 41 Credit: 15,959,427 RAC: 271 |
Hello, I can confirm that I get the same kind of error after restarting the WU, even when keeping jobs in memory. In this case the restart came from switching "on the fly" from 1 thread to 2 threads because the task was taking a very long time to complete. This 1-thread task completed in 21 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=204126530 This 2-thread task completed in 18 hours (after switching from 1 thread): https://lhcathome.cern.ch/lhcathome/result.php?resultid=204577991 BTW, this 2-thread task is still in the running state according to its PandaID (https://bigpanda.cern.ch/job/4023144304/). |
Joined: 15 Jun 08 Posts: 2386 Credit: 222,892,477 RAC: 138,173 |
"... having a very long time to complete." Monitoring tip for impatient ATLAS native volunteers: open a console window and run: watch -n10 "find ~/BOINC_ATLAS/slots/ \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) |sort |xargs -I {} -n1 sh -c \"egrep 'INFO.*Event nr. ' {} |tail -n1\"" Replace "~/BOINC_ATLAS/slots/" with the location of your ATLAS slots folder. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
So that's where it's all hiding. Thanks. A treasure trove of data to obsess over compulsively. |
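
For anyone who finds the nested quoting in the monitoring one-liner above hard to adapt, here is a rough sketch of the same idea as a small shell function. It is not the poster's exact command: the function name last_event_lines is made up for illustration, it uses grep -E instead of egrep (which newer GNU grep versions warn about), and it assumes your log paths contain no newlines. The file names and the search pattern are taken from the one-liner.

```shell
#!/bin/sh
# Sketch only: print the latest "Event nr." progress line from each
# ATLAS log file found under a slots directory.
# The function name is hypothetical; file names and pattern come from
# the one-liner posted in the thread.
last_event_lines() {
    slots_dir="$1"
    # Find both log variants, sort so the output order is stable,
    # then show only the last matching line of each file.
    find "$slots_dir" \( -name 'log.EVNTtoHITS' -o -name 'AthenaMP.log' \) |
        sort |
        while IFS= read -r logfile; do
            grep -E 'INFO.*Event nr. ' "$logfile" | tail -n 1
        done
}
```

After sourcing the function, something like watch should not be needed for a quick check: calling last_event_lines with the path of your own slots folder (for example the ~/BOINC_ATLAS/slots/ location mentioned above) prints one line per running task, and you can rerun it whenever you want an update.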
©2024 CERN