pilotErrorDiag": "Payload failed: Interrupt failure code: 1201

bronco (Joined: 13 Apr 18, Posts: 443, Credit: 8,438,885, RAC: 0)
Message 36428 - Posted: 14 Aug 2018, 23:51:27 UTC

Native ATLAS tasks on my host https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10557960 are running to completion and validating, but they seem to be taking far longer than they should. True, the CPU has a weak floating-point unit, but maybe the long run times have something to do with the error in the subject line, which I see in the section below in some (but not all) of the most recently returned tasks.

    running cmd return value is 0
    *********************log_extracts.txt*********************
    (/var/lib/boinc-client/slots/2/Panda_Pilot_3038_1534222770/PandaJob/athena_stdout.txt does not exist)
    - Walltime -
    JobRetrival=0, StageIn=10, Execution=434, StageOut=0, CleanUp=0
    *******************pilot_error_report.json*******************
    {
      "4024931982": {
        "2": [
          {
            "pilotErrorCode": 1201,
            "pilotErrorDiag": "Payload failed: Interrupt failure code: 1201"
          }
        ]
      }
    }
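
To see which pilot error codes appear in a returned task, one option is to save the task's stderr output to a local file and pull out the `pilotErrorCode` values. This is a minimal sketch, not an official tool: the function name `pilot_error_codes` and the file name in the example call are hypothetical, and it assumes the JSON looks like the excerpt above.

```shell
#!/bin/sh
# Print every pilotErrorCode value found in a saved task log, one per line.
# Assumption: the log contains a pilot_error_report.json section like the
# one quoted in this thread. The function name is hypothetical.
pilot_error_codes() {
    # First grep isolates each '"pilotErrorCode": <number>' fragment,
    # second grep keeps only the trailing number.
    grep -o '"pilotErrorCode": *[0-9][0-9]*' "$1" | grep -o '[0-9][0-9]*$'
}

# Example call; task_stderr.txt is a hypothetical file name.
# pilot_error_codes task_stderr.txt
```

Plain `grep` is used here rather than a JSON parser because the stderr output mixes JSON with ordinary log lines.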

David Cameron (Project administrator, Project developer, Project scientist; Joined: 13 May 14, Posts: 387, Credit: 15,314,184, RAC: 0)
Message 36436 - Posted: 15 Aug 2018, 11:33:43 UTC - in response to Message 36428.

Looking at this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=204793909

    This is trying to run the run_atlas wrapper for the 2nd time,but it is not an Event Service job,so will restart the job

Maybe the WU was suspended and resumed? The native app doesn't handle this very well unless you keep the suspended jobs in memory, so it normally restarts when the WU is resumed. The error you see is from when the WU was suspended, but it doesn't affect the result after the restart: you can see the HITS file was correctly produced. We'll try to improve how the restart is handled.

bronco (Joined: 13 Apr 18, Posts: 443, Credit: 8,438,885, RAC: 0)
Message 36438 - Posted: 15 Aug 2018, 15:14:09 UTC - in response to Message 36436.

I think I rebooted while those 3 were running, which would mean the saved tasks would not have been in memory when they resumed. Actually, I don't just reboot while tasks are running; I shut down the BOINC client first and wait for CVMFS and the wrapper to wind down. OK, that explains the error. Thanks, all is well.

zepingouin (Joined: 7 Jan 07, Posts: 41, Credit: 16,105,862, RAC: 0)
Message 36439 - Posted: 15 Aug 2018, 15:20:05 UTC - in response to Message 36436.

Hello, I can confirm that I get the same kind of error after restarting the WU, even with jobs kept in memory. In my case the restart came from switching "on the fly" from 1 thread to 2 threads, because the task was taking a very long time to complete.

This 1-thread task completed in 21 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=204126530

This 2-thread task completed in 18 hours (after switching from 1 thread): https://lhcathome.cern.ch/lhcathome/result.php?resultid=204577991

BTW, this 2-thread task is still in the running state according to its PanDA ID: https://bigpanda.cern.ch/job/4023144304/

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert; Joined: 15 Jun 08, Posts: 2636, Credit: 275,522,929, RAC: 131,312)
Message 36440 - Posted: 15 Aug 2018, 15:58:48 UTC - in response to Message 36439.

    Quote: ... a very long time to complete.

Monitoring tip for impatient ATLAS native volunteers. Open a console window and run:

    watch -n10 "find ~/BOINC_ATLAS/slots/ \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) |sort |xargs -I {} -n1 sh -c \"egrep 'INFO.*Event nr. ' {} |tail -n1\""

Replace "~/BOINC_ATLAS/slots/" with the location of your ATLAS slots folder.
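
For anyone who finds the nested quoting in the one-liner hard to adapt, the same idea can be sketched as a small shell function. This is an illustrative rewrite, not computezrmle's script: the function name `latest_events` is hypothetical, the log file names and the search pattern are taken from the command above, and the path in the example call is the same placeholder.

```shell
#!/bin/sh
# Print the most recent "Event nr." progress line from each ATLAS log
# found under a slots directory. The function name is hypothetical;
# file names and pattern are copied from the one-liner in this post.
latest_events() {
    slots_dir=$1
    find "$slots_dir" \( -name 'log.EVNTtoHITS' -o -name 'AthenaMP.log' \) | sort |
    while IFS= read -r logfile; do
        # Keep only the last matching line from this log, if there is one.
        last=$(grep -E 'INFO.*Event nr. ' "$logfile" | tail -n 1)
        if [ -n "$last" ]; then
            printf '%s: %s\n' "$logfile" "$last"
        fi
    done
}

# Example call; replace the path with the location of your slots folder.
# latest_events "$HOME/BOINC_ATLAS/slots"
```

Unlike the original, this version prefixes each line with the log file it came from and skips logs that have no matching line yet. To refresh it periodically, it could be saved as a script and run under `watch -n10`.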

bronco (Joined: 13 Apr 18, Posts: 443, Credit: 8,438,885, RAC: 0)
Message 36468 - Posted: 16 Aug 2018, 15:33:36 UTC - in response to Message 36440.

So that's where it's all hiding. Thanks. A treasure trove of data to obsess over compulsively.

LHC@home