Message boards : ATLAS application : pilotErrorDiag": "Payload failed: Interrupt failure code: 1201
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36428 - Posted: 14 Aug 2018, 23:51:27 UTC

Native ATLAS tasks on my host https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10557960 are running to completion and validating, but they seem to be taking far longer than they should. True, the CPU has a weak floating-point unit, but maybe the long run times have something to do with the error in the subject line, which I see in the section below in some (but not all) of the most recently returned tasks.
running cmd return value is 0

***********************log_extracts.txt*************************

(/var/lib/boinc-client/slots/2/Panda_Pilot_3038_1534222770/PandaJob/athena_stdout.txt does not exist)

- Walltime -
JobRetrival=0, StageIn=10, Execution=434, StageOut=0, CleanUp=0

***********************pilot_error_report.json*********************
{
    "4024931982": {
        "2": [
            {
                "pilotErrorCode": 1201,
                "pilotErrorDiag": "Payload failed: Interrupt failure code: 1201"
            }
        ]
    }
}
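
For anyone who wants to scan several of these reports at once, here is a minimal sketch that flattens the pilot_error_report.json structure shown above into rows. It only assumes the nesting visible in this one sample; the function name list_pilot_errors is made up for illustration, and the meaning of the two nesting keys ("4024931982" and "2") is not stated in the log, so they are treated as opaque labels.

```python
import json

# Sample copied verbatim from the log excerpt above.
REPORT = """
{
    "4024931982": {
        "2": [
            {
                "pilotErrorCode": 1201,
                "pilotErrorDiag": "Payload failed: Interrupt failure code: 1201"
            }
        ]
    }
}
"""


def list_pilot_errors(report_text):
    """Return (outer_key, inner_key, code, diag) tuples from a report.

    Hypothetical helper: the layout is inferred from a single sample,
    so real reports may contain fields this sketch does not handle.
    """
    report = json.loads(report_text)
    rows = []
    for outer_key, inner in report.items():
        for inner_key, entries in inner.items():
            for entry in entries:
                rows.append(
                    (
                        outer_key,
                        inner_key,
                        entry.get("pilotErrorCode"),
                        entry.get("pilotErrorDiag"),
                    )
                )
    return rows


for row in list_pilot_errors(REPORT):
    print(row)
```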
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 36436 - Posted: 15 Aug 2018, 11:33:43 UTC - in response to Message 36428.  

Looking at this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=204793909

This is trying to run the run_atlas wrapper for the 2nd time,but it is not an Event Service job,so will restart the job


Maybe the WU was suspended and resumed? The native app doesn't handle this very well unless you keep the suspended jobs in memory, so it normally restarts the job when the WU is resumed.

The error you see is from when the WU was suspended, but it doesn't affect the result after it was restarted - you can see the HITS file was correctly produced. We'll try to improve how the restart is handled.
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36438 - Posted: 15 Aug 2018, 15:14:09 UTC - in response to Message 36436.  

I think I rebooted while those 3 were running, which would mean the saved tasks were not in memory when they resumed. Actually, I don't just reboot while tasks are running; I shut down the BOINC client and wait for CVMFS and the wrapper to wind down first. OK, that explains the error. Thanks, all is well.
zepingouin

Joined: 7 Jan 07
Posts: 41
Credit: 15,959,427
RAC: 271
Message 36439 - Posted: 15 Aug 2018, 15:20:05 UTC - in response to Message 36436.  

Hello,

I confirm I get the same kind of error after restarting the WU, even when keeping jobs in memory. In this case I switched "on the fly" from 1 thread to 2 threads because the task was taking a very long time to complete.

This 1-thread task completed in 21 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=204126530
This 2-thread task completed in 18 hours (after switching from 1 thread): https://lhcathome.cern.ch/lhcathome/result.php?resultid=204577991

BTW, this 2-thread task is still in the running state according to its PandaID page: https://bigpanda.cern.ch/job/4023144304/
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,477
RAC: 138,173
Message 36440 - Posted: 15 Aug 2018, 15:58:48 UTC - in response to Message 36439.  

... having a very long time to complete.

Monitoring tip for impatient ATLAS native volunteers.
Open a console window and run:
watch -n10 "find ~/BOINC_ATLAS/slots/ \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) |sort |xargs -I {} sh -c \"grep -E 'INFO.*Event nr. ' {} |tail -n1\""

Replace "~/BOINC_ATLAS/slots/" with the location of your ATLAS slots folder.
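
If you would rather do the same check from a script, the following is a rough Python sketch of what the one-liner above does: walk the slots folder, pick out files named log.EVNTtoHITS or AthenaMP.log, and report the last line in each that matches the "INFO ... Event nr. " pattern. The function names find_logs and last_event_line are made up for this example, and the default path is just the placeholder from the tip, so adjust it to your own setup. Unlike the shell pattern, the dot after "nr" is matched literally here.

```python
import os
import re

# File names and pattern taken from the shell one-liner above.
LOG_NAMES = {"log.EVNTtoHITS", "AthenaMP.log"}
EVENT_RE = re.compile(r"INFO.*Event nr\. ")


def find_logs(slots_dir):
    """Return sorted paths of all matching log files below slots_dir."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(slots_dir):
        for name in filenames:
            if name in LOG_NAMES:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def last_event_line(path):
    """Return the last line of the file matching EVENT_RE, or None."""
    last = None
    with open(path, errors="replace") as handle:
        for line in handle:
            if EVENT_RE.search(line):
                last = line.rstrip("\n")
    return last


if __name__ == "__main__":
    # Placeholder location from the tip; replace with your slots folder.
    slots = os.path.expanduser("~/BOINC_ATLAS/slots/")
    for log in find_logs(slots):
        print(log, last_event_line(log))
```

Run once, this prints a single snapshot; wrapping the call in watch, as in the tip, gives the same periodic refresh.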
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36468 - Posted: 16 Aug 2018, 15:33:36 UTC - in response to Message 36440.  

So that's where it's all hiding. Thanks. A treasure trove of data to obsess over compulsively.