Message boards : ATLAS application : tasks fail with "Radical guest time change"

Erich56
Joined: 18 Dec 15
Posts: 1459
Credit: 35,617,792
RAC: 43,725
Message 45144 - Posted: 20 Jul 2021, 4:25:56 UTC

Since yesterday, I have had several cases where a WU failed after about 4 minutes.
Excerpt from the stderr:

2021-07-20 01:39:12 (7380): Guest Log: Failed to produce a result! Shutting down the machine.
2021-07-20 01:39:19 (7380): Guest Log: 00:00:10.025871 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 189 176 023 000ns (GuestNow=1 626 737 958 918 963 000 ns GuestLast=1 626 745 148 094 986 000 ns fSetTimeLastLoop=true )
2021-07-20 01:42:32 (7380): Guest Log: 00:03:22.995487 control Guest control service stopped

The complete report can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=322554079

Is this a misconfigured WU, or is there something wrong with my system?
ID: 45144
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1046
Credit: 6,603,873
RAC: 275
Message 45145 - Posted: 20 Jul 2021, 5:01:03 UTC - in response to Message 45144.  

This line appears in all results, including the valid ones. It is logged at least once after the VM starts, because the vdi is much older than the actual time.
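
For anyone who wants to check the numbers in that log line, here is a minimal sketch that parses the "Radical guest time change" message and recomputes the jump from the two timestamps. The function names (`guest_time_change`, `reported_change`) are made up for this post and are not part of VirtualBox or BOINC; the regular expressions assume the line layout shown in the first post.

```python
import re
from datetime import datetime, timezone

# Log line copied from the stderr excerpt in the first post.
LINE = (
    "2021-07-20 01:39:19 (7380): Guest Log: 00:00:10.025871 timesync "
    "vgsvcTimeSyncWorker: Radical guest time change: -7 189 176 023 000ns "
    "(GuestNow=1 626 737 958 918 963 000 ns "
    "GuestLast=1 626 745 148 094 986 000 ns fSetTimeLastLoop=true )"
)


def _to_int(spaced_digits):
    """Turn '1 626 737 ...' into an int by dropping the grouping spaces."""
    return int(spaced_digits.replace(" ", ""))


def guest_time_change(line):
    """Return (guest_now_ns, guest_last_ns, delta_ns), or None if no match."""
    m = re.search(r"GuestNow=([\d ]+?) ?ns GuestLast=([\d ]+?) ?ns", line)
    if m is None:
        return None
    now_ns = _to_int(m.group(1))
    last_ns = _to_int(m.group(2))
    return now_ns, last_ns, now_ns - last_ns


def reported_change(line):
    """Return the change in ns as printed in the log line, or None."""
    m = re.search(r"time change: (-?[\d ]+?) ?ns", line)
    return _to_int(m.group(1)) if m else None


if __name__ == "__main__":
    now_ns, last_ns, delta_ns = guest_time_change(LINE)
    print("reported change (ns):  ", reported_change(LINE))
    print("recomputed change (ns):", delta_ns)
    print("change in seconds:     ", delta_ns / 1e9)
    # Both values are nanoseconds since the Unix epoch.
    print("GuestNow  (UTC):", datetime.fromtimestamp(now_ns // 10**9, tz=timezone.utc))
    print("GuestLast (UTC):", datetime.fromtimestamp(last_ns // 10**9, tz=timezone.utc))
```

For the line above, the recomputed difference (GuestNow minus GuestLast) equals the value printed in the log: about 7189 seconds, a little under two hours, backwards. It is a one-time clock correction inside the guest, not an error code from the task.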
ID: 45145
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1824
Credit: 123,758,172
RAC: 86,110
Message 45146 - Posted: 20 Jul 2021, 5:09:28 UTC - in response to Message 45144.  

The time sync errors have nothing to do with the task failures. They also appear when a task succeeds:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=322554716

Nonetheless, you may want to check the time and time sync settings on your host.


The failed tasks also fail on the wingmen's computers, on Windows (vbox64) as well as on Linux (native):
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=168126490

Looks like a batch with a faulty input file.
The native logfiles print a more detailed hint:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=322548651
[2021-07-19 16:01:51] gzip: stdin: unexpected end of file
[2021-07-19 16:01:51] tar: Child returned status 1
[2021-07-19 16:01:51] tar: Error is not recoverable: exiting now
[2021-07-19 16:01:51] Failed to extract job description from input tarball
16:11:52 (3899757): run_atlas exited; CPU time 0.462507
16:11:52 (3899757): app exit status: 0x1
16:11:52 (3899757): called boinc_finish(195)
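
The "gzip: stdin: unexpected end of file" message above is what you typically get from a truncated archive. As a rough sketch (not part of the ATLAS or BOINC tooling), the same condition can be tested locally by decompressing a .tar.gz to its end, similar in spirit to `gzip -t`. The function name `gzip_stream_is_complete` is hypothetical, and the path you pass in is whatever input file you want to check.

```python
import gzip
import zlib


def gzip_stream_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to its end.

    A truncated file raises EOFError while reading, a damaged one raises
    zlib.error or OSError (gzip.BadGzipFile is a subclass of OSError).
    Missing or unreadable files also count as "not complete" here.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard the data; only the error behaviour matters.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

This only checks the compression layer, which is where the error in the log comes from. It does not validate the tar contents.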
ID: 45146
Erich56
Joined: 18 Dec 15
Posts: 1459
Credit: 35,617,792
RAC: 43,725
Message 45147 - Posted: 20 Jul 2021, 6:31:13 UTC - in response to Message 45146.  

Thanks, computezrmle, for your detailed analysis :-)
ID: 45147
Erich56
Joined: 18 Dec 15
Posts: 1459
Credit: 35,617,792
RAC: 43,725
Message 45148 - Posted: 20 Jul 2021, 9:53:09 UTC

I had two more of these failures within the past two hours.
The same happened on the wingmen's computers.
ID: 45148


©2021 CERN