ATLAS application: tasks fail with "Radical guest time change"
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1688 Credit: 103,123,227 RAC: 124,210 |
Since yesterday, I have had several cases where a WU failed after about 4 minutes. Excerpt from the stderr:
2021-07-20 01:39:12 (7380): Guest Log: Failed to produce a result! Shutting down the machine.
2021-07-20 01:39:19 (7380): Guest Log: 00:00:10.025871 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 189 176 023 000ns (GuestNow=1 626 737 958 918 963 000 ns GuestLast=1 626 745 148 094 986 000 ns fSetTimeLastLoop=true )
2021-07-20 01:42:32 (7380): Guest Log: 00:03:22.995487 control Guest control service stopped
The complete report can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=322554079
Is this a misconfigured WU, or is there something wrong with my system? |
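For readers who want to put the nanosecond figures in that log line into perspective, here is a minimal sketch that parses the entry and converts the reported change to seconds and hours. The function names and the regular expression are illustrative assumptions, not part of VirtualBox or BOINC; only the log line itself comes from the post above.

```python
import re

# Log line copied from the stderr excerpt in the post above.
LINE = (
    "00:00:10.025871 timesync vgsvcTimeSyncWorker: Radical guest time change: "
    "-7 189 176 023 000ns (GuestNow=1 626 737 958 918 963 000 ns "
    "GuestLast=1 626 745 148 094 986 000 ns fSetTimeLastLoop=true )"
)

# Hypothetical pattern written for this one log format.
PATTERN = re.compile(
    r"Radical guest time change:\s*(-?[\d ]+?)\s*ns\s*"
    r"\(GuestNow=([\d ]+?)\s*ns\s+GuestLast=([\d ]+?)\s*ns"
)


def parse_grouped(text):
    """Turn a space-grouped number such as '-7 189 176 023 000' into an int."""
    return int(text.replace(" ", ""))


def guest_time_change(line):
    """Return (delta_ns, guest_now_ns, guest_last_ns) from a timesync log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("no 'Radical guest time change' entry in line")
    delta_ns, now_ns, last_ns = (parse_grouped(g) for g in match.groups())
    return delta_ns, now_ns, last_ns


delta_ns, now_ns, last_ns = guest_time_change(LINE)
delta_s = delta_ns / 1e9
print(f"reported change: {delta_s:.3f} s ({delta_s / 3600:.2f} h)")
print("matches GuestNow - GuestLast:", delta_ns == now_ns - last_ns)
```

The reported change is simply GuestNow minus GuestLast; for this entry the guest clock was set back by roughly 7189 seconds, or about two hours.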
Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,870 RAC: 2,011 |
This line appears in all results, including the valid ones. It shows up at least once after the VM starts, because the vdi is much older than the actual time. |
Joined: 15 Jun 08 Posts: 2401 Credit: 225,536,846 RAC: 122,267 |
The time sync errors have nothing to do with the task failures; they also appear when tasks succeed: https://lhcathome.cern.ch/lhcathome/result.php?resultid=322554716
Nonetheless, you may want to check the time and time sync settings on your host.
The failed tasks also fail on the wingmen's computers, on Windows (vbox64) as well as on Linux (native): https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=168126490
This looks like a batch with a faulty input file. The native logfiles give a more detailed hint: https://lhcathome.cern.ch/lhcathome/result.php?resultid=322548651
[2021-07-19 16:01:51] gzip: stdin: unexpected end of file
[2021-07-19 16:01:51] tar: Child returned status 1
[2021-07-19 16:01:51] tar: Error is not recoverable: exiting now
[2021-07-19 16:01:51] Failed to extract job description from input tarball
16:11:52 (3899757): run_atlas exited; CPU time 0.462507
16:11:52 (3899757): app exit status: 0x1
16:11:52 (3899757): called boinc_finish(195) |
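The "unexpected end of file" message from gzip usually points to a truncated archive. As a rough illustration of how such a file could be detected before extraction, here is a minimal sketch using only the Python standard library. It is not how the ATLAS wrapper checks its input; the function name `check_tarball` and the file names in the demo are hypothetical, and the demo builds its own throwaway archives rather than using a real input tarball.

```python
import gzip
import io
import os
import tarfile
import tempfile
import zlib


def check_tarball(path):
    """Return None if the .tar.gz reads cleanly, else a short error string."""
    try:
        # Pass 1: decompress the whole gzip stream; a truncated file fails here.
        with gzip.open(path, "rb") as gz:
            while gz.read(1 << 20):
                pass
        # Pass 2: walk the tar headers to catch a damaged archive structure.
        with tarfile.open(path, "r:gz") as tar:
            for _ in tar:
                pass
    except (EOFError, OSError, zlib.error, tarfile.TarError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


# Demo on throwaway files (hypothetical names): one intact, one cut in half.
workdir = tempfile.mkdtemp()
good_path = os.path.join(workdir, "input.tar.gz")
bad_path = os.path.join(workdir, "input_truncated.tar.gz")

payload = os.urandom(200_000)  # incompressible, so the archive stays large
with tarfile.open(good_path, "w:gz") as tar:
    info = tarfile.TarInfo("job_description.bin")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

with open(good_path, "rb") as src, open(bad_path, "wb") as dst:
    data = src.read()
    dst.write(data[: len(data) // 2])  # simulate an incomplete download

good_result = check_tarball(good_path)
bad_result = check_tarball(bad_path)
print("intact:   ", good_result)
print("truncated:", bad_result)
```

Reading the full gzip stream first mirrors what the command-line gzip does when it reports an unexpected end of file; the second pass only adds a check of the tar structure itself.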
Joined: 18 Dec 15 Posts: 1688 Credit: 103,123,227 RAC: 124,210 |
Thanks, computezrmle, for your detailed analysis :-) |
Joined: 18 Dec 15 Posts: 1688 Credit: 103,123,227 RAC: 124,210 |
I had two more of these failures within the past two hours. The same happened on the wingmen's computers. |
©2024 CERN