Thread 'some tasks failing after about 20 minutes with heartbeat error'

Author	Message
Erich56 Send message Joined: 18 Dec 15 Posts: 1971 Credit: 159,561,899 RAC: 47,887	Message 48825 - Posted: 27 Oct 2023, 10:57:37 UTC within the past few hours, I am experiencing failing tasks after about 20 minutes from start, on different computers. Stderr says: 2023-10-27 12:49:25 (31400): VM Heartbeat file specified, but missing. 2023-10-27 12:49:25 (31400): VM Heartbeat file specified, but missing file system status. (errno = '2') 2023-10-27 12:49:25 (31400): Powering off VM. Examples see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=400934683 https://lhcathome.cern.ch/lhcathome/result.php?resultid=400934354 what's going wrong? ID: 48825 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1971 Credit: 159,561,899 RAC: 47,887	Message 48826 - Posted: 27 Oct 2023, 12:53:12 UTC - in response to Message 48825. in fact, the title of this posting should read "ALL tasks failing after about 20 minutes ..." Which means that all tasks which were downloaded by all of my computers since late morning failed. Faulty batch? Does no one else make the same experience ? ID: 48826 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1540 Credit: 10,049,804 RAC: 1,347	Message 48827 - Posted: 27 Oct 2023, 14:20:34 UTC - in response to Message 48826. Faulty batch? Does no one else make the same experience ? No faulty batch! A cmsRun started fine on my system. INFO:root:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_11_0_0_pre1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', ''] INFO:root:PSS: 654406; RSS: 654564; PCPU: 63.9; PMEM: 32.1 ID: 48827 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1971 Credit: 159,561,899 RAC: 47,887	Message 48828 - Posted: 27 Oct 2023, 15:09:28 UTC - in response to Message 48827. No faulty batch! ... so what else could then be the reason for all the tasks failing on various machines at various times after about 20 minutes ? ID: 48828 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2725 Credit: 300,256,616 RAC: 45,914	Message 48829 - Posted: 27 Oct 2023, 15:37:49 UTC - in response to Message 48828. Looks like you have network issues, either inside your LAN or to your ISP. All your computers are affected since this morning. It's very unlikely that the CMS vdi is broken on all computers at the same time. Did you run any (windows) updates? Did you do an update on your router? Did your ISP force a reconnect? Most tasks stop when they try to start the bootstrap script which would load additional logging functions from CVMFS. It appears that the latter does not happen. As a result the watchdog service also doesn't start, hence finally the message (from vboxwrapper!) is: VM Heartbeat file specified, but missing. Other tasks also point out network issues, e.g.: 2023-10-27 11:53:53 (16704): Guest Log: [INFO] Testing connection to Frontier 2023-10-27 11:54:23 (16704): Guest Log: [DEBUG] Status run 1 of up to 3: 1 2023-10-27 11:54:58 (16704): Guest Log: [DEBUG] Status run 2 of up to 3: 1 2023-10-27 11:57:07 (12236): Guest Log: Ncat: Connection to 188.114.96.10 failed: Connection timed out. 2023-10-27 11:57:07 (12236): Guest Log: Ncat: Trying next address... 2023-10-27 11:57:07 (12236): Guest Log: Ncat: Network is unreachable. 2023-10-27 13:01:05 (7992): Guest Log: [ERROR] Could not get an x509 credential 2023-10-27 13:21:23 (11492): Guest Log: [ERROR] Probing /cvmfs/oasis.opensciencegrid.org... Failed! All of that requires a network connection to CERN/Cloudflare. CERN Grafana shows that other volunteer's computers process CMS as usual. Suggestion: Check your router and your Squid box. Restart them if necessary. After the restart test only a few fresh tasks before you go all in. ID: 48829 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1971 Credit: 159,561,899 RAC: 47,887	Message 48834 - Posted: 28 Oct 2023, 5:46:44 UTC - in response to Message 48829. Looks like you have network issues, either inside your LAN or to your ISP. Thanks, computezrmle, for the thorough analysis of my problem. Most probably, it was caused by numerous short internet outages from side of my ISP which I didn't catch right away. Since late afternoon of yesterday, everything seems to be right again. So I'll keep my fingers crossed :-) ID: 48834 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1971 Credit: 159,561,899 RAC: 47,887	Message 48841 - Posted: 29 Oct 2023, 6:29:11 UTC this morning, when looking up my tasks list, I noticed that there were a lot of failing tasks on all of my computers - in all cases they broke at about the time of the shift back from summer time to "normal" time. I remember that this happened before with some other BOINC projects, like GPUGRID, but I don't think it ever happened with LHC@home - however, I am not sure. Did someone else experience the same thing last night? If yes, I can preclude any network issues like I seemed to have last Friday. If not, I might have a problem with my network. ID: 48841 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2289 Credit: 178,859,762 RAC: 2,128	Message 48842 - Posted: 29 Oct 2023, 7:16:12 UTC - in response to Message 48841. All PC need one timestamp in the LAN. For me it's the router. Yes, you see this only two times in the year. ID: 48842 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1540 Credit: 10,049,804 RAC: 1,347	Message 48843 - Posted: 29 Oct 2023, 7:39:27 UTC Last modified: 29 Oct 2023, 7:42:41 UTC I had two error tasks because of: "VM Heartbeat file specified, but missing heartbeat" after about 10 hours runtime. Probably 2 cmsRuns each have been done and a third one busy, I think. Both errored out and reported at 29 Oct 2023, 1:10:55 UTC I surely did not set the Windows flag to use UTC on that machine. Maybe computezrmle can recall to me the register entry needed. ID: 48843 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1971 Credit: 159,561,899 RAC: 47,887	Message 48844 - Posted: 29 Oct 2023, 7:49:55 UTC - in response to Message 48843. Both errored out and reported at 29 Oct 2023, 1:10:55 UTC this was about the time all my tasks failed: ... reported 29 Oct 2023, 1:04:17 UTC ... so seemingly I was not the only one with this kind of problem. ID: 48844 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2289 Credit: 178,859,762 RAC: 2,128	Message 48845 - Posted: 29 Oct 2023, 8:14:02 UTC CMS is the only project in LHC@Home, having some trouble at this two timestamps in the year. ID: 48845 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 799 Credit: 65,013,080 RAC: 27,575	Message 48846 - Posted: 29 Oct 2023, 9:41:32 UTC - in response to Message 48845. CMS is the only project in LHC@Home, having some trouble at this two timestamps in the year. All running Theory tasks also failed when the clock was adjusted at the end of daylight saving time. ID: 48846 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1540 Credit: 10,049,804 RAC: 1,347	Message 48847 - Posted: 29 Oct 2023, 10:20:28 UTC - in response to Message 48846. Last modified: 29 Oct 2023, 11:11:40 UTC CMS is the only project in LHC@Home, having some trouble at this two timestamps in the year. All running Theory tasks also failed when the clock was adjusted at the end of daylight saving time. Yeah, I think all Windows machines are affected when not using UTC-time for the system. The Linux VM is using UTC and find in the shared directory a heartbeat file that suddenly is ~1 hour old. You have to add/set the DWORD RealTimeIsUniversal in Windows register "Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\TimeZoneInformation" to 1. ID: 48847 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1971 Credit: 159,561,899 RAC: 47,887	Message 48848 - Posted: 29 Oct 2023, 10:51:43 UTC - in response to Message 48843. Crystal Pellet wrote: ... I surely did not set the Windows flag to use UTC on that machine. Maybe computezrmle can recall to me the register entry needed. I, too, would be badly interested in this information :-) ID: 48848 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1540 Credit: 10,049,804 RAC: 1,347	Message 48849 - Posted: 29 Oct 2023, 11:08:46 UTC - in response to Message 48848. Last modified: 29 Oct 2023, 11:09:00 UTC I, too, would be badly interested in this information :-) It's in my previous post . . . ID: 48849 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2725 Credit: 300,256,616 RAC: 45,914	Message 48850 - Posted: 29 Oct 2023, 12:48:54 UTC Last modified: 29 Oct 2023, 12:49:40 UTC ommand tells Windows to use UTC instead of local time as time base: [pre]reg add "HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\TimeZoneInformation" /v RealTimeIsUniversal /d 1 /t REG_DWORD /f[/pre] See: https://github.com/BOINC/boinc/pull/4631 As for the reported errors: 1. The VM runs a "touch" command via cron once every minute to update the timestamp of the heartbeat file. 2. That file resides on a network drive mounted by the VM. 3. The "real" file resides in the shared folder of the running task, hence on a filesystem controlled by the host. 4. Vboxwrapper (running on the host) reads the heartbeat file attributes (including the timestamps) via the standard function "stat" which gets them from the underlying OS. 5. Since Windows uses local time by default the timebase changes twice a year (summertime/wintertime switch) This confuses all programs doing a simple timestamp compare. Suggestion: To avoid this - use UTC for the computer's time base (Apple and Linux do this by default) - keep the time in sync with a reliable NTP source ID: 48850 · Reply Quote