some tasks failing after about 20 minutes with heartbeat error
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1689 Credit: 103,978,756 RAC: 121,929 |
Within the past few hours, tasks on several of my computers have been failing about 20 minutes after they start. Stderr says:
2023-10-27 12:49:25 (31400): VM Heartbeat file specified, but missing.
2023-10-27 12:49:25 (31400): VM Heartbeat file specified, but missing file system status. (errno = '2')
2023-10-27 12:49:25 (31400): Powering off VM.
For examples, see:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=400934683
https://lhcathome.cern.ch/lhcathome/result.php?resultid=400934354
What's going wrong? |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,978,756 RAC: 121,929 |
In fact, the title of this post should read "ALL tasks failing after about 20 minutes ...": every task downloaded by any of my computers since late morning has failed. Faulty batch? Is no one else seeing the same thing? |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
"Faulty batch? Does no one else make the same experience ?"
No faulty batch! A cmsRun started fine on my system:
INFO:root:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_11_0_0_pre1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
INFO:root:PSS: 654406; RSS: 654564; PCPU: 63.9; PMEM: 32.1 |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,978,756 RAC: 121,929 |
"No faulty batch!" ... So what else could be the reason for all the tasks failing on various machines, at various times, after about 20 minutes? |
Joined: 15 Jun 08 Posts: 2413 Credit: 226,567,701 RAC: 131,024 |
Looks like you have network issues, either inside your LAN or to your ISP. All your computers have been affected since this morning, and it's very unlikely that the CMS vdi broke on all of them at the same time.
Did you run any (Windows) updates? Did you update your router? Did your ISP force a reconnect?
Most tasks stop when they try to start the bootstrap script, which would load additional logging functions from CVMFS. It appears that this load does not happen. As a result the watchdog service doesn't start either, hence the final message (from vboxwrapper!): VM Heartbeat file specified, but missing.
Other tasks also point to network issues, e.g.:
2023-10-27 11:53:53 (16704): Guest Log: [INFO] Testing connection to Frontier
2023-10-27 11:54:23 (16704): Guest Log: [DEBUG] Status run 1 of up to 3: 1
2023-10-27 11:54:58 (16704): Guest Log: [DEBUG] Status run 2 of up to 3: 1
2023-10-27 11:57:07 (12236): Guest Log: Ncat: Connection to 188.114.96.10 failed: Connection timed out.
2023-10-27 11:57:07 (12236): Guest Log: Ncat: Trying next address...
2023-10-27 11:57:07 (12236): Guest Log: Ncat: Network is unreachable.
2023-10-27 13:01:05 (7992): Guest Log: [ERROR] Could not get an x509 credential
2023-10-27 13:21:23 (11492): Guest Log: [ERROR] Probing /cvmfs/oasis.opensciencegrid.org... Failed!
All of that requires a network connection to CERN/Cloudflare. CERN Grafana shows that other volunteers' computers process CMS as usual.
Suggestion: Check your router and your Squid box. Restart them if necessary. After the restart, test only a few fresh tasks before you go all in. |
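To sift a long stderr for the kind of network hints listed above, something like the following Python sketch might help. The patterns are copied from the log excerpts quoted in this thread and are certainly not exhaustive; the names `NETWORK_HINTS`, `SAMPLE` and `network_errors` are made up for this illustration and are not part of BOINC or vboxwrapper.

```python
import re

# Patterns taken from the stderr excerpts quoted in this thread.
# Illustrative only: other network failures may look different.
NETWORK_HINTS = [
    r"Connection to \S+ failed",
    r"Network is unreachable",
    r"Could not get an x509 credential",
    r"Probing /cvmfs/\S+ Failed!",
]

# Sample lines copied from the posts above.
SAMPLE = """\
2023-10-27 11:53:53 (16704): Guest Log: [INFO] Testing connection to Frontier
2023-10-27 11:57:07 (12236): Guest Log: Ncat: Connection to 188.114.96.10 failed: Connection timed out.
2023-10-27 11:57:07 (12236): Guest Log: Ncat: Network is unreachable.
2023-10-27 13:01:05 (7992): Guest Log: [ERROR] Could not get an x509 credential
2023-10-27 13:21:23 (11492): Guest Log: [ERROR] Probing /cvmfs/oasis.opensciencegrid.org... Failed!
2023-10-27 12:49:25 (31400): VM Heartbeat file specified, but missing.
"""


def network_errors(stderr_text):
    """Return the lines of stderr_text that match any of the network hints."""
    compiled = [re.compile(pattern) for pattern in NETWORK_HINTS]
    return [
        line
        for line in stderr_text.splitlines()
        if any(rx.search(line) for rx in compiled)
    ]


if __name__ == "__main__":
    for hit in network_errors(SAMPLE):
        print(hit)
```

The heartbeat message itself is deliberately not in the pattern list: as explained above, it is only the final symptom reported by vboxwrapper, not the cause.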
Joined: 18 Dec 15 Posts: 1689 Credit: 103,978,756 RAC: 121,929 |
"Looks like you have network issues, either inside your LAN or to your ISP." Thanks, computezrmle, for the thorough analysis of my problem. Most probably, it was caused by numerous short internet outages on my ISP's side which I didn't catch right away. Since late yesterday afternoon, everything seems to be working again. So I'll keep my fingers crossed :-) |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,978,756 RAC: 121,929 |
This morning, when I looked through my task list, I noticed a lot of failed tasks on all of my computers. In every case they broke at about the time of the switch back from summer time to "normal" time. I remember this happening before with some other BOINC projects, like GPUGRID, but I don't think it ever happened with LHC@home; I am not sure, though. Did anyone else experience the same thing last night? If so, I can rule out network issues like the ones I seemed to have last Friday. If not, I might have a problem with my network. |
Joined: 2 May 07 Posts: 2103 Credit: 159,819,191 RAC: 123,837 |
All PCs in the LAN need one common time source. For me it's the router. Yes, you only see this twice a year. |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
I had two tasks fail with "VM Heartbeat file specified, but missing heartbeat" after about 10 hours of runtime. I think each had probably completed two cmsRuns and was busy with a third. Both errored out and were reported at 29 Oct 2023, 1:10:55 UTC. I certainly did not set the Windows flag to use UTC on that machine. Maybe computezrmle can remind me of the registry entry needed. |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,978,756 RAC: 121,929 |
That was about the time all my tasks failed: ... reported 29 Oct 2023, 1:04:17 UTC ... So apparently I was not the only one with this kind of problem. |
Joined: 2 May 07 Posts: 2103 Credit: 159,819,191 RAC: 123,837 |
CMS is the only project in LHC@Home that has trouble at these two points in the year. |
Joined: 28 Sep 04 Posts: 675 Credit: 43,665,867 RAC: 15,961 |
"CMS is the only project in LHC@Home," All running Theory tasks also failed when the clock was adjusted at the end of daylight saving time. |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
"CMS is the only project in LHC@Home," Yeah, I think all Windows machines are affected when the system is not using UTC. The Linux VM uses UTC and finds a heartbeat file in the shared directory that is suddenly ~1 hour old. You have to add/set the DWORD RealTimeIsUniversal in the Windows registry under "Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\TimeZoneInformation" to 1. |
Joined: 18 Dec 15 Posts: 1689 Credit: 103,978,756 RAC: 121,929 |
Crystal Pellet wrote: I, too, would be very interested in this information :-) |
Joined: 14 Jan 10 Posts: 1280 Credit: 8,496,817 RAC: 2,374 |
"I, too, would be badly interested in this information :-)" It's in my previous post ... |
Joined: 15 Jun 08 Posts: 2413 Credit: 226,567,701 RAC: 131,024 |
This command tells Windows to use UTC instead of local time as its time base:
reg add "HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\TimeZoneInformation" /v RealTimeIsUniversal /d 1 /t REG_DWORD /f
See: https://github.com/BOINC/boinc/pull/4631
As for the reported errors:
1. The VM runs a "touch" command via cron once every minute to update the timestamp of the heartbeat file.
2. That file resides on a network drive mounted by the VM.
3. The "real" file resides in the shared folder of the running task, hence on a filesystem controlled by the host.
4. Vboxwrapper (running on the host) reads the heartbeat file attributes (including the timestamps) via the standard function "stat", which gets them from the underlying OS.
5. Since Windows uses local time by default, the time base changes twice a year (summer time/winter time switch). This confuses any program that does a simple timestamp comparison.
Suggestion: To avoid this
- use UTC for the computer's time base (Apple and Linux do this by default)
- keep the time in sync with a reliable NTP source |
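The steps above can be sketched roughly in Python: a file is "touched", its modification time is read back with stat, and the age is compared against a limit. This is only an illustration of a simple timestamp compare, not vboxwrapper's actual code. The 600-second limit and the names `HEARTBEAT_TIMEOUT_S`, `heartbeat_age`, `is_stale` and `demo` are assumptions made for the example, and the one-hour shift simply stands in for a time base that moved at the summer/winter time switch; the exact direction and mechanism on a real host are not modelled here.

```python
import os
import tempfile
import time

# Assumed limit for illustration only; the thread does not state
# the value vboxwrapper really uses.
HEARTBEAT_TIMEOUT_S = 600


def heartbeat_age(path, now=None):
    """Seconds between 'now' and the file's modification time from os.stat()."""
    if now is None:
        now = time.time()
    return now - os.stat(path).st_mtime


def is_stale(path, now=None, timeout=HEARTBEAT_TIMEOUT_S):
    """A simple timestamp compare: stale if the file looks older than timeout."""
    return heartbeat_age(path, now) > timeout


def demo():
    """Touch a temporary heartbeat file and check it against two time bases."""
    with tempfile.TemporaryDirectory() as shared:
        path = os.path.join(shared, "heartbeat")
        with open(path, "w"):
            pass  # stands in for the guest's "touch"
        now = os.stat(path).st_mtime + 60  # one minute after the touch
        consistent = is_stale(path, now)  # same time base on both sides
        shifted = is_stale(path, now + 3600)  # reader's base moved by one hour
    return consistent, shifted


if __name__ == "__main__":
    print(demo())
```

With a consistent time base the one-minute-old file passes the check; once one side of the comparison is off by an hour, the same file looks stale. Keeping the host on UTC avoids that jump.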
©2024 CERN