Message boards : CMS Application : some tasks failing after about 20 minutes with heartbeat error
Message board moderation

To post messages, you must log in.

AuthorMessage
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,978,756
RAC: 121,929
Message 48825 - Posted: 27 Oct 2023, 10:57:37 UTC

within the past few hours, I am experiencing failing tasks after about 20 minutes from start, on different computers.
Stderr says:

2023-10-27 12:49:25 (31400): VM Heartbeat file specified, but missing.
2023-10-27 12:49:25 (31400): VM Heartbeat file specified, but missing file system status. (errno = '2')
2023-10-27 12:49:25 (31400): Powering off VM.

Examples see here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=400934683
https://lhcathome.cern.ch/lhcathome/result.php?resultid=400934354

what's going wrong?
ID: 48825 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,978,756
RAC: 121,929
Message 48826 - Posted: 27 Oct 2023, 12:53:12 UTC - in response to Message 48825.  

in fact, the title of this posting should read "ALL tasks failing after about 20 minutes ..."
Which means that all tasks which were downloaded by all of my computers since late morning failed.
Faulty batch? Does no one else make the same experience ?
ID: 48826 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 48827 - Posted: 27 Oct 2023, 14:20:34 UTC - in response to Message 48826.  

Faulty batch? Does no one else make the same experience ?
No faulty batch!

A cmsRun started fine on my system.

INFO:root:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_11_0_0_pre1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
INFO:root:PSS: 654406; RSS: 654564; PCPU: 63.9; PMEM: 32.1
ID: 48827 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,978,756
RAC: 121,929
Message 48828 - Posted: 27 Oct 2023, 15:09:28 UTC - in response to Message 48827.  

No faulty batch! ...
so what else could then be the reason for all the tasks failing on various machines at various times after about 20 minutes ?
ID: 48828 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2413
Credit: 226,567,701
RAC: 131,024
Message 48829 - Posted: 27 Oct 2023, 15:37:49 UTC - in response to Message 48828.  

Looks like you have network issues, either inside your LAN or to your ISP.


All your computers are affected since this morning.
It's very unlikely that the CMS vdi is broken on all computers at the same time.

Did you run any (windows) updates?
Did you do an update on your router?
Did your ISP force a reconnect?


Most tasks stop when they try to start the bootstrap script which would load additional logging functions from CVMFS.
It appears that the latter does not happen.
As a result the watchdog service also doesn't start, hence finally the message (from vboxwrapper!) is:
VM Heartbeat file specified, but missing.

Other tasks also point out network issues, e.g.:
2023-10-27 11:53:53 (16704): Guest Log: [INFO] Testing connection to Frontier
2023-10-27 11:54:23 (16704): Guest Log: [DEBUG] Status run 1 of up to 3: 1
2023-10-27 11:54:58 (16704): Guest Log: [DEBUG] Status run 2 of up to 3: 1

2023-10-27 11:57:07 (12236): Guest Log: Ncat: Connection to 188.114.96.10 failed: Connection timed out.
2023-10-27 11:57:07 (12236): Guest Log: Ncat: Trying next address...
2023-10-27 11:57:07 (12236): Guest Log: Ncat: Network is unreachable.

2023-10-27 13:01:05 (7992): Guest Log: [ERROR] Could not get an x509 credential

2023-10-27 13:21:23 (11492): Guest Log: [ERROR] Probing /cvmfs/oasis.opensciencegrid.org... Failed!


All of that requires a network connection to CERN/Cloudflare.
CERN Grafana shows that other volunteer's computers process CMS as usual.


Suggestion:
Check your router and your Squid box.
Restart them if necessary.
After the restart test only a few fresh tasks before you go all in.
ID: 48829 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,978,756
RAC: 121,929
Message 48834 - Posted: 28 Oct 2023, 5:46:44 UTC - in response to Message 48829.  

Looks like you have network issues, either inside your LAN or to your ISP.
Thanks, computezrmle, for the thorough analysis of my problem.
Most probably, it was caused by numerous short internet outages from side of my ISP which I didn't catch right away. Since late afternoon of yesterday, everything seems to be right again. So I'll keep my fingers crossed :-)
ID: 48834 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,978,756
RAC: 121,929
Message 48841 - Posted: 29 Oct 2023, 6:29:11 UTC

this morning, when looking up my tasks list, I noticed that there were a lot of failing tasks on all of my computers - in all cases they broke at about the time of the shift back from summer time to "normal" time.
I remember that this happened before with some other BOINC projects, like GPUGRID, but I don't think it ever happened with LHC@home - however, I am not sure.

Did someone else experience the same thing last night?
If yes, I can preclude any network issues like I seemed to have last Friday. If not, I might have a problem with my network.
ID: 48841 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2103
Credit: 159,819,191
RAC: 123,837
Message 48842 - Posted: 29 Oct 2023, 7:16:12 UTC - in response to Message 48841.  

All PC need one timestamp in the LAN.
For me it's the router.
Yes, you see this only two times in the year.
ID: 48842 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 48843 - Posted: 29 Oct 2023, 7:39:27 UTC
Last modified: 29 Oct 2023, 7:42:41 UTC

I had two error tasks because of: "VM Heartbeat file specified, but missing heartbeat" after about 10 hours runtime.
Probably 2 cmsRuns each have been done and a third one busy, I think.
Both errored out and reported at 29 Oct 2023, 1:10:55 UTC
I surely did not set the Windows flag to use UTC on that machine.
Maybe computezrmle can recall to me the register entry needed.
ID: 48843 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,978,756
RAC: 121,929
Message 48844 - Posted: 29 Oct 2023, 7:49:55 UTC - in response to Message 48843.  


Both errored out and reported at 29 Oct 2023, 1:10:55 UTC
this was about the time all my tasks failed:
...
reported 29 Oct 2023, 1:04:17 UTC
...
so seemingly I was not the only one with this kind of problem.
ID: 48844 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2103
Credit: 159,819,191
RAC: 123,837
Message 48845 - Posted: 29 Oct 2023, 8:14:02 UTC

CMS is the only project in LHC@Home,
having some trouble at this two timestamps in the year.
ID: 48845 · Report as offensive     Reply Quote
Harri Liljeroos
Avatar

Send message
Joined: 28 Sep 04
Posts: 675
Credit: 43,665,867
RAC: 15,961
Message 48846 - Posted: 29 Oct 2023, 9:41:32 UTC - in response to Message 48845.  

CMS is the only project in LHC@Home,
having some trouble at this two timestamps in the year.

All running Theory tasks also failed when the clock was adjusted at the end of daylight saving time.
ID: 48846 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 48847 - Posted: 29 Oct 2023, 10:20:28 UTC - in response to Message 48846.  
Last modified: 29 Oct 2023, 11:11:40 UTC

CMS is the only project in LHC@Home,
having some trouble at this two timestamps in the year.

All running Theory tasks also failed when the clock was adjusted at the end of daylight saving time.

Yeah, I think all Windows machines are affected when not using UTC-time for the system.
The Linux VM is using UTC and find in the shared directory a heartbeat file that suddenly is ~1 hour old.

You have to add/set the DWORD RealTimeIsUniversal in Windows register "Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\TimeZoneInformation" to 1.
ID: 48847 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1689
Credit: 103,978,756
RAC: 121,929
Message 48848 - Posted: 29 Oct 2023, 10:51:43 UTC - in response to Message 48843.  

Crystal Pellet wrote:

...
I surely did not set the Windows flag to use UTC on that machine.
Maybe computezrmle can recall to me the register entry needed.
I, too, would be badly interested in this information :-)
ID: 48848 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 48849 - Posted: 29 Oct 2023, 11:08:46 UTC - in response to Message 48848.  
Last modified: 29 Oct 2023, 11:09:00 UTC

I, too, would be badly interested in this information :-)
It's in my previous post . . .
ID: 48849 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2413
Credit: 226,567,701
RAC: 131,024
Message 48850 - Posted: 29 Oct 2023, 12:48:54 UTC
Last modified: 29 Oct 2023, 12:49:40 UTC

This command tells Windows to use UTC instead of local time as time base:
reg add "HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\TimeZoneInformation" /v RealTimeIsUniversal /d 1 /t REG_DWORD /f

See:
https://github.com/BOINC/boinc/pull/4631


As for the reported errors:
1. The VM runs a "touch" command via cron once every minute to update the timestamp of the heartbeat file.
2. That file resides on a network drive mounted by the VM.
3. The "real" file resides in the shared folder of the running task, hence on a filesystem controlled by the host.
4. Vboxwrapper (running on the host) reads the heartbeat file attributes (including the timestamps) via the standard function "stat" which gets them from the underlying OS.
5. Since Windows uses local time by default the timebase changes twice a year (summertime/wintertime switch)

This confuses all programs doing a simple timestamp compare.


Suggestion:
To avoid this
- use UTC for the computer's time base (Apple and Linux do this by default)
- keep the time in sync with a reliable NTP source
ID: 48850 · Report as offensive     Reply Quote

Message boards : CMS Application : some tasks failing after about 20 minutes with heartbeat error


©2024 CERN