Message boards : Number crunching : all tasks errored out about end of summertime last night
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50933 - Posted: 27 Oct 2024, 4:32:43 UTC
Last modified: 27 Oct 2024, 5:20:13 UTC

When I woke up this morning, I found that all tasks (ATLAS, CMS, Theory) on all of my hosts had errored out at about the time when summer time changed back to regular time.
Interestingly enough, they all failed with "VM Heartbeat file specified, but missing heartbeat."
Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415324518

I remember cases in the past (but I am not sure which project this was) where exactly the same problem occurred during these time shifts.

Did anyone else have the same experience last night? If I am the only one, there was probably some problem with my ISP when the time change took place.

For me, this is just too bad as on one host 13 Theory Herwig7 tasks had been running for 4, 5 and 6 days. So they're all gone :-(

Edit:
I have now looked up tasks of other crunchers; they all ran fine through the time shift.
So obviously my suspicion is correct that my ISP had some kind of outage :-(
ID: 50933
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1443
Credit: 9,701,584
RAC: 1,379
Message 50935 - Posted: 27 Oct 2024, 6:26:27 UTC
Last modified: 27 Oct 2024, 6:27:10 UTC

Hurray for Microsoft Windows:

2024-10-27 02:56:06 (10380): Status Report: Elapsed Time: '312000.000000'
2024-10-27 02:56:06 (10380): Status Report: CPU Time: '309770.468750'
2024-10-27 02:16:06 (10380): VM Heartbeat file specified, but missing heartbeat.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=415184357
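
For reference, the elapsed and CPU times in the status report above are given in seconds. A minimal Python sketch (not part of vboxwrapper; the variable names are made up for illustration) converts them into days and hours:

```python
from datetime import timedelta

# Values copied from the status report lines quoted above (in seconds).
elapsed_seconds = 312000.000000
cpu_seconds = 309770.468750

elapsed = timedelta(seconds=elapsed_seconds)
cpu = timedelta(seconds=cpu_seconds)

print("Elapsed time:", elapsed)
print("CPU time:    ", cpu)
```

So the task had been running for a bit more than three and a half days when it was stopped.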
ID: 50935
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50936 - Posted: 27 Oct 2024, 6:31:52 UTC
Last modified: 27 Oct 2024, 6:37:54 UTC

I happened to meet one of my neighbors, who has the same ISP, in the staircase and asked him whether by chance he had noticed any internet interruption last night.
He answered that he had been watching TV via WLAN and noticed "a disruption of a few minutes".
This is somewhat strange, as the heartbeat check interval is 20 minutes, if I interpret 'heartbeat' every 1200.000000 seconds correctly. As far as I can see, this value is set in Theory_2024_04_30_prod.xml (and in similar files for the other subprojects).
So at least some of the tasks could have been affected, given the above time indications. But I am surprised that ALL tasks on all my hosts were affected. The only explanation for me would be that my neighbor was wrong when he talked about "a few minutes", and that in reality the interruption lasted >20 minutes. Or, who knows, the heartbeat check is actually made at intervals of <20 minutes.

At any rate, in order to prevent tasks which have already run for several days from failing in the future just because of this heartbeat check, my question is whether I can simply increase the 20-minute interval by changing this value in Theory_2024_04_30_prod.xml.

Edit: in view of what CP said minutes ago (message 50935), my tasks obviously did not fail because of the internet interruption "of a few minutes" (and my neighbor may be right about this time span), but rather because they were crunched on Windows :-(
ID: 50936
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50937 - Posted: 27 Oct 2024, 6:34:36 UTC - in response to Message 50935.  
Last modified: 27 Oct 2024, 6:52:26 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=415184357
Oh, how nice: 3 days 15 hours of work for nothing :-(

So when I looked up a few colleagues' tasks I probably didn't notice that all of them were crunched NOT on Windows :-(
ID: 50937
maeax
Joined: 2 May 07
Posts: 2262
Credit: 175,581,097
RAC: 1,442
Message 50938 - Posted: 27 Oct 2024, 7:23:47 UTC
Last modified: 27 Oct 2024, 7:29:38 UTC

What do the event messages in Windows show for this time?
Yes, it's really no fun. The EU will cancel this time-change nonsense.
There was a petition two years ago, with no success.

Oh, I see one CMS task with this problem.
Computer ID 10795955
Run time 5 hours 40 min 37 sec
CPU time 14 hours 50 min 14 sec

2024-10-27 02:59:24 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:59:57 (4468): Preference change detected
2024-10-27 02:59:57 (4468): Setting CPU throttle for VM. (100%)
2024-10-27 02:59:57 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:02:33 (4468): Preference change detected
2024-10-27 02:02:33 (4468): Setting CPU throttle for VM. (100%)
2024-10-27 02:02:33 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:02:47 (4468): VM Heartbeat file specified, but missing heartbeat.
2024-10-27 02:02:47 (4468): Powering off VM.
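
The log above jumps from 02:59:57 back to 02:02:33 in local time. As an illustration only, the following Python sketch scans log lines for such backward steps. The helper name `find_backward_jumps` is hypothetical, and the sample lines are copied from the log above.

```python
from datetime import datetime

# Three lines copied from the log above, in their original order.
LOG_LINES = [
    "2024-10-27 02:59:57 (4468): Setting CPU throttle for VM. (100%)",
    "2024-10-27 02:02:33 (4468): Preference change detected",
    "2024-10-27 02:02:47 (4468): VM Heartbeat file specified, but missing heartbeat.",
]


def find_backward_jumps(lines):
    """Return (previous, current) timestamp pairs where the local clock went backwards."""
    jumps = []
    previous = None
    for line in lines:
        # The first 19 characters hold the local timestamp, e.g. "2024-10-27 02:59:57".
        stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        if previous is not None and stamp < previous:
            jumps.append((previous, stamp))
        previous = stamp
    return jumps


jumps = find_backward_jumps(LOG_LINES)
for before, after in jumps:
    print(f"Local clock stepped back by {before - after} (from {before} to {after})")
```

The "missing heartbeat" message follows only 14 seconds after the first log line written with the new local time.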
ID: 50938
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1194
Credit: 60,769,786
RAC: 71,071
Message 50939 - Posted: 27 Oct 2024, 7:37:27 UTC

Mine will be on Sun, Nov 3, 2024 (Pacific Time), not on Sun, Oct 27, 2024 (Central European Time).
We have this every year here.
ID: 50939
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1443
Credit: 9,701,584
RAC: 1,379
Message 50940 - Posted: 27 Oct 2024, 8:05:39 UTC - in response to Message 50938.  

maeax wrote:
"What do the event messages in Windows show for this time?"

LHC@home 27 Oct 01:07:17 Project requested delay of 6 seconds
LHC@home 27 Oct 02:16:22 Computation for task Theory_2794-3244360-278_0 finished
LHC@home 27 Oct 02:16:22 Output file Theory_2794-3244360-278_0_r1840616054_result for task Theory_2794-3244360-278_0 absent
LHC@home 27 Oct 02:17:49 Sending scheduler request: To report completed tasks.
LHC@home 27 Oct 02:17:49 Reporting 1 completed tasks
ID: 50940
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50941 - Posted: 27 Oct 2024, 9:56:16 UTC

What I am wondering is: what kind of change was made to LHC tasks (in fact, to all of the subprojects) between the previous time change in spring 2024 (from winter time to summer time) and now? I don't remember this problem occurring in spring, nor last fall (change from summer time to winter time), nor at any time before that (except for several years ago, and I don't even know whether it was LHC or another project that was affected).

Does anyone have any idea what happened?
Or, in other words: if nobody straightens this problem out, we will have it come up again next spring.
ID: 50941
maeax
Joined: 2 May 07
Posts: 2262
Credit: 175,581,097
RAC: 1,442
Message 50942 - Posted: 27 Oct 2024, 10:06:31 UTC - in response to Message 50941.  

WCG (Krembil) does not show this problem.
I am running most of my CPUs on WCG at the moment.
ID: 50942
Harri Liljeroos
Joined: 28 Sep 04
Posts: 757
Credit: 53,072,083
RAC: 41,728
Message 50943 - Posted: 27 Oct 2024, 10:11:36 UTC

I have two hosts here at LHC@Home, and both went through this time change last night. No problems with the tasks; they continued running normally. The hosts run very different versions of BOINC. The win10 machine has BOINC 7.16.5 and VM 5.2.44. The win11 machine has BOINC 8.0.2 and VM 7.0.6.

I remember having the same problems as described above in previous years, but I don't remember if it was in spring or fall.
ID: 50943
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50944 - Posted: 27 Oct 2024, 10:19:17 UTC - in response to Message 50942.  

maeax wrote:
"WCG (Krembil) does not show this problem.
I am running most of my CPUs on WCG at the moment."

I have GPUGRID tasks running on 4 hosts; they all survived the time change without any problem.
ID: 50944
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1443
Credit: 9,701,584
RAC: 1,379
Message 50945 - Posted: 27 Oct 2024, 10:24:16 UTC - in response to Message 50941.  

Erich56 wrote:
"Does anyone have any idea what happened?
Or, in other words: if nobody straightens this problem out, we will have it come up again next spring."

As far as I am aware:
The wrapper checks every 1200 seconds (in our case) whether the heartbeat file coming from the VM is not older than 20 minutes.
If it is older, the VM will be stopped (or the wrapper will at least try to stop it).
I suppose that in spring the heartbeat time stamp is later than the last check, so there is no kill.
Interestingly, ATLAS doesn't use this mechanism. (You could check whether you had ATLAS errors on Windows because of this.)
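
To illustrate the supposition above, here is a minimal Python sketch. It is not the vboxwrapper code: it assumes a simplified check ("was the heartbeat written after the last check?") and uses fixed UTC offsets for CEST and CET. All function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")  # summer time
CET = timezone(timedelta(hours=1), "CET")    # regular (winter) time


def heartbeat_seen(heartbeat_stamp, last_check):
    """Assumed check: the heartbeat counts only if it is newer than the last check."""
    return heartbeat_stamp > last_check


def wall_clock(moment):
    """Drop the UTC offset and keep only what a local clock would display."""
    return moment.replace(tzinfo=None)


def compare(last_check, zone_after_switch):
    """Return (wall-clock result, offset-aware result) for a heartbeat 20 real minutes later."""
    heartbeat = (last_check + timedelta(minutes=20)).astimezone(zone_after_switch)
    naive = heartbeat_seen(wall_clock(heartbeat), wall_clock(last_check))
    aware = heartbeat_seen(heartbeat, last_check)
    return naive, aware


# Fall: last check at 02:50 CEST, heartbeat written at 02:10 CET.
fall = compare(datetime(2024, 10, 27, 2, 50, tzinfo=CEST), CET)
# Spring: last check at 01:50 CET, heartbeat written at 03:10 CEST.
spring = compare(datetime(2024, 3, 31, 1, 50, tzinfo=CET), CEST)

print("fall  :", fall)
print("spring:", spring)
```

With wall-clock times, the fall case looks like a missing heartbeat although one was written, while the spring case passes. Comparing offset-aware (or UTC) times gives the expected result in both cases.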
ID: 50945
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50946 - Posted: 27 Oct 2024, 10:25:01 UTC - in response to Message 50943.  

Harri Liljeroos wrote:
"I have two hosts here at LHC@Home, and both went through this time change last night. No problems with the tasks; they continued running normally. The hosts run very different versions of BOINC. The win10 machine has BOINC 7.16.5 and VM 5.2.44. The win11 machine has BOINC 8.0.2 and VM 7.0.6. ..."

My 18 hosts run various versions of BOINC and Oracle VM, a few older ones and more newer ones; yet all of them were affected :-(
ID: 50946
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50947 - Posted: 27 Oct 2024, 10:29:13 UTC - in response to Message 50945.  

Erich56 wrote:
"Does anyone have any idea what happened?
Or, in other words: if nobody straightens this problem out, we will have it come up again next spring."
Crystal Pellet wrote:
"As far as I am aware:
The wrapper checks every 1200 seconds (in our case) whether the heartbeat file coming from the VM is not older than 20 minutes.
If it is older, the VM will be stopped (or the wrapper will at least try to stop it).
I suppose that in spring the heartbeat time stamp is later than the last check, so there is no kill.
Interestingly, ATLAS doesn't use this mechanism. (You could check whether you had ATLAS errors on Windows because of this.)"

Yes, you are right: ATLAS did NOT fail (so my initial statement that ALL subprojects were affected was wrong). However, all CMS tasks failed (as well as the Theory tasks).
ID: 50947
Harri Liljeroos
Joined: 28 Sep 04
Posts: 757
Credit: 53,072,083
RAC: 41,728
Message 50948 - Posted: 27 Oct 2024, 12:59:52 UTC - in response to Message 50943.  

Harri Liljeroos wrote:
"I have two hosts here at LHC@Home, and both went through this time change last night. No problems with the tasks; they continued running normally. The hosts run very different versions of BOINC. The win10 machine has BOINC 7.16.5 and VM 5.2.44. The win11 machine has BOINC 8.0.2 and VM 7.0.6.

I remember having the same problems as described above in previous years, but I don't remember if it was in spring or fall."

I was only running ATLAS tasks, so they worked correctly in this situation.
ID: 50948
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50950 - Posted: 27 Oct 2024, 14:21:46 UTC - in response to Message 50945.  

Crystal Pellet wrote:
"As far as I am aware:
The wrapper checks every 1200 seconds (in our case) whether the heartbeat file coming from the VM is not older than 20 minutes.
If it is older, the VM will be stopped (or the wrapper will at least try to stop it).
I suppose that in spring the heartbeat time stamp is later than the last check, so there is no kill.
Interestingly, ATLAS doesn't use this mechanism. (You could check whether you had ATLAS errors on Windows because of this.)"

So if this is indeed how the wrapper for Theory and CMS works, then each fall the running tasks are bound to fail. Too bad if this happens to tasks which have already been running for several days, as in my case.
But, as I asked before: what has happened to the wrapper since last year? Because last year, this problem definitely did NOT occur.

In any case, I will no longer download Theory tasks as long as Herwig7 is in the basket. Too many failures have happened here; out of a total of about 20 tasks, only 3 succeeded :-( First, there were tasks which were stopped exactly at the 10-day limit; then I wanted to terminate some tasks gracefully before they reached the 10-day limit. This did not work either; the tasks ended up invalid.
And then there is what happened last night, when 13 tasks with runtimes between 4 and 6 days failed due to this heartbeat error. This alone wasted about 1,500 hours of CPU time for nothing, not to mention that electricity prices here have about tripled since the beginning of the Ukraine war.
So the conclusion for me is: keep my fingers off Theory, there are too many uncertainties :-(
ID: 50950
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1443
Credit: 9,701,584
RAC: 1,379
Message 50951 - Posted: 27 Oct 2024, 15:11:37 UTC - in response to Message 50950.  
Last modified: 27 Oct 2024, 15:39:44 UTC

Erich56 wrote:
"So if this is indeed how the wrapper for Theory and CMS works, then each fall the running tasks are bound to fail. Too bad if this happens to tasks which have already been running for several days, as in my case.
But, as I asked before: what has happened to the wrapper since last year? Because last year, this problem definitely did NOT occur."

And all this happens even though I have set Universal Time (UTC) in the Windows registry.
So something is not working as it should.
We should get rid of this stupid switching between summer time and winter time.
It is an old-fashioned relic from the first oil crisis of the 1970s.
Maybe it would be even better to get rid of the vboxwrapper heartbeat mechanism that checks the sync between host and VM.
Over the years it has caused us more trouble than benefit.
At least I'm now testing without the heartbeat check, with no issues so far, including a time change of one hour backwards and, an hour later, one hour forwards, with Theory, CMS and ATLAS tasks running.
This problem comes back every year, over and over again, but now with those long-running Herwig7 tasks it hurts more.
2018-10-28 03:56:08 (6268): Guest Log: [INFO] MCPlots JobID: 47001352 in slot1

2018-10-28 03:07:59 (6268): VM Heartbeat file specified, but missing heartbeat.
ID: 50951
Erich56
Joined: 18 Dec 15
Posts: 1862
Credit: 129,340,284
RAC: 116,437
Message 50952 - Posted: 27 Oct 2024, 16:57:45 UTC - in response to Message 50951.  

Crystal Pellet wrote:
"...
We should get rid of this stupid summer-/wintertime switching.
...
Maybe it's even better to get rid of vboxwrapper heartbeat mechanism checking the sync between host and VM.
Over the years we had more troubles with it instead of benefit.
...
This problem comes every year over and over again, but now with those longrunning Herwig7 it hurts more."

I fully agree (I guess nothing else would be expected :-)
ID: 50952

©2025 CERN