Thread 'all tasks errored out about end of summertime last night'

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50933 - Posted: 27 Oct 2024, 4:32:43 UTC Last modified: 27 Oct 2024, 5:20:13 UTC
When I woke up this morning, I found that on all of my hosts all tasks (ATLAS, CMS, Theory) had errored out at about the time when summer time was changed back to standard time. Interestingly enough, they all failed with "VM Heartbeat file specified, but missing heartbeat."
Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415324518
I remember cases in the past (but I am not sure which project this was) where exactly the same problem occurred during these time shifts. Did anyone else have the same experience last night? If I am the only one, there was probably some problem with my ISP when the time change took place.
For me, this is just too bad, as on one host 13 Theory Herwig7 tasks had been running for 4, 5 and 6 days. So they're all gone :-(
Edit: I have now looked up tasks of other crunchers; they all went well during the time shift. So my suspicion that my ISP had some kind of outage is apparently correct :-(

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1493 Credit: 9,987,809 RAC: 889
Message 50935 - Posted: 27 Oct 2024, 6:26:27 UTC Last modified: 27 Oct 2024, 6:27:10 UTC
Hurray for Microsoft Windows:
2024-10-27 02:56:06 (10380): Status Report: Elapsed Time: '312000.000000'
2024-10-27 02:56:06 (10380): Status Report: CPU Time: '309770.468750'
2024-10-27 02:16:06 (10380): VM Heartbeat file specified, but missing heartbeat.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415184357

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50936 - Posted: 27 Oct 2024, 6:31:52 UTC Last modified: 27 Oct 2024, 6:37:54 UTC
I happened to meet one of my neighbors, who has the same ISP, in the staircase and asked him whether he had by chance noticed any internet interruption last night. He answered that he had been watching TV via WLAN and noticed "a disruption of a few minutes".
This is somewhat strange, as the heartbeat check interval is 20 minutes, if I interpret "'heartbeat' every 1200.000000 seconds" correctly. As far as I can see, this value is set in Theory_2024_04_30_prod.xml (and in similar files for the other subprojects).
So at least some of the tasks could have been affected, given the above time indications. But I am surprised that ALL tasks on all my hosts were affected. The only explanation I can think of is that my neighbor was wrong when he talked about "a few minutes" and that in reality the interruption lasted more than 20 minutes. Or, who knows, the heartbeat check is actually made at intervals shorter than 20 minutes.
At any rate, to avoid tasks that have already run for several days failing in the future just because of this heartbeat check, my question is whether I can simply increase the 20-minute interval by changing this value in Theory_2024_04_30_prod.xml.
Edit: in view of what CP said minutes ago (message 50935), my tasks obviously did not fail because of the internet interruption "of a few minutes" (and my neighbor may be right about this time span), but rather because they were crunched on Windows :-(

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50937 - Posted: 27 Oct 2024, 6:34:36 UTC - in response to Message 50935. Last modified: 27 Oct 2024, 6:52:26 UTC
Crystal Pellet wrote: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415184357
Oh, how nice: 3 days 15 minutes' work for nothing :-(
So when I looked up a few colleagues' tasks, I probably didn't notice that NONE of them had been crunched on Windows :-(

maeax Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 393
Message 50938 - Posted: 27 Oct 2024, 7:23:47 UTC Last modified: 27 Oct 2024, 7:29:38 UTC
What do the event messages show at this time in Windows?
Yes, it's really no fun; the EU wants to abolish this time-change nonsense. There was a petition two years ago, with no success.
Oh, I see one CMS task with this problem.
Computer ID 10795955
Run time 5 hours 40 min 37 sec
CPU time 14 hours 50 min 14 sec
2024-10-27 02:59:24 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:59:57 (4468): Preference change detected
2024-10-27 02:59:57 (4468): Setting CPU throttle for VM. (100%)
2024-10-27 02:59:57 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:02:33 (4468): Preference change detected
2024-10-27 02:02:33 (4468): Setting CPU throttle for VM. (100%)
2024-10-27 02:02:33 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:02:47 (4468): VM Heartbeat file specified, but missing heartbeat.
2024-10-27 02:02:47 (4468): Powering off VM.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1261 Credit: 92,498,201 RAC: 106,410
Message 50939 - Posted: 27 Oct 2024, 7:37:27 UTC
Mine will be on Sun, Nov 3, 2024 (Pacific Time), not on Sun, Oct 27, 2024 (Central European Time).
We have this every year here.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1493 Credit: 9,987,809 RAC: 889
Message 50940 - Posted: 27 Oct 2024, 8:05:39 UTC - in response to Message 50938.
maeax wrote: "What do the event messages show at this time in Windows?"
LHC@home 27 Oct 01:07:17 Project requested delay of 6 seconds
LHC@home 27 Oct 02:16:22 Computation for task Theory_2794-3244360-278_0 finished
LHC@home 27 Oct 02:16:22 Output file Theory_2794-3244360-278_0_r1840616054_result for task Theory_2794-3244360-278_0 absent
LHC@home 27 Oct 02:17:49 Sending scheduler request: To report completed tasks.
LHC@home 27 Oct 02:17:49 Reporting 1 completed tasks

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50941 - Posted: 27 Oct 2024, 9:56:16 UTC
What I am wondering is: what kind of change was made to LHC tasks (in fact, to all of the subprojects) between the previous time change in spring 2024 (from winter time to summer time) and now? I don't remember this problem occurring in spring, nor last fall (change from summer time to winter time), nor at any time before (except for several years ago, and I don't even know whether it was LHC or another project that was affected).
Does anyone have any idea what happened? Or, in other words: if nobody straightens this problem out, we will have it come up again next spring.

maeax Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 393
Message 50942 - Posted: 27 Oct 2024, 10:06:31 UTC - in response to Message 50941.
WCG (Krembil) does not show this problem. I'm running most of my CPUs on WCG at the moment.

Harri Liljeroos Joined: 28 Sep 04 Posts: 793 Credit: 63,520,130 RAC: 23,522
Message 50943 - Posted: 27 Oct 2024, 10:11:36 UTC
I have two hosts here at LHC@Home, and this time change happened for them last night. No problems with the tasks; they continued running normally. The hosts run very different versions of Boinc. The win10 machine has Boinc 7.16.5 and VM 5.2.44. The win11 machine has Boinc 8.0.2 and VM 7.0.6.
I remember having the same problems as described above in previous years, but I don't remember whether it was in spring or fall.

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50944 - Posted: 27 Oct 2024, 10:19:17 UTC - in response to Message 50942.
maeax wrote: "WCG (Krembil) does not show this problem. I'm running most of my CPUs on WCG at the moment."
I'm running GPUGRID tasks on 4 hosts; they all survived the time change without any problem.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1493 Credit: 9,987,809 RAC: 889
Message 50945 - Posted: 27 Oct 2024, 10:24:16 UTC - in response to Message 50941.
Erich56 wrote: "Does anyone have any idea what happened? Or, in other words: if nobody straightens this problem out, we will have it come up again next spring."
As far as I am aware: the wrapper checks every 1200 seconds (in our case) whether the heartbeat file coming from the VM is not older than 20 minutes. If it is older, the VM will be stopped (or the wrapper will try to stop it).
I suppose that in spring the heartbeat time stamp is later than the last check, so no kill.
And interestingly, ATLAS doesn't use this mechanism. (You could check whether you had ATLAS errors on Windows because of this.)
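
A minimal sketch of why such a check could fire in fall but not in spring. It assumes (this is not confirmed anywhere in the thread) that the comparison is made on local wall-clock timestamps and that a heartbeat only counts if its timestamp has moved forward. The helper names `wall_clock`, `heartbeat_advanced` and `simulate` are hypothetical and are not taken from vboxwrapper.

```python
from datetime import datetime, timedelta, timezone

CHECK_INTERVAL = timedelta(seconds=1200)  # 20 minutes, as quoted in the thread


def wall_clock(utc_time, utc_offset_hours):
    """Naive local wall-clock time for a given UTC offset (hypothetical helper)."""
    return (utc_time + timedelta(hours=utc_offset_hours)).replace(tzinfo=None)


def heartbeat_advanced(previous_stamp, current_stamp):
    """Assumed check: the heartbeat counts only if its timestamp moved forward."""
    return current_stamp > previous_stamp


def simulate(first_check_utc, offset_before, offset_after):
    """Compare two checks 20 minutes apart, once in UTC and once in local time."""
    second_check_utc = first_check_utc + CHECK_INTERVAL
    in_utc = heartbeat_advanced(first_check_utc, second_check_utc)
    in_local = heartbeat_advanced(
        wall_clock(first_check_utc, offset_before),
        wall_clock(second_check_utc, offset_after),
    )
    return in_utc, in_local


# Fall 2024: CEST (UTC+2) ends at 01:00 UTC on 27 Oct,
# so 02:50 local time is followed by 02:10 local time.
fall = simulate(datetime(2024, 10, 27, 0, 50, tzinfo=timezone.utc), 2, 1)

# Spring 2024: CET (UTC+1) ends at 01:00 UTC on 31 Mar,
# so 01:50 local time is followed by 03:10 local time.
spring = simulate(datetime(2024, 3, 31, 0, 50, tzinfo=timezone.utc), 1, 2)

print("fall:   UTC ok =", fall[0], "| local ok =", fall[1])
print("spring: UTC ok =", spring[0], "| local ok =", spring[1])
```

Under these assumptions the UTC comparison passes in both seasons, while the local-time comparison fails only when the clock is set back, which would match the pattern reported here: failures in fall, none in spring.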

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50946 - Posted: 27 Oct 2024, 10:25:01 UTC - in response to Message 50943.
Harri Liljeroos wrote: "I have two hosts here at LHC@Home, and this time change happened for them last night. No problems with the tasks; they continued running normally. The hosts run very different versions of Boinc. The win10 machine has Boinc 7.16.5 and VM 5.2.44. The win11 machine has Boinc 8.0.2 and VM 7.0.6. ..."
My 18 hosts run various versions of Boinc and Oracle VM, a few older ones and more newer ones; yet all of them were affected :-(

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50947 - Posted: 27 Oct 2024, 10:29:13 UTC - in response to Message 50945.
Erich56 wrote: "Does anyone have any idea what happened? Or, in other words: if nobody straightens this problem out, we will have it come up again next spring."
Crystal Pellet wrote: "As far as I am aware: the wrapper checks every 1200 seconds (in our case) whether the heartbeat file coming from the VM is not older than 20 minutes. If it is older, the VM will be stopped (or the wrapper will try to stop it). I suppose that in spring the heartbeat time stamp is later than the last check, so no kill. And interestingly, ATLAS doesn't use this mechanism. (You could check whether you had ATLAS errors on Windows because of this.)"
Yes, you are right: ATLAS did NOT fail (so my initial statement that ALL subprojects were affected was wrong).
However, all CMS tasks failed (in addition to the Theory tasks).

Harri Liljeroos Joined: 28 Sep 04 Posts: 793 Credit: 63,520,130 RAC: 23,522
Message 50948 - Posted: 27 Oct 2024, 12:59:52 UTC - in response to Message 50943.
Harri Liljeroos wrote: "I have two hosts here at LHC@Home, and this time change happened for them last night. No problems with the tasks; they continued running normally. The hosts run very different versions of Boinc. The win10 machine has Boinc 7.16.5 and VM 5.2.44. The win11 machine has Boinc 8.0.2 and VM 7.0.6. I remember having the same problems as described above in previous years, but I don't remember whether it was in spring or fall."
I was only running ATLAS tasks, so those worked correctly in this situation.

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50950 - Posted: 27 Oct 2024, 14:21:46 UTC - in response to Message 50945.
Crystal Pellet wrote: "As far as I am aware: the wrapper checks every 1200 seconds (in our case) whether the heartbeat file coming from the VM is not older than 20 minutes. If it is older, the VM will be stopped (or the wrapper will try to stop it). I suppose that in spring the heartbeat time stamp is later than the last check, so no kill. And interestingly, ATLAS doesn't use this mechanism. (You could check whether you had ATLAS errors on Windows because of this.)"
So if this is indeed the way the wrapper for Theory and CMS works, then each fall the running tasks are bound to fail. Too bad if this happens to tasks which have already been running for several days, as in my case.
But, as I asked before: what has happened to the wrapper since last year? Because last year, this problem definitely did NOT occur.
In any case, I will no longer download Theory tasks as long as Herwig7 is in the basket. Too many failures have happened here; out of a total of about 20 tasks, only 3 succeeded :-(
First, there were tasks that were stopped exactly at the 10-day limit; then I wanted to terminate some tasks gracefully before they reached the 10-day limit, which did not work either: the tasks ended up invalid.
And then there is what happened last night, when 13 tasks with runtimes between 4 and 6 days failed due to this heartbeat error. This alone wasted about 1,500 hours of CPU time, not to mention that electricity prices here have roughly tripled since the beginning of the Ukraine war.
So the conclusion for me is: keep my fingers off Theory, there are too many uncertainties :-(

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1493 Credit: 9,987,809 RAC: 889
Message 50951 - Posted: 27 Oct 2024, 15:11:37 UTC - in response to Message 50950. Last modified: 27 Oct 2024, 15:39:44 UTC
Erich56 wrote: "So if this is indeed the way the wrapper for Theory and CMS works, then each fall the running tasks are bound to fail. Too bad if this happens to tasks which have already been running for several days, as in my case. But, as I asked before: what has happened to the wrapper since last year? Because last year, this problem definitely did NOT occur."
And all this happens even though I've set UTC as the universal time in the Windows registry. So something is not working as it should.
We should get rid of this stupid summer/winter time switching. It is an old-fashioned relic from the first oil crisis of the 1970s.
Maybe it would be even better to get rid of the vboxwrapper heartbeat mechanism that checks the sync between host and VM. Over the years it has caused us more trouble than benefit.
At least, I'm now testing without the heartbeat check and have had no issues so far, including a time change one hour backwards and, an hour later, one hour forwards, with Theory, CMS and ATLAS tasks running.
This problem comes back every year, over and over again, but now with those long-running Herwig7 tasks it hurts more.
2018-10-28 03:56:08 (6268): Guest Log: [INFO] MCPlots JobID: 47001352 in slot1
2018-10-28 03:07:59 (6268): VM Heartbeat file specified, but missing heartbeat.

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,400,468 RAC: 103,860
Message 50952 - Posted: 27 Oct 2024, 16:57:45 UTC - in response to Message 50951.
Crystal Pellet wrote: "... We should get rid of this stupid summer/winter time switching. ... Maybe it would be even better to get rid of the vboxwrapper heartbeat mechanism that checks the sync between host and VM. Over the years it has caused us more trouble than benefit. ... This problem comes back every year, over and over again, but now with those long-running Herwig7 tasks it hurts more."
I fully agree (I guess nothing else would be expected :-)