All tasks errored out around the end of summer time last night
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
When I woke up this morning, I found that on all of my hosts all tasks (ATLAS, CMS, Theory) had errored out at about the time when summer time changed back to standard time. Interestingly, they all failed with "VM Heartbeat file specified, but missing heartbeat." Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415324518 I remember cases in the past (though I am not sure which project it was) where exactly the same problem occurred during these time shifts. Did anyone else experience the same last night? If I am the only one, there was probably some problem with my ISP when the time change took place. For me this is really too bad, as on one host 13 Theory Herwig7 tasks had been running for 4, 5 and 6 days. So they're all gone :-( Edit: I have now looked up other crunchers' tasks, and they all ran fine through the time shift. So my suspicion that my ISP had some kind of outage seems to be correct :-( |
Joined: 14 Jan 10 Posts: 1443 Credit: 9,701,584 RAC: 1,379 |
Hurray for Microsoft Windows:
2024-10-27 02:56:06 (10380): Status Report: Elapsed Time: '312000.000000'
2024-10-27 02:56:06 (10380): Status Report: CPU Time: '309770.468750'
2024-10-27 02:16:06 (10380): VM Heartbeat file specified, but missing heartbeat.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415184357 |
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
I happened to meet one of my neighbors, who has the same ISP, in the staircase and asked him whether he had noticed any internet interruption last night. He answered that he had been watching TV via WLAN and noticed "a disruption of a few minutes". This is somewhat strange, as the heartbeat check interval is 20 minutes, if I interpret "'heartbeat' every 1200.000000 seconds" correctly. As far as I can see, this value is set in Theory_2024_04_30_prod.xml (and in similar files for the other subprojects). So at least some of the tasks could have been affected, given the above time indications. But I am surprised that ALL tasks on all my hosts were affected. The only explanation for me would be that my neighbor was wrong when he talked about a "few minutes", and that in reality the interruption lasted more than 20 minutes. Or, who knows, the heartbeat check is actually made at intervals shorter than 20 minutes. At any rate, in order to avoid tasks that have already run for several days failing in the future just because of this heartbeat check, my question is whether I can simply increase the 20-minute interval by changing this value in Theory_2024_04_30_prod.xml. Edit: in view of what CP said minutes ago (message 50937), my tasks obviously did not fail because of the internet interruption "of a few minutes" (and my neighbor may be right about this time span), but rather because they were crunched on Windows :-( |
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
"https://lhcathome.cern.ch/lhcathome/result.php?resultid=415184357" Oh, how nice: 3 days 15 minutes' work for nothing :-( So when I looked up a few colleagues' tasks, I probably didn't notice that none of them had been crunched on Windows :-( |
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 1,442 |
What do the event messages in Windows show at this time? Yes, it's really no fun; the EU wants to abolish this nonsense of time changes. There was a petition two years ago, with no success. Oh, I see one CMS task with this problem. Computer ID 10795955, run time 5 hours 40 min 37 sec, CPU time 14 hours 50 min 14 sec.
2024-10-27 02:59:24 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:59:57 (4468): Preference change detected
2024-10-27 02:59:57 (4468): Setting CPU throttle for VM. (100%)
2024-10-27 02:59:57 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:02:33 (4468): Preference change detected
2024-10-27 02:02:33 (4468): Setting CPU throttle for VM. (100%)
2024-10-27 02:02:33 (4468): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds))
2024-10-27 02:02:47 (4468): VM Heartbeat file specified, but missing heartbeat.
2024-10-27 02:02:47 (4468): Powering off VM. |
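The backward jump from 02:59:57 to 02:02:33 in a log like this one can be spotted automatically. The following is a minimal Python sketch, not part of BOINC or vboxwrapper; the function name `find_backward_jumps` and the regular expression are illustrative assumptions based only on the line format shown above.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2024-10-27 02:59:57 (4468): Preference change detected
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")


def find_backward_jumps(lines):
    """Return (previous, current) timestamp pairs where the log clock went backwards."""
    jumps = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped log line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and stamp < previous:
            jumps.append((previous, stamp))
        previous = stamp
    return jumps


# Three lines taken from the log excerpt above.
sample = [
    "2024-10-27 02:59:57 (4468): Setting CPU throttle for VM. (100%)",
    "2024-10-27 02:02:33 (4468): Preference change detected",
    "2024-10-27 02:02:47 (4468): VM Heartbeat file specified, but missing heartbeat.",
]
jumps = find_backward_jumps(sample)
```

For the sample, `jumps` holds a single pair, 02:59:57 followed by 02:02:33, which is the point where the local clock was set back by one hour.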
Joined: 24 Oct 04 Posts: 1194 Credit: 60,769,786 RAC: 71,071 |
Mine will be Sun, Nov 3, 2024 (Pacific Time), not Sun, Oct 27, 2024 (Central European Time). We have this every year here. |
Joined: 14 Jan 10 Posts: 1443 Credit: 9,701,584 RAC: 1,379 |
"What do the event messages in Windows show at this time?"
LHC@home 27 Oct 01:07:17 Project requested delay of 6 seconds
LHC@home 27 Oct 02:16:22 Computation for task Theory_2794-3244360-278_0 finished
LHC@home 27 Oct 02:16:22 Output file Theory_2794-3244360-278_0_r1840616054_result for task Theory_2794-3244360-278_0 absent
LHC@home 27 Oct 02:17:49 Sending scheduler request: To report completed tasks.
LHC@home 27 Oct 02:17:49 Reporting 1 completed tasks |
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
What I am wondering is what kind of change was made to LHC tasks (in fact, to all of the subprojects) between the previous time change in spring 2024 (from winter time to summer time) and now. I don't remember this problem occurring in spring, nor last fall (change from summer time to winter time), nor at any time before (except several years ago, though I don't even know whether it was LHC or another project that was affected). Does anyone have any idea what happened? In other words: if nobody straightens this problem out, we will have it come up again next spring. |
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 1,442 |
WCG (Krembil) does not show this problem. I am running most of my CPUs on WCG at the moment. |
Joined: 28 Sep 04 Posts: 757 Credit: 53,072,083 RAC: 41,728 |
I have two hosts here at LHC@home and did have this time change happen last night. No problems with the tasks; they continued running normally. The hosts run very different versions of BOINC. The Win10 machine has BOINC 7.16.5 and VM 5.2.44. The Win11 machine has BOINC 8.0.2 and VM 7.0.6. I remember having the same problems as described above in previous years, but I don't remember whether it was in spring or fall. |
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
"WCG (Krembil) does not show this problem." I'm running GPUGRID tasks on 4 hosts; they all survived the time change without any problem. |
Joined: 14 Jan 10 Posts: 1443 Credit: 9,701,584 RAC: 1,379 |
"Anyone any idea what happened?" As far as I am aware: the wrapper checks every 1200 seconds (in our case) whether the heartbeat file coming from the VM is no older than 20 minutes. If it is older, the VM is stopped (or the wrapper tries to stop it). I suppose that in spring the heartbeat timestamp is later than the last check, so no kill. Interestingly, ATLAS doesn't use this mechanism. (You could check whether you had ATLAS errors on Windows because of this.) |
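The effect described above can be illustrated with a small simulation. This is only a sketch of the described behaviour, not the actual vboxwrapper code: `local_wall_clock`, `heartbeat_ok` and `run_checks` are hypothetical names, and the sketch assumes that the check compares timestamps expressed in local wall-clock time and requires each one to be later than the previous one.

```python
from datetime import datetime, timedelta, timezone

# Summer time ended in Central Europe at 01:00 UTC on 27 Oct 2024
# (03:00 local time became 02:00 local time).
DST_END_UTC = datetime(2024, 10, 27, 1, 0, tzinfo=timezone.utc)


def local_wall_clock(utc):
    """Naive Central European wall-clock time for a UTC instant (autumn 2024 only)."""
    offset = timedelta(hours=2) if utc < DST_END_UTC else timedelta(hours=1)
    return (utc + offset).replace(tzinfo=None)


def heartbeat_ok(previous_stamp, current_stamp):
    """Assumed rule: a heartbeat counts only if its timestamp moved forward."""
    return current_stamp > previous_stamp


def run_checks(stamp_of, start_utc, interval_s=1200, checks=2):
    """Simulate periodic checks; stamp_of maps a UTC instant to the stored stamp."""
    results = []
    previous = stamp_of(start_utc)
    for i in range(1, checks + 1):
        now = start_utc + timedelta(seconds=i * interval_s)
        current = stamp_of(now)
        ok = heartbeat_ok(previous, current)
        results.append((current, ok))
        if ok:
            previous = current  # remember only the last accepted heartbeat
    return results


start = datetime(2024, 10, 27, 0, 36, tzinfo=timezone.utc)  # 02:36 local summer time
local_results = run_checks(local_wall_clock, start)  # stamps in local wall-clock time
utc_results = run_checks(lambda t: t, start)         # stamps kept in UTC
```

With local stamps the first check (02:56) passes and the second (02:16, one hour "earlier" after the fall-back) is rejected, which is the same 02:56 to 02:16 pattern as in the log posted earlier in this thread. With UTC stamps both checks pass. In spring the local clock jumps forward, so a rule of this kind would never see a stamp going backwards.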
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
"I have two hosts here at LHC@home and did have this time change happen last night. No problems with the tasks; they continued running normally. The hosts run very different versions of BOINC. The Win10 machine has BOINC 7.16.5 and VM 5.2.44. The Win11 machine has BOINC 8.0.2 and VM 7.0.6. ..." My 18 hosts run various versions of BOINC and Oracle VM, a few older ones and more newer ones; yet all of them were affected :-( |
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
Crystal Pellet wrote: "As far as I am aware of:" Yes, you are right: ATLAS did NOT fail (so my initial statement that ALL subprojects were affected was wrong). However, all CMS tasks failed (besides the Theory tasks). |
Joined: 28 Sep 04 Posts: 757 Credit: 53,072,083 RAC: 41,728 |
"I have two hosts here at LHC@home and did have this time change happen last night. No problems with the tasks; they continued running normally. The hosts run very different versions of BOINC. The Win10 machine has BOINC 7.16.5 and VM 5.2.44. The Win11 machine has BOINC 8.0.2 and VM 7.0.6." I was only running ATLAS tasks, so they worked correctly in this situation. |
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
Crystal Pellet wrote: "As far as I am aware of:" So if this is indeed the way the wrapper for Theory and CMS works, then each fall the running tasks are bound to fail. Too bad if this happens to tasks which have already been running for several days, as in my case. But, as I asked before: what has happened to the wrapper since last year? Because last year this problem definitely did NOT occur. In any case, I will no longer download Theory tasks as long as Herwig7 is in the basket. Too many failures have happened here; out of a total of about 20 tasks, only 3 succeeded :-( First, there were tasks which were stopped exactly at the 10-day limit; then I wanted to terminate some tasks gracefully before they reached the 10-day limit, but this did not work either, and the tasks ended up invalid. And then there is what happened last night, when 13 tasks with runtimes between 4 and 6 days failed due to this heartbeat error. This alone wasted about 1,500 hours of CPU time, not to mention that electricity prices here have roughly tripled since the beginning of the Ukraine war. So the conclusion for me is: keep my fingers off Theory; there are too many uncertainties :-( |
Joined: 14 Jan 10 Posts: 1443 Credit: 9,701,584 RAC: 1,379 |
"so if this indeed is the way the wrapper for Theory and CMS is working, then each fall the running tasks are bound to fail. Too bad if this happens to tasks which have been running for several days already, like in my case." And all this happens even though I have set UTC as the universal time in the Windows registry. So something is not working as it should. We should get rid of this stupid summer/winter time switching. It is an old-fashioned relic of the first oil crisis of the 1970s. Maybe it would be even better to get rid of the vboxwrapper heartbeat mechanism that checks the sync between host and VM. Over the years we have had more trouble than benefit from it. At least I am now testing without the heartbeat check, with no issues so far, including a time change one hour backwards and, an hour later, one hour forwards, with running Theory, CMS and ATLAS tasks. This problem comes back every year, but now with those long-running Herwig7 tasks it hurts more.
2018-10-28 03:56:08 (6268): Guest Log: [INFO] MCPlots JobID: 47001352 in slot1 |
Joined: 18 Dec 15 Posts: 1862 Credit: 129,340,284 RAC: 116,437 |
...I fully agree (I guess nothing else would be expected :-) |
©2025 CERN