Thread 'Pausing and resuming CMS tasks'

Guy Joined: 9 Feb 08 Posts: 61 Credit: 2,178,744 RAC: 20
Message 51369 - Posted: 8 Jan 2025, 9:54:13 UTC
As mentioned in this post (about how BOINC tries to estimate the time a task will take), pausing and resuming CMS VM tasks causes problems:
"The BOINC benchmarks are not relevant since each VM runs internal benchmarks that are not reported back to BOINC. ... Each pause/resume initiated by BOINC disturbs this calculation and may result in losing the last scientific job (18 h are a hard limit). Since this is also not reported back to BOINC it is recommended to avoid pause/resume cycles."
Ah well, that explains this happening when I paused and resumed a CMS task between 22:26 last night and 01:59 this morning -
<core_client_version>8.0.4</core_client_version>
<![CDATA[
<stderr_txt>
2025-01-07 12:33:26 (5920): vboxwrapper version 26207
2025-01-07 12:33:26 (5920): BOINC client version: 8.0.4
2025-01-07 12:33:27 (5920): Detected: VirtualBox VboxManage Interface (Version: 7.1.4)
...
...
2025-01-07 22:26:52 (5920): Stopping VM.
2025-01-07 22:26:56 (5920): Successfully stopped VM.
2025-01-08 01:59:49 (4231): vboxwrapper version 26207
2025-01-08 01:59:49 (4231): BOINC client version: 8.0.4
2025-01-08 01:59:52 (4231): Detected: VirtualBox VboxManage Interface (Version: 7.1.4)
2025-01-08 01:59:52 (4231): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2025-01-08 01:59:52 (4231): Guest Log: BIOS: VirtualBox 7.1.4
...
2025-01-08 01:59:52 (4231): Starting VM using VBoxManage interface. (boinc_257993d36581a549, slot#3)
2025-01-08 01:59:58 (4231): Successfully started VM. (PID = '4297')
2025-01-08 01:59:58 (4231): Reporting VM Process ID to BOINC.
2025-01-08 01:59:58 (4231): VM state change detected. (old = 'poweredoff', new = 'running')
2025-01-08 01:59:58 (4231): Detected: Web Application Enabled (http://localhost:47895)
2025-01-08 01:59:58 (4231): Status Report: Job Duration: '64800.000000'
2025-01-08 01:59:58 (4231): Status Report: Elapsed Time: '35938.000000'
2025-01-08 01:59:58 (4231): Status Report: CPU Time: '135895.160000'
2025-01-08 01:59:58 (4231): Preference change detected
2025-01-08 01:59:58 (4231): Setting CPU throttle for VM. (100%)
2025-01-08 01:59:58 (4231): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2025-01-08 02:00:02 (4231): Guest Log: 09:52:47.542720 timesync vgsvcTimeSyncWorker: Radical host time change: 12 795 334 000 000ns (HostNow=1 736 301 601 618 000 000 ns HostLast=1 736 288 806 284 000 000 ns)
2025-01-08 02:00:12 (4231): Guest Log: 09:52:57.585671 timesync vgsvcTimeSyncWorker: Radical guest time change: 12 795 384 121 000ns (GuestNow=1 736 301 611 748 110 000 ns GuestLast=1 736 288 816 363 989 000 ns fSetTimeLastLoop=true )
2025-01-08 02:04:37 (4231): Guest Log: [INFO] glidein exited with return value 0.
2025-01-08 02:04:37 (4231): Guest Log: [INFO] Shutting Down.
2025-01-08 02:04:37 (4231): VM Completion File Detected.
2025-01-08 02:04:37 (4231): VM Completion Message: glidein exited with return value 0. .
2025-01-08 02:04:37 (4231): Powering off VM.
2025-01-08 02:04:38 (4231): Successfully stopped VM.
2025-01-08 02:04:38 (4231): Deregistering VM. (boinc_257993d36581a549, slot#3)
2025-01-08 02:04:38 (4231): Removing network bandwidth throttle group from VM.
2025-01-08 02:04:38 (4231): Removing VM from VirtualBox.
2025-01-08 02:04:43 (4231): called boinc_finish(0)
</stderr_txt>
But I still got credit granted -
Task: 418699676
Work unit: 229706420
Computer: 10860321
Sent: 7 Jan 2025
Time reported or deadline: 8 Jan 2025
Status: Completed and validated
Run time: 35,891.24
CPU time: 135,935.70
Credit: 702.38
Application: CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu
Another task was downloaded (CMS_2779181_1736299900.108510_0). But my PC power settings caused it to enter sleep mode instead of just blanking the screen. And, when I woke it up this morning -
...
2025-01-08 02:29:31 (5323): Guest Log: [INFO] CMS application starting. Check log files.
2025-01-08 02:32:51 (5323): VM state change detected. (old = 'running', new = 'paused')
2025-01-08 07:29:33 (5323): Error in resume VM for VM: -182
Command: VBoxManage -q controlvm "boinc_52dab00a2f68e27b" resume
Output: VBoxManage: error: Could not resume the machine execution (VERR_VM_INVALID_VM_STATE)
VBoxManage: error: Details: code VBOX_E_VM_ERROR (0x80bb0003), component ConsoleWrap, interface IConsole, callee nsISupports
VBoxManage: error: Context: "Resume()" at line 393 of file VBoxManageControlVM.cpp
2025-01-08 07:29:34 (5323): Guest Log: 00:04:00.037174 timesync vgsvcTimeSyncWorker: Radical host time change: 17 811 435 000 000ns (HostNow=1 736 321 373 456 000 000 ns HostLast=1 736 303 562 021 000 000 ns)
2025-01-08 07:29:34 (5323): VM state change detected. (old = 'paused', new = 'running')
2025-01-08 07:29:43 (5323): Guest Log: 00:04:10.037675 timesync vgsvcTimeSyncWorker: Radical guest time change: 17 811 437 786 000ns (GuestNow=1 736 321 383 456 583 000 ns GuestLast=1 736 303 572 018 797 000 ns fSetTimeLastLoop=true )
2025-01-08 07:54:41 (5323): Guest Log: [INFO] glidein exited with return value 0.
2025-01-08 07:54:41 (5323): Guest Log: [INFO] Shutting Down.
2025-01-08 07:54:41 (5323): VM Completion File Detected.
2025-01-08 07:54:41 (5323): VM Completion Message: glidein exited with return value 0. .
2025-01-08 07:54:41 (5323): Powering off VM.
2025-01-08 07:54:41 (5323): Successfully stopped VM.
2025-01-08 07:54:41 (5323): Deregistering VM. (boinc_52dab00a2f68e27b, slot#1)
2025-01-08 07:54:41 (5323): Removing network bandwidth throttle group from VM.
2025-01-08 07:54:41 (5323): Removing VM from VirtualBox.
2025-01-08 07:54:46 (5323): called boinc_finish(0)
</stderr_txt>
I even got credit for this! Completed and validated - 31.99
Do all VMs have this pause/resume problem? It's a nuisance. Being an "LHC@home" user I often want to pause BOINC as, I'm sure, do many others... Probably. Thanks.
ID: 51369

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,787 RAC: 990
Message 51370 - Posted: 8 Jan 2025, 10:23:45 UTC - in response to Message 51369.
"Do all VMs have this pause/resume problem? It's a nuisance. Being an 'LHC@home' user I often want to pause BOINC as, I'm sure, do many others... Probably. Thanks."
CMS and ATLAS do need an internet connection. When a task is paused/suspended the VM doesn't have one. Only short network interruptions are allowed - shorter than one hour.
ID: 51370

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 71,016
Message 51371 - Posted: 8 Jan 2025, 10:39:41 UTC - in response to Message 51369.
The post you quoted is talking about "glidein" timers (some of them, but it has many more). These shouldn't be mixed up with VirtualBox timers/issues and issues introduced by the OS.
And, yes, CMS tends to grant credits even if errors happen. This is caused by the fact that most errors are not clear enough to blame the user (= no credit).
Your first log shows the task was simply paused for too long:
[pre]2025-01-07 22:26:56 (5920): Successfully stopped VM.
2025-01-08 01:59:49 (4231): vboxwrapper version 26207[/pre]
Hence, after the restart it got no job and finished gracefully.
Messages like this are from VirtualBox and can safely be ignored:
[pre]... timesync vgsvcTimeSyncWorker: Radical host time change: ...[/pre]
Messages like this point out that the previous VM shutdown didn't finish correctly, most likely because the OS killed the processes too early. Nonetheless, vboxwrapper/BOINC identified it as a non-recoverable error and finished the task.
[pre]2025-01-08 07:29:33 (5323): Error in resume VM for VM: -182
Command: VBoxManage -q controlvm "boinc_52dab00a2f68e27b" resume
Output: VBoxManage: error: Could not resume the machine execution (VERR_VM_INVALID_VM_STATE)
VBoxManage: error: Details: code VBOX_E_VM_ERROR (0x80bb0003), component ConsoleWrap, interface IConsole, callee nsISupports
VBoxManage: error: Context: "Resume()" at line 393 of file VBoxManageControlVM.cpp[/pre]
[pre]... Being an "LHC@home" user I often want to pause BOINC as, I'm sure, do many others ...[/pre]
A valid request that has come up every now and then for years, but the fact that CMS allows short breaks is already a compromise, as it was originally developed for datacenters where it runs without any break. Take it or leave it.
ID: 51371
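As a rough illustration of the "paused for too long" reading of the first log, the sketch below pulls the timestamps out of vboxwrapper stderr lines and measures the gap between a "Successfully stopped VM." line and the next "vboxwrapper version" line. The helper name `pause_gaps` and the constant `MAX_PAUSE` are made up for this example; the one-hour value is only the figure mentioned earlier in this thread, not a documented BOINC or CMS setting.

```python
import re
from datetime import datetime, timedelta

# Assumed threshold: the "shorter than one hour" figure quoted in this thread.
MAX_PAUSE = timedelta(hours=1)

# vboxwrapper stderr lines look like "YYYY-MM-DD HH:MM:SS (pid): message".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")


def pause_gaps(log_lines):
    """Return (stopped_at, restarted_at, gap) for each stop/restart pair."""
    gaps = []
    stopped_at = None
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip "..." lines and anything without a timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("Successfully stopped VM"):
            stopped_at = stamp
        elif message.startswith("vboxwrapper version") and stopped_at is not None:
            # A new wrapper start after a stop marks the end of the pause.
            gaps.append((stopped_at, stamp, stamp - stopped_at))
            stopped_at = None
    return gaps


# The two lines quoted from the first log in this thread.
sample = [
    "2025-01-07 22:26:56 (5920): Successfully stopped VM.",
    "2025-01-08 01:59:49 (4231): vboxwrapper version 26207",
]

for stopped, restarted, gap in pause_gaps(sample):
    verdict = "too long" if gap > MAX_PAUSE else "ok"
    print(f"paused {gap} -> {verdict}")
```

For the two quoted lines this reports a pause of 3:32:53, well over the one-hour figure.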

Guy Joined: 9 Feb 08 Posts: 61 Credit: 2,178,744 RAC: 20
Message 51372 - Posted: 8 Jan 2025, 10:47:40 UTC - in response to Message 51370. Last modified: 8 Jan 2025, 10:52:20 UTC
Thanks, that's informative. I hope I'm not asking what's been clearly described elsewhere. But does that mean that being out of contact during a >1hr pause causes synchronisation issues? Which raises the question: is there a Symmetric Multi-Processor system somewhere managing the task simultaneously across multiple remote LHC@home users? Or not? Perhaps a script to sync up my squid exists somewhere - it's beyond my skills to make one. What's going on, please? Thanks.
ID: 51372

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 71,016
Message 51373 - Posted: 8 Jan 2025, 11:01:01 UTC - in response to Message 51372.
This is a high-level explanation, as we (= BOINC volunteers) can't influence any backend settings.
It's not a missing synchronisation (short term, like TCP). Instead,
- the jobs request token(s) as a key to access certain backend systems. The tokens time out after a while.
- the jobs periodically send keep-alive packets to backend systems (like WMAgent). If they stop doing so, the backend marks them as lost.
ID: 51373
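The keep-alive idea described above can be pictured with a toy model. This is only a sketch under stated assumptions: the `Backend` class, its method names, and the 3600-second timeout are hypothetical and are not taken from WMAgent or any real CMS backend.

```python
from dataclasses import dataclass, field

# Hypothetical timeout in seconds; the thread only says that interruptions
# have to stay shorter than about one hour.
KEEPALIVE_TIMEOUT_S = 3600


@dataclass
class Backend:
    """Toy backend that remembers when each job last sent a keep-alive."""

    last_seen: dict = field(default_factory=dict)

    def keepalive(self, job_id: str, now_s: float) -> None:
        # Record the time of the latest keep-alive packet from this job.
        self.last_seen[job_id] = now_s

    def lost_jobs(self, now_s: float) -> list:
        # A job counts as lost once it has been silent longer than the timeout.
        return sorted(
            job
            for job, seen in self.last_seen.items()
            if now_s - seen > KEEPALIVE_TIMEOUT_S
        )


backend = Backend()
backend.keepalive("job-a", now_s=0)
backend.keepalive("job-b", now_s=0)
# job-a keeps reporting; job-b sits in a paused VM and goes silent.
backend.keepalive("job-a", now_s=3000)
print(backend.lost_jobs(now_s=4000))
```

In this model a paused VM simply stops calling `keepalive`, so the backend has no way to tell a pause from a crash and writes the job off once the timeout passes.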

Guy Joined: 9 Feb 08 Posts: 61 Credit: 2,178,744 RAC: 20
Message 51374 - Posted: 8 Jan 2025, 11:39:24 UTC - in response to Message 51373.
Thank you. You are very knowledgeable. It sounds like a simple thing to fix though... Ah. And it's only January... LOL
ID: 51374

Guy Joined: 9 Feb 08 Posts: 61 Credit: 2,178,744 RAC: 20
Message 51381 - Posted: 10 Jan 2025, 17:09:22 UTC Last modified: 10 Jan 2025, 17:38:46 UTC
Hello, Thanks all for your descriptions of what's happening. It's all very enlightening.
Currently I've stopped downloading ATLAS and CMS because -
"CMS and ATLAS do need an internet connection. When a task is paused/suspended the VM doesn't have one. Only short network interruptions are allowed - shorter than one hour."
- and as I said - I often pause BOINC to do other things... (SixTrack and Theory are still go though.)
Now. Because the >1hr pause causes a failure to complete the ATLAS/CMS task, I "suspect" that I may be wasting the LHC@home project's time in some way, for example if it has to send the task out again -
"...CMS tends to grant credits even if errors happen. This is caused by the fact that most errors are not clear enough to blame the user..."
- and the project is only being polite by granting credit. But am I wasting time?
To elaborate: Is the time I spend crunching simply contributing to a work unit communal result that's serviced by many clients? Has an incomplete task still assisted a bit? The granted credit suggests as much... And in which case, of course, I'd resume downloading the mt ATLAS and CMS tasks.
Thank you sincerely for your continuing attention to my enquiries.
ID: 51381

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 71,016
Message 51383 - Posted: 11 Jan 2025, 11:27:44 UTC - in response to Message 51381.
"... But am I wasting time? To elaborate: Is the time I spend crunching simply contributing to a work unit communal result that's serviced by many clients? Has an incomplete task still assisted a bit? The granted credit suggests as much... And in which case, of course, I'd resume downloading the mt ATLAS and CMS tasks. Thank you sincerely for your continuing attention to my enquiries."
To understand what happens with the data your computer delivers, you may want to look at the mcplots homepage: http://mcplots.cern.ch/
That page is related to "Theory Simulation". In short: If your data doesn't validate, it will remain unused. If your data validates, it will become part of the overall statistical analysis which generates the nice plots.
The data returned for ATLAS/CMS is used in a similar way, related to the particular experiment/detector. Same here: If your data doesn't validate, it will remain unused. If your data validates, it will become part of the overall statistical analysis.
In all cases: As long as there are enough validated data snippets to generate a reliable overall result, lost data doesn't matter.
ID: 51383

Guy Joined: 9 Feb 08 Posts: 61 Credit: 2,178,744 RAC: 20
Message 51384 - Posted: 11 Jan 2025, 14:47:49 UTC - in response to Message 51383.
Right... :-) Thank you. Thanks for the link. Just seeing the swathes of analysis at mcplots bolsters just how much regard I've got for everyone involved with the fantastic subatomic science in progress at CERN, elsewhere and all their collaborators. As a volunteer, no matter what, I feel part of that. It is, ultimately, excellent work. Love it. Thanks everyone.
ID: 51384

Guy Joined: 9 Feb 08 Posts: 61 Credit: 2,178,744 RAC: 20
Message 51454 - Posted: 27 Jan 2025, 11:14:17 UTC
ATLAS and CMS resumed - with confidence! (for some time..!) Thank you.
ID: 51454