Message boards : CMS Application : Pausing and resuming CMS tasks
Profile Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,526,508
RAC: 2,895
Message 51369 - Posted: 8 Jan 2025, 9:54:13 UTC

As mentioned in this post (about how BOINC tries to estimate the time a task will take), pausing and resuming CMS VM tasks causes problems.
The BOINC benchmarks are not relevant since each VM runs internal benchmarks that are not reported back to BOINC.
...
Each pause/resume initiated by BOINC disturbs this calculation and may result in losing the last scientific job (18 h is a hard limit).
Since this is also not reported back to BOINC it is recommended to avoid pause/resume cycles.
Ah well, that explains what happened when I paused and resumed a CMS task between 22:26 last night and 01:59 this morning -
<core_client_version>8.0.4</core_client_version>
<![CDATA[
<stderr_txt>
2025-01-07 12:33:26 (5920): vboxwrapper version 26207
2025-01-07 12:33:26 (5920): BOINC client version: 8.0.4
2025-01-07 12:33:27 (5920): Detected: VirtualBox VboxManage Interface (Version: 7.1.4)
...
...
2025-01-07 22:26:52 (5920): Stopping VM.
2025-01-07 22:26:56 (5920): Successfully stopped VM.
2025-01-08 01:59:49 (4231): vboxwrapper version 26207
2025-01-08 01:59:49 (4231): BOINC client version: 8.0.4
2025-01-08 01:59:52 (4231): Detected: VirtualBox VboxManage Interface (Version: 7.1.4)
2025-01-08 01:59:52 (4231): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2025-01-08 01:59:52 (4231): Guest Log: BIOS: VirtualBox 7.1.4
...
2025-01-08 01:59:52 (4231): Starting VM using VBoxManage interface. (boinc_257993d36581a549, slot#3)
2025-01-08 01:59:58 (4231): Successfully started VM. (PID = '4297')
2025-01-08 01:59:58 (4231): Reporting VM Process ID to BOINC.
2025-01-08 01:59:58 (4231): VM state change detected. (old = 'poweredoff', new = 'running')
2025-01-08 01:59:58 (4231): Detected: Web Application Enabled (http://localhost:47895)
2025-01-08 01:59:58 (4231): Status Report: Job Duration: '64800.000000'
2025-01-08 01:59:58 (4231): Status Report: Elapsed Time: '35938.000000'
2025-01-08 01:59:58 (4231): Status Report: CPU Time: '135895.160000'
2025-01-08 01:59:58 (4231): Preference change detected
2025-01-08 01:59:58 (4231): Setting CPU throttle for VM. (100%)
2025-01-08 01:59:58 (4231): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2025-01-08 02:00:02 (4231): Guest Log: 09:52:47.542720 timesync vgsvcTimeSyncWorker: Radical host time change: 12 795 334 000 000ns (HostNow=1 736 301 601 618 000 000 ns HostLast=1 736 288 806 284 000 000 ns)
2025-01-08 02:00:12 (4231): Guest Log: 09:52:57.585671 timesync vgsvcTimeSyncWorker: Radical guest time change: 12 795 384 121 000ns (GuestNow=1 736 301 611 748 110 000 ns GuestLast=1 736 288 816 363 989 000 ns fSetTimeLastLoop=true )
2025-01-08 02:04:37 (4231): Guest Log: [INFO] glidein exited with return value 0.
2025-01-08 02:04:37 (4231): Guest Log: [INFO] Shutting Down.
2025-01-08 02:04:37 (4231): VM Completion File Detected.
2025-01-08 02:04:37 (4231): VM Completion Message: glidein exited with return value 0.
.
2025-01-08 02:04:37 (4231): Powering off VM.
2025-01-08 02:04:38 (4231): Successfully stopped VM.
2025-01-08 02:04:38 (4231): Deregistering VM. (boinc_257993d36581a549, slot#3)
2025-01-08 02:04:38 (4231): Removing network bandwidth throttle group from VM.
2025-01-08 02:04:38 (4231): Removing VM from VirtualBox.
2025-01-08 02:04:43 (4231): called boinc_finish(0)

</stderr_txt>
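For anyone reading along, the three Status Report values in the log above are in seconds. Here is a minimal Python sketch of the conversion. It assumes (my reading, not something the log states) that Job Duration is the 18 h hard limit from the quoted post, and that CPU Time divided by Elapsed Time roughly gives the average number of busy cores.

```python
# Values copied from the "Status Report" lines of the log above (seconds).
job_duration_s = 64800.0
elapsed_s = 35938.0
cpu_time_s = 135895.16


def to_hours(seconds: float) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / 3600.0


job_duration_h = to_hours(job_duration_s)
elapsed_h = to_hours(elapsed_s)

# Assumption: CPU time summed over all cores divided by wall-clock time
# approximates how many cores were busy on average.
avg_busy_cores = cpu_time_s / elapsed_s

print(f"Job duration limit: {job_duration_h:.1f} h")
print(f"Elapsed at restart: {elapsed_h:.2f} h")
print(f"Average busy cores: {avg_busy_cores:.2f}")
```

64800 s is exactly 18 h, so the task was a little more than halfway to that limit when the VM was restarted.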
But I still got credit granted -
Task          Work unit     Computer      Sent            Time reported   Status                     Run        CPU           Credit    Application
                                                          or deadline                                time       time
---------------------------------------------------------------------------------------------------------------------------------------------------
418699676     229706420     10860321      7 Jan 2025      8 Jan 2025      Completed and validated    35,891.24  135,935.70    702.38    CMS Simulation v70.30 (vbox64_mt_mcore_cms)
                                                                                                                                        x86_64-pc-linux-gnu

Another task was downloaded (CMS_2779181_1736299900.108510_0).
But my PC power settings caused it to enter sleep mode instead of just blanking the screen. And, when I woke it up this morning -
...
2025-01-08 02:29:31 (5323): Guest Log: [INFO] CMS application starting. Check log files.
2025-01-08 02:32:51 (5323): VM state change detected. (old = 'running', new = 'paused')
2025-01-08 07:29:33 (5323): Error in resume VM for VM: -182
Command:
VBoxManage -q controlvm "boinc_52dab00a2f68e27b" resume
Output:
VBoxManage: error: Could not resume the machine execution (VERR_VM_INVALID_VM_STATE)
VBoxManage: error: Details: code VBOX_E_VM_ERROR (0x80bb0003), component ConsoleWrap, interface IConsole, callee nsISupports
VBoxManage: error: Context: "Resume()" at line 393 of file VBoxManageControlVM.cpp

2025-01-08 07:29:34 (5323): Guest Log: 00:04:00.037174 timesync vgsvcTimeSyncWorker: Radical host time change: 17 811 435 000 000ns (HostNow=1 736 321 373 456 000 000 ns HostLast=1 736 303 562 021 000 000 ns)
2025-01-08 07:29:34 (5323): VM state change detected. (old = 'paused', new = 'running')
2025-01-08 07:29:43 (5323): Guest Log: 00:04:10.037675 timesync vgsvcTimeSyncWorker: Radical guest time change: 17 811 437 786 000ns (GuestNow=1 736 321 383 456 583 000 ns GuestLast=1 736 303 572 018 797 000 ns fSetTimeLastLoop=true )
2025-01-08 07:54:41 (5323): Guest Log: [INFO] glidein exited with return value 0.
2025-01-08 07:54:41 (5323): Guest Log: [INFO] Shutting Down.
2025-01-08 07:54:41 (5323): VM Completion File Detected.
2025-01-08 07:54:41 (5323): VM Completion Message: glidein exited with return value 0.
.
2025-01-08 07:54:41 (5323): Powering off VM.
2025-01-08 07:54:41 (5323): Successfully stopped VM.
2025-01-08 07:54:41 (5323): Deregistering VM. (boinc_52dab00a2f68e27b, slot#1)
2025-01-08 07:54:41 (5323): Removing network bandwidth throttle group from VM.
2025-01-08 07:54:41 (5323): Removing VM from VirtualBox.
2025-01-08 07:54:46 (5323): called boinc_finish(0)

</stderr_txt>
I even got credit for this! Completed and validated - 31.99

Do all VMs have this pause/resume problem?
It's a nuisance. Being an "LHC@home" user I often want to pause BOINC as, I'm sure, do many others... Probably.
Thanks.
ID: 51369 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1440
Credit: 9,662,889
RAC: 1,386
Message 51370 - Posted: 8 Jan 2025, 10:23:45 UTC - in response to Message 51369.  

Do all VMs have this pause/resume problem?
It's a nuisance. Being an "LHC@home" user I often want to pause BOINC as, I'm sure, do many others... Probably.
Thanks.
CMS and ATLAS do need an internet connection. When a task is paused/suspended the VM doesn't have one.
Only short network interruptions are allowed - shorter than one hour.
ID: 51370 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2607
Credit: 262,481,847
RAC: 137,761
Message 51371 - Posted: 8 Jan 2025, 10:39:41 UTC - in response to Message 51369.  

I was talking about some of the "glidein" timers (it has many more).
These shouldn't be mixed with VirtualBox timers/issues and issues introduced by the OS.

And, yes, CMS tends to grant credits even if errors happen.
This is caused by the fact that most errors are not clear enough to blame the user (= no credit).

Your first log shows the task simply paused for too long:
2025-01-07 22:26:56 (5920): Successfully stopped VM.
2025-01-08 01:59:49 (4231): vboxwrapper version 26207

Hence, after the restart it got no job and finished gracefully.
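
The length of that pause can be read straight off the two timestamps. A minimal Python sketch, assuming the log lines always start with a `YYYY-MM-DD HH:MM:SS` timestamp and taking the one-hour figure from the earlier reply in this thread (the names `MAX_PAUSE`, `log_timestamp` and `pause_length` are made up for this example):

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Assumption: "shorter than one hour" as stated earlier in this thread;
# the real limits are backend settings and are not visible to volunteers.
MAX_PAUSE = timedelta(hours=1)


def log_timestamp(line: str) -> datetime:
    """Parse the leading 19-character timestamp of a vboxwrapper log line."""
    return datetime.strptime(line[:19], LOG_TIME_FORMAT)


def pause_length(stop_line: str, restart_line: str) -> timedelta:
    """Time between the VM stop line and the next vboxwrapper start line."""
    return log_timestamp(restart_line) - log_timestamp(stop_line)


stop_line = "2025-01-07 22:26:56 (5920): Successfully stopped VM."
restart_line = "2025-01-08 01:59:49 (4231): vboxwrapper version 26207"

gap = pause_length(stop_line, restart_line)
print(f"Paused for {gap}, limit {MAX_PAUSE}, too long: {gap > MAX_PAUSE}")
```

For these two lines the gap is 3 h 32 min 53 s, well beyond one hour.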


Messages like this are from VirtualBox and can safely be ignored:
... timesync vgsvcTimeSyncWorker: Radical host time change: ...
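
If you are curious what the number in such a message means: it is the jump in host time, in nanoseconds, that the guest noticed. A small Python sketch that decodes the first message from the log earlier in this thread (the regular expression is written for that one line format and is an assumption, not an official parser):

```python
import re

# Message text copied from the first log posted in this thread.
line = (
    "Radical host time change: 12 795 334 000 000ns "
    "(HostNow=1 736 301 601 618 000 000 ns "
    "HostLast=1 736 288 806 284 000 000 ns)"
)


def grouped_int(text: str) -> int:
    """Turn a space-grouped number such as '12 795 334' into an int."""
    return int(text.replace(" ", ""))


pattern = r"change: ([\d ]+)ns \(HostNow=([\d ]+) ns HostLast=([\d ]+) ns\)"
match = re.search(pattern, line)
if match is None:
    raise ValueError("line does not look like a timesync message")

delta_ns = grouped_int(match.group(1))
host_now_ns = grouped_int(match.group(2))
host_last_ns = grouped_int(match.group(3))

# The reported change is simply HostNow minus HostLast.
jump_hours = delta_ns / 1e9 / 3600
print(f"Host clock jumped by {delta_ns / 1e9:.0f} s (about {jump_hours:.2f} h)")
```

The reported change equals HostNow minus HostLast, about 12,795 s, which is consistent with the pause visible in the first log (22:26 to 01:59).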



Messages like this point out that the preceding VM shutdown didn't finish correctly, most likely because the OS killed the processes too early.
Nonetheless, vboxwrapper/BOINC identified it as a non-recoverable error and finished the task.
2025-01-08 07:29:33 (5323): Error in resume VM for VM: -182
Command:
VBoxManage -q controlvm "boinc_52dab00a2f68e27b" resume
Output:
VBoxManage: error: Could not resume the machine execution (VERR_VM_INVALID_VM_STATE)
VBoxManage: error: Details: code VBOX_E_VM_ERROR (0x80bb0003), component ConsoleWrap, interface IConsole, callee nsISupports
VBoxManage: error: Context: "Resume()" at line 393 of file VBoxManageControlVM.cpp



... Being an "LHC@home" user I often want to pause BOINC as, I'm sure, do many others ...

A valid request that has come up every now and then for years, but the fact that CMS allows short breaks is already a compromise, as it was originally developed for datacenters where it runs without any break.
Take it or leave it.
ID: 51371 · Report as offensive     Reply Quote
Profile Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,526,508
RAC: 2,895
Message 51372 - Posted: 8 Jan 2025, 10:47:40 UTC - in response to Message 51370.  
Last modified: 8 Jan 2025, 10:52:20 UTC

Thanks, that's informative.
I hope I'm not asking what's been clearly described elsewhere.
But does that mean that being out of contact during a pause of more than 1 hour causes synchronisation issues?
Which raises the question: is there a Symmetric Multi-Processor system somewhere, managing the task simultaneously across multiple remote LHC@home users? Or not?
Perhaps a script to sync-up my squid exists somewhere - beyond my skills to make one.
What's going on, please?
Thanks.
ID: 51372 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2607
Credit: 262,481,847
RAC: 137,761
Message 51373 - Posted: 8 Jan 2025, 11:01:01 UTC - in response to Message 51372.  

Only a high-level explanation, as we (= BOINC volunteers) can't influence any backend settings.

It's not a missing synchronisation (short term, like TCP).
Instead,
- the jobs request token(s) as a key to access certain backend systems. The tokens time out after a while.
- the jobs periodically send keep-alive packets to backend systems (like WMAgent). If they stop doing so, the backend marks them as lost.
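
To illustrate the two points above, here is a toy model in Python. It is not how WMAgent or any CERN backend is implemented. The class `JobRecord` and both timeout values are hypothetical and only show why a long pause hurts while a short one does not.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; the real timeouts are
# backend settings that volunteers cannot see or change.
TOKEN_LIFETIME_S = 3600.0
KEEPALIVE_TIMEOUT_S = 3600.0


@dataclass
class JobRecord:
    """What a backend might remember about one running job (toy model)."""

    token_issued_at: float
    last_keepalive_at: float

    def token_valid(self, now: float) -> bool:
        """A token is usable only for a limited time after it was issued."""
        return now - self.token_issued_at < TOKEN_LIFETIME_S

    def marked_lost(self, now: float) -> bool:
        """A job that has been silent for too long is considered lost."""
        return now - self.last_keepalive_at > KEEPALIVE_TIMEOUT_S


# The job got its token at t=0 and sent its last keep-alive at t=600 s,
# right before the volunteer paused the task.
job = JobRecord(token_issued_at=0.0, last_keepalive_at=600.0)

after_short_pause = 600.0 + 20 * 60  # resumed after 20 minutes
after_long_pause = 600.0 + 12773.0   # resumed after about 3.5 hours

for label, now in (("short", after_short_pause), ("long", after_long_pause)):
    print(label, "token valid:", job.token_valid(now),
          "marked lost:", job.marked_lost(now))
```

While the VM is paused it sends nothing, so from the backend's point of view a paused job looks the same as one that has died.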
ID: 51373 · Report as offensive     Reply Quote
Profile Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,526,508
RAC: 2,895
Message 51374 - Posted: 8 Jan 2025, 11:39:24 UTC - in response to Message 51373.  

Thank you. You are very knowledgeable.
It sounds like a simple thing to fix though...

Ah. And it's only January... LOL
ID: 51374 · Report as offensive     Reply Quote
Profile Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,526,508
RAC: 2,895
Message 51381 - Posted: 10 Jan 2025, 17:09:22 UTC
Last modified: 10 Jan 2025, 17:38:46 UTC

Hello,
Thanks all for your descriptions of what's happening. It's all very enlightening.
Currently I've stopped downloading ATLAS and CMS because -
CMS and ATLAS do need an internet connection. When a task is paused/suspended the VM doesn't have one.
Only short network interruptions are allowed - shorter than one hour.
- and as I said - I often pause BOINC to do other things... (SixTrack and Theory are still go though.)
Now.
Because a pause of more than 1 hour causes the ATLAS/CMS task to fail to complete, I "suspect" that I may be wasting the LHC@home project's time in some way, for example because it has to send the task out again -
...CMS tends to grant credits even if errors happen.
This is caused by the fact that most errors are not clear enough to blame the user...
- and the project is only being polite by granting credit.

But am I wasting time? To elaborate: Is the time I spend crunching simply contributing to a communal work unit result that's serviced by many clients? Has an incomplete task still assisted a bit? The granted credit suggests as much... In which case, of course, I'd resume downloading the mt ATLAS and CMS tasks.
Thank you sincerely for your continuing attention to my enquiries.
ID: 51381 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2607
Credit: 262,481,847
RAC: 137,761
Message 51383 - Posted: 11 Jan 2025, 11:27:44 UTC - in response to Message 51381.  

... But am I wasting time? To elaborate: Is the time I spend crunching simply contributing to a communal work unit result that's serviced by many clients? Has an incomplete task still assisted a bit? The granted credit suggests as much... In which case, of course, I'd resume downloading the mt ATLAS and CMS tasks.
Thank you sincerely for your continuing attention to my enquiries.

To understand what happens with the data your computer delivers you may want to look at the mcplots homepage:
http://mcplots.cern.ch/

That page is related to "Theory Simulation".
In short:
If your data doesn't validate it will remain unused.
If your data validates it will become part of the overall statistical analysis which generates the nice plots.

The data returned for ATLAS/CMS is used in a similar way related to the particular experiment/detector.
Same here:
If your data doesn't validate it will remain unused.
If your data validates it will become part of the overall statistical analysis.

In all cases:
As long as there are enough validated data snippets to generate a reliable overall result, lost data doesn't matter.
ID: 51383 · Report as offensive     Reply Quote
Profile Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,526,508
RAC: 2,895
Message 51384 - Posted: 11 Jan 2025, 14:47:49 UTC - in response to Message 51383.  

Right... :-)
Thank you.
Thanks for the link. Just seeing the swathes of analysis at mcplots reinforces how much regard I have for everyone involved with the fantastic subatomic science in progress at CERN and elsewhere, and for all their collaborators. As a volunteer, no matter what, I feel part of that. It is, ultimately, excellent work. Love it. Thanks everyone.
ID: 51384 · Report as offensive     Reply Quote
Profile Guy
Joined: 9 Feb 08
Posts: 55
Credit: 1,526,508
RAC: 2,895
Message 51454 - Posted: 27 Jan 2025, 11:14:17 UTC

ATLAS and CMS resumed - with confidence! (for some time..!)
Thank you.
ID: 51454 · Report as offensive     Reply Quote



©2025 CERN