Message boards : Number crunching : Tasks stuck at 99.99% with run time of 1 day+

greg_be

Joined: 28 Dec 08
Posts: 339
Credit: 4,863,589
RAC: 511
Message 46470 - Posted: 18 Mar 2022, 17:06:17 UTC
Last modified: 18 Mar 2022, 17:09:07 UTC

I have had 2 tasks this week hit that 99.99% point and then get stuck.
The one-day run time is because I don't notice them. I shut down my computer at night (suspend, shut down client, exit via menu options), fire it back up when I get up, and leave it alone until I get back from work and shut down again. I assume everything runs fine, then spot check and see this going on.

So why the 99.99% mark? Why not complete it?
I checked the stderr before I aborted it and could not see anything wrong.
Time counts were ok, model count was ok, no comms errors, no VM errors.

I assume it's just random stuff with a bug in it, but maybe someone could tell me if it's more than that?


Tasks
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=185439718 Validate error before me on a Linux machine.

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=185355437 completed and validated on a Linux machine.



Virtualbox (6.1.32) installed / BOINC 7.16.20
Vbox is clean, only running tasks are in it.
ID: 46470
maeax

Joined: 2 May 07
Posts: 2242
Credit: 173,901,277
RAC: 2,795
Message 46472 - Posted: 18 Mar 2022, 17:45:02 UTC - in response to Message 46470.  

ATLAS tasks always start from scratch after a restart. You need 24/7 runtime.
If you have a problem (not enough RAM or something else, see Yeti's checklist),
that must be checked first.
ID: 46472
greg_be

Message 46477 - Posted: 19 Mar 2022, 13:19:26 UTC - in response to Message 46472.  

ATLAS tasks always start from scratch after a restart. You need 24/7 runtime.
If you have a problem (not enough RAM or something else, see Yeti's checklist),
that must be checked first.



Umm..that is not always the case. I have done this routine many times and had no problems.
RAM? 48 gigs is more than enough. I can run 15 Rosetta Pythons and have memory left over.
ID: 46477
maeax

Message 46478 - Posted: 19 Mar 2022, 13:25:21 UTC - in response to Message 46477.  

Are you running too many tasks concurrently, apart from the CERN tasks?
When an ATLAS task does not have enough RAM to process the collisions, this can cause problems.
ID: 46478
greg_be

Message 46482 - Posted: 19 Mar 2022, 17:48:43 UTC - in response to Message 46478.  

Are you running too many tasks concurrently, apart from the CERN tasks?
When an ATLAS task does not have enough RAM to process the collisions, this can cause problems.



Not familiar with cern-tasks. ATLAS is restricted to 4 cores and 1 task. That is what works on my system via BOINC.
ID: 46482
greg_be

Message 46486 - Posted: 20 Mar 2022, 12:08:51 UTC - in response to Message 46478.  

OK... now that I have had time to process, I think I get what you're asking.

So here is what goes on. I have 7 projects including LHC. (WCG is offline at the moment, so it's 6 in total.)
But they rotate every few hours.

ATLAS is limited to one task on 4 cores, as that is the only way to get a stable working environment.

But as of this writing, these have been the only 2 that had any problems.
Since then everything has returned to normal.
So it must have been a 'bug' somewhere.

Over in Rosetta I have had a few tasks error out with:
2022-03-19 09:54:18 (8908): Guest Log: 02:19:40.428489 timesync vgsvcTimeSyncWorker: Radical host time change: 34 342 910 000 000ns (HostNow=1 647 680 057 820 000 000 ns HostLast=1 647 645 714 910 000 000 ns)
2022-03-19 09:54:28 (8908): Guest Log: 02:19:50.428917 timesync vgsvcTimeSyncWorker: Radical guest time change: 34 431 310 621 000ns (GuestNow=1 647 680 067 820 439 000 ns GuestLast=1 647 645 636 509 818 000 ns fSetTimeLastLoop=true )

Then the task gets all out of shape and stalls.
In this case on Rosetta, it was advancing by 0.002% every 2-second cycle in the BoincTasks program, and CPU was at 21%. The task had run for 1.5 days and made it into the 80% range before dying.

I didn't look at the log for these tasks before killing them, but they made it in one case to 99.99% and then stalled. CPU % was very low. Under 20%.
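
The two timesync lines quoted above can be turned into a readable duration to see how large the clock jump was. The following is a minimal Python sketch, not part of any BOINC or VirtualBox tool: the helper name `radical_change_seconds` is hypothetical, and the regular expression assumes the log format exactly as quoted (nanoseconds written with spaces as thousands separators).

```python
import re

# Hypothetical helper: extract the "Radical ... time change" value from a
# VirtualBox guest log line and convert it from nanoseconds to seconds.
# The pattern assumes the format quoted in this post.
PATTERN = re.compile(r"Radical (host|guest) time change: ([\d ]+)ns")

def radical_change_seconds(line):
    """Return (kind, seconds) for a matching line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    nanoseconds = int(match.group(2).replace(" ", ""))
    return match.group(1), nanoseconds / 1e9

host_line = ("Guest Log: 02:19:40.428489 timesync vgsvcTimeSyncWorker: "
             "Radical host time change: 34 342 910 000 000ns "
             "(HostNow=1 647 680 057 820 000 000 ns "
             "HostLast=1 647 645 714 910 000 000 ns)")

kind, seconds = radical_change_seconds(host_line)
print(f"{kind} clock jumped by {seconds / 3600:.2f} hours")
```

For the host line this gives a jump of roughly nine and a half hours, which would be consistent with the machine having been shut down overnight while the VM state was saved.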
ID: 46486
maeax

Message 46487 - Posted: 20 Mar 2022, 12:26:42 UTC - in response to Message 46486.  

In BOINC, I have the preference "switch between tasks every" set to 7.500 minutes.
But there must be a RAM problem when those projects run together.
ID: 46487
doug

Joined: 28 Mar 20
Posts: 23
Credit: 128,343
RAC: 143
Message 46491 - Posted: 21 Mar 2022, 15:30:11 UTC

I am having what seems to be a similar situation. I have an Atlas task that has been running for, according to Boinc Mgr and BoincTasks, 2d:5h:35m. It uses all 4 of my CPUs when running. I've run a number of CMS tasks recently with no problem. This is apparently the first Atlas task I've gotten. At first I thought I had sequentially gotten multiple Atlas tasks, until the actual elapsed time for this one finally penetrated my thick skull.

Something doesn't seem right. The only LHC-related errors in the BOINC event log are a few of these:

Project communication failed: attempting access to reference site
Internet access OK - project servers may be temporarily down.

BoincTasks (v1.78) shows the "CPU %" for this task as 228.35%, which I'd think can't be right.

I'd hate to have to cancel this task and waste all those cycles, but it seems like something is seriously wrong with the task processing. Does anyone have any thoughts and/or recommendations?

Windows 10 (10.0.19044.1586)
16 GB RAM
BOINC v7.16.20 (x64)
vbox 6.1.12 r139181 (Qt5.6.2)
BOINC Mgr shows 34 GB disk assigned to BOINC but unused
Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz 3.20 GHz


Thanks.

Doug
ID: 46491
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2531
Credit: 253,722,201
RAC: 41,981
Message 46492 - Posted: 21 Mar 2022, 15:56:13 UTC - in response to Message 46491.  

BoincTasks (v1.78) shows the "CPU %" for this task as 228.35%, which I'd think can't be right.

Why not?
Unlike CMS, ATLAS sets up a multicore VM (up to 8 cores).
In your case the computer reports a 4-core CPU, hence it should be a 4-core VM.
You may check the task's stderr.txt for an entry like "Setting CPU Count for VM. (n)" with n being the number of allocated cores.
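
As a rough illustration of that check, the Python sketch below searches stderr text for the entry. It assumes the message reads exactly "Setting CPU Count for VM. (n)" as worded above; the function name `vm_cpu_count` and the sample line (timestamp and process ID) are made up for the example.

```python
import re

# Assumed message format, taken from the wording in this post:
#   "Setting CPU Count for VM. (n)"
CPU_COUNT = re.compile(r"Setting CPU Count for VM\. \((\d+)\)")

def vm_cpu_count(stderr_text):
    """Return the last CPU count reported in the stderr text, or None if absent."""
    matches = CPU_COUNT.findall(stderr_text)
    return int(matches[-1]) if matches else None

# Illustrative sample line; the timestamp and process ID are invented.
sample = "2022-03-21 10:00:00 (1234): Setting CPU Count for VM. (4)\n"
print(vm_cpu_count(sample))
```

To use it on a real task, read the contents of stderr.txt from the task's slot directory and pass the text to the function; the location of that directory depends on the local BOINC installation.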

This also means that your CPU is fully allocated by VirtualBox and your OS permanently switches between the VM and all other processes, e.g. the OS itself. This results in the low BoincTasks CPU percentage.

Unfortunately, it looks like you didn't install the VirtualBox extensions.
If you did, you would be able to check the VM's console 2, which shows the calculation progress inside the VM.
Other tools like BOINC or 3rd party apps are not able to do this.
They always show fake numbers.
ID: 46492
greg_be

Message 46493 - Posted: 21 Mar 2022, 18:25:18 UTC - in response to Message 46492.  

computezrmle, what am I looking for to see %?
You talk of console 2, where is that?
ID: 46493
computezrmle
Message 46494 - Posted: 21 Mar 2022, 19:07:26 UTC - in response to Message 46493.  

computezrmle, what am I looking for to see %?

Open BOINC Manager, select a task and click on the property button.
Compare runtime with CPU time.
Most tasks are singlecore -> expect CPU time to be close to or less than runtime.

For multicore tasks running on n cores, CPU time can be up to n * runtime.

Example (4-core ATLAS)
runtime: 1 h (100 %)
CPU time: up to 4 h (400 %)


Doug's 4-core task reports 228.35 % which means an average of only 2.2835 cores are used.
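
The arithmetic above can be written as a small Python sketch. The function names are hypothetical and only restate the relation described here: average busy cores equal CPU time divided by runtime, and a "CPU %" value is that ratio multiplied by 100.

```python
def average_cores_used(cpu_time_s, runtime_s):
    """Average number of busy cores = CPU time divided by wall-clock runtime."""
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    return cpu_time_s / runtime_s

def cpu_percent_to_cores(cpu_percent):
    """Convert a 'CPU %' figure (100 % = one full core) to an average core count."""
    return cpu_percent / 100.0

# The 4-core example above: 1 h runtime, up to 4 h CPU time.
print(average_cores_used(4 * 3600, 1 * 3600))

# The reported 228.35 % corresponds to roughly 2.28 of the 4 cores.
print(round(cpu_percent_to_cores(228.35), 4))
```

A result well below the number of allocated cores suggests the VM is waiting for CPU time or I/O rather than computing on all cores.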


You talk of console 2, where is that?

As I wrote:
You need to install the VirtualBox extensions.

Then open BOINC Manager, select a VM task and click on "Show VM Console".
All of that is explained in Yeti's checklist.



Experts could use a different RPC client and contact the VM via the network port reported in stderr.txt.
ID: 46494
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,051
RAC: 885
Message 46496 - Posted: 21 Mar 2022, 20:37:40 UTC - in response to Message 46494.  

You need to install the VirtualBox extensions.
Not really needed.

You can use Oracle VM VirtualBox Manager.
Click the running VM there and then the button Show (top right)
With Alt-F3 and Alt-F2 you can get more info about processes and progress.
ID: 46496
doug

Message 46497 - Posted: 22 Mar 2022, 3:09:06 UTC

I installed the Virtual Box extension pack for the version of VB I have. I restarted BOINC, and selected the Atlas project. I don't see any button or anything else labeled "Show VM console".

I also tried the Oracle VM Virtual Box Mgr. I selected the Atlas task, which displayed the "Show" button (as Crystal Pellet stated). I clicked that button and a console-type window opened up. Pressing Alt-F3 does nothing. Pressing Alt-F2 displays the message:

blk_update_request:I/O error, dev sda, sector 37743176

followed by this message about 17 times: "Write error on swap device", followed by (8:0:377432XX), where XX is different on every line.

These seem like error messages, but they don't tell me, at least, much that is useful.

And I just saw a minute or so ago, while writing this message, that the Atlas task, after 2 days and 13+ hours of "elapsed time" and 5 days and 17+ hours of "CPU time", has terminated in a "Computation error". So, that was a whole lot of wasted cycles.

Doug
ID: 46497
computezrmle
Message 46503 - Posted: 22 Mar 2022, 7:30:39 UTC - in response to Message 46497.  

blk_update_request:I/O error, dev sda, sector 37743176
followed by this message about 17 times: "Write error on swap device", followed by (8:0:377432XX), where XX is different on every line.

This points to a corrupt vm_image.vdi in the task's slot directory,
most likely caused by too many incomplete suspend/resume operations on a heavily loaded system.

As I wrote, this computer has a 4-core CPU and the VM was configured to allocate all 4 CPUs.
A setup like that may be successful if nothing but the VM runs on the computer.
Unfortunately, the logfile shows this wasn't the case.

The VirtualBox forum clearly recommends not to allocate all available cores.
You may change the setup to 1-core or 2-core VMs either here or via an app_config.xml.
ID: 46503
doug

Message 46511 - Posted: 22 Mar 2022, 15:23:17 UTC - in response to Message 46503.  

Thanks to all for your help.

computezrmle, I've changed my LHC preferences here to reduce the number of cores for Atlas tasks. I have a question: when you said: "Unfortunately, the logfile shows this wasn't the case", what "logfile" are you referring to? One local to my machine? One somewhere on the site here?

Also, is there a sample app_config.xml for LHC that I could work from if I decide I need further or more granular settings?

Thanks again to all.

Doug
ID: 46511
computezrmle
Message 46513 - Posted: 22 Mar 2022, 16:16:07 UTC - in response to Message 46511.  

what "logfile" are you referring to?

Each task writes "stderr.txt" to the task's working slot.
You can check this file while the task is in progress.

Once the task is finished the content of stderr.txt is reported back to the server and you find it as part of the task details.
Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=349270061
The example logfile mentions a couple of suspend/resume cycles which usually don't happen if the computer runs only 1 project.




is there a sample app_config.xml for LHC

An app_config.xml must strictly follow the rules explained here:
https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration

A typical ATLAS configuration may look like this:
<app_config>
    <app>
        <name>ATLAS</name>
        <max_concurrent>1</max_concurrent>
    </app>
    <app_version>
        <app_name>ATLAS</app_name>
        <avg_ncpus>2</avg_ncpus>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <cmdline>--memory_size_mb 4800</cmdline>
    </app_version>
</app_config>
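
Because a malformed file is easy to produce by hand, it can help to check that an app_config.xml is at least well-formed before placing it in the project directory. The Python sketch below parses the configuration shown above with the standard library and prints the settings it finds. It is only a syntax check under that assumption, not BOINC's own validation, and the function name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The configuration from this post, embedded as text for the example.
APP_CONFIG = """<app_config>
    <app>
        <name>ATLAS</name>
        <max_concurrent>1</max_concurrent>
    </app>
    <app_version>
        <app_name>ATLAS</app_name>
        <avg_ncpus>2</avg_ncpus>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <cmdline>--memory_size_mb 4800</cmdline>
    </app_version>
</app_config>"""

def summarize(xml_text):
    """Parse app_config text; raises ET.ParseError if it is not well-formed XML."""
    root = ET.fromstring(xml_text)
    if root.tag != "app_config":
        raise ValueError("root element must be <app_config>")
    summary = {}
    for app in root.findall("app"):
        summary[app.findtext("name")] = {
            "max_concurrent": int(app.findtext("max_concurrent", "0")),
        }
    for version in root.findall("app_version"):
        entry = summary.setdefault(version.findtext("app_name"), {})
        entry["avg_ncpus"] = float(version.findtext("avg_ncpus", "1"))
        entry["cmdline"] = version.findtext("cmdline", "")
    return summary

print(summarize(APP_CONFIG))
```

An unclosed tag or a stray character makes `ET.fromstring` raise `ParseError`, which is a quick way to catch typos before the client reads the file.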
ID: 46513
AndreyOR

Joined: 8 Dec 19
Posts: 37
Credit: 7,587,438
RAC: 0
Message 46516 - Posted: 22 Mar 2022, 19:33:38 UTC - in response to Message 46486.  

greg_be,
It seems like you've had time discrepancy issues with Rosetta on a VM. I've recently been dealing with this issue on a different project. I've recently started running macOS Mojave on VBox to process 32-bit tasks for climateprediction.net. I've been getting a message in the BOINC event log that reads (numbers in parentheses vary):
New system time (1647911207) < old system time (1648063920); clearing timeouts

Following this, task progress bars freeze but the time counting continues. I can get things going again by doing suspend/resume on each task, but so far tasks error out at the very end. Which sucks, since these are very long-running tasks that take days to weeks to run. I've never seen these kinds of issues before, but I also don't use VBox much. I rarely run apps that are VBox only, since I use WSL2 (which uses Hyper-V) to run Linux apps, and Hyper-V and VBox don't really work well together.
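
To get a feel for the size of the jump in a message like the one quoted above, the two Unix timestamps can be subtracted and printed as UTC times. This is a minimal Python sketch using only the two numbers from that message; nothing here is taken from BOINC's code.

```python
from datetime import datetime, timezone

# Values quoted from the event-log message in this post.
new_system_time = 1647911207
old_system_time = 1648063920

# A positive result means the clock moved backwards by this many seconds.
jump_seconds = old_system_time - new_system_time
print(f"clock moved backwards by {jump_seconds} s ({jump_seconds / 3600:.1f} h)")

# Show both timestamps as UTC wall-clock times.
for label, ts in (("new", new_system_time), ("old", old_system_time)):
    print(label, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```

A jump of many hours rather than a few seconds points at the guest clock being reset, not at ordinary drift.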
ID: 46516
computezrmle
Message 46519 - Posted: 22 Mar 2022, 20:33:44 UTC - in response to Message 46516.  

As long as different OSs run on the same computer it might be good to
- set the computer's CMOS RTC to UTC
- configure VMs to use UTC
- sync the local clock via NTP

See:
https://github.com/BOINC/boinc/pull/4631
ID: 46519
AndreyOR

Message 46523 - Posted: 23 Mar 2022, 4:11:38 UTC - in response to Message 46519.  

Thanks for the suggestions. I found a VBox command in the manual to make VM sync time with host frequently but that didn't seem to make a difference. So I disabled time checking/syncing on macOS to see if that'll help. If it doesn't I'll try your suggestions. Is following your suggestions going to make my PC run on UTC time instead of local time?
ID: 46523
computezrmle
Message 46526 - Posted: 23 Mar 2022, 7:51:38 UTC - in response to Message 46523.  

It's about how an OS interprets the time information from the CMOS RTC at system start.
Windows traditionally expects local time while most other OSs expect UTC.
All (?) recent OSs can be told to interpret the RTC time as localtime or UTC and it makes sense to use the same mode across all OSs you run on the same computer.

Using UTC in a mixed environment has a couple of advantages (follow the GitHub links).

This setting does not influence the time presented to a user as long as the timezone is correctly set.
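
The last point can be illustrated with a short Python sketch: one instant stored as UTC is shown to the user in a local zone, and both values still refer to the same moment. The UTC+1 zone is a hypothetical example, and the code only demonstrates the idea; it does not change any RTC or OS setting.

```python
from datetime import datetime, timedelta, timezone

# One fixed instant, stored as UTC (as an RTC set to UTC would hold it).
instant = datetime(2022, 3, 23, 7, 51, 38, tzinfo=timezone.utc)

# Hypothetical user timezone for illustration: UTC+1 with no DST.
local_zone = timezone(timedelta(hours=1), name="UTC+1")

# Convert for display only; the underlying instant is unchanged.
shown_to_user = instant.astimezone(local_zone)
print("stored:", instant.isoformat())
print("shown :", shown_to_user.isoformat())

# Timezone-aware datetimes compare by instant, not by wall-clock digits.
print("same instant:", shown_to_user == instant)
```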
ID: 46526


©2024 CERN