Message boards : Number crunching : High error rate - more than 50%

Stephane Gagnon

Joined: 17 Nov 16
Posts: 6
Credit: 752,169
RAC: 0
Message 37309 - Posted: 11 Nov 2018, 17:52:12 UTC

Hi there (sorry for my English)

First, thanks to the whole team for all the time and effort (the project is run exclusively by volunteers, as far as I know).

The problem:
I have been running LHC@home for some time now, but I'm concerned about the recent high error rate:
State: All (32): In progress (7) · Valid (11) · Invalid (2) · Error (12)
Application: All (32): ATLAS Simulation (8) · LHCb Simulation (8) · Theory Simulation (16)
LHCb Simulation (8): Error while computing (7) · Completed and validated (1)
ATLAS Simulation (8): Completed and validated (5) · Validate error (2) · Cancelled by server (1)
Theory Simulation (16): Completed and validated (5) · Error while computing (4) · In progress (7)

So... 14 tasks in error out of 25 finished tasks (56%). I'm worried.
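
The 14-out-of-25 figure can be reproduced from the "State" line above. The following is only a minimal Python sketch: the function name `parse_counts` is made up for illustration, and it assumes that "Invalid" and "Error" both count as failures while "In progress" tasks are excluded from the total.

```python
import re

def parse_counts(summary_line):
    """Collect every 'Label (n)' pair of a task summary line into a dict."""
    pairs = re.findall(r"(\w[\w ]*?) \((\d+)\)", summary_line)
    return {label: int(count) for label, count in pairs}

# Summary line as quoted in the post.
STATE_LINE = "State: All (32): In progress (7) · Valid (11) · Invalid (2) · Error (12)"

counts = parse_counts(STATE_LINE)

# Assumption: tasks still in progress are not part of the error statistics.
finished = counts["All"] - counts["In progress"]
# Assumption: both "Invalid" and "Error" count as failed tasks.
failed = counts["Invalid"] + counts["Error"]
rate = failed / finished

print(f"{failed} failed out of {finished} finished tasks ({rate:.0%})")
```

With so few tasks the percentage moves a lot with each new result, which is the concern raised under "Statistics concern" below.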

The context:
Hardware: Plain Dell laptop, no overclock, no hack
Host OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1
Client: BOINC client version 7.14.2 for windows_x86_64, plain install, not installed as a service, no GPU usage as far as I know.

Hypothesis
- Windows reboot:
The computer reboots often (Windows Update and... well... this is Windows). I try to close BOINC manually before rebooting, but VirtualBox (VB) is often unable to shut down properly ("A program need to close [...]"). Waiting for VB to close by itself seems useless. After the reboot, BOINC generally restarts and resumes the interrupted task with no error and no warning. But in the end, the task seems to be declared in error by LHC@home.
- Statistics concern:
25 tasks is a very small sample, with a high risk of seeing false patterns. Maybe I should wait a month or two, but I'm not sure the tasks will stay that long in my LHC@home web account.

So...
- Is "Windows reboot" a problem?
- If so, how can I do it properly (please give a precise answer, I'm not comfortable with VB)?
- If the task is declared in error at LHC@home, is it useless to them? If there is no way to shut down properly and the work is useless to LHC@home, I will probably abort the task in BOINC after a restart.
- Is there a way to save the LHC@home task list or access the LHC@home task list archives? It may help to find a more solid pattern.

Suggestion
(BOINC is not your software and you probably have little or no control over it, but...) If I could tell BOINC:
"Process the current tasks then STOP - DO NOT START ANOTHER ONE"
or
"Forget the job done, just restart from the beginning"
it might help. Even if the task takes 30 hours to process, I could wait and save the task this way. Or I could ask BOINC to start over (if the task had only just started).
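
For the first wish ("process the current tasks, then stop"), BOINC already has something close: the "No new tasks" button on the Projects tab of BOINC Manager, and the equivalent `nomorework` project operation of the `boinccmd` command-line tool. The Python sketch below only builds that command line. It assumes `boinccmd` is on the PATH and that the project was attached under the URL shown; the helper names are hypothetical.

```python
import subprocess

# Assumption: this is the URL the project is attached under on this host.
LHC_URL = "https://lhcathome.cern.ch/lhcathome/"

# Project operations used in this sketch.
ALLOWED_OPERATIONS = {"nomorework", "allowmorework", "update", "suspend", "resume"}

def project_command(project_url, operation):
    """Build the boinccmd argument list for a project-level operation."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return ["boinccmd", "--project", project_url, operation]

def run_boinccmd(args):
    """Run the command; raises CalledProcessError if boinccmd reports failure."""
    return subprocess.run(args, check=True)

stop_fetching = project_command(LHC_URL, "nomorework")
print(" ".join(stop_fetching))
```

Passing `stop_fetching` to `run_boinccmd` lets running tasks finish without fetching new ones; `allowmorework` reverses it. The second wish (restart a task from the beginning) has no direct counterpart in this sketch.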

Thanks for your time and effort. Regards,

Note: I don't know where to post this request:
- The problem affects three projects
- The host is Windows, but the problem is probably related to Unix / VirtualBox
- I don't think it's Cafe-related either (is it Java-related? ;)
ID: 37309
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,903,681
RAC: 137,976
Message 37310 - Posted: 11 Nov 2018, 18:36:25 UTC - in response to Message 37309.  

It may help to examine the logs your computer sent back to the project server.
To allow this, you should make your computers visible to other volunteers here:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
ID: 37310
Stephane Gagnon

Joined: 17 Nov 16
Posts: 6
Credit: 752,169
RAC: 0
Message 37311 - Posted: 11 Nov 2018, 19:22:51 UTC - in response to Message 37310.  
Last modified: 11 Nov 2018, 19:46:20 UTC

Thanks for the reply. Noted. I'm looking into how to unhide my computers...
Done. I also ran "Update" from the BOINC Manager.
See Computer Id: 10407976
ID: 37311
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,903,681
RAC: 137,976
Message 37313 - Posted: 11 Nov 2018, 21:41:06 UTC - in response to Message 37311.  

Most of your tasks are running fine and produce valid results.
Most of the errors are caused by the project and not by your setup.



Nonetheless there are a few comments.


1. LHCb has been very problematic for a few weeks.
It is stopped at the moment, and I suggest deselecting it until it delivers a sustainable stream of tasks (and jobs inside the tasks).
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4871&postid=37283


2. Theory occasionally has "hiccups" but seems to be stable enough to stay active.


3. ATLAS
Good log entries are:
2018-11-09 17:04:13 (4556): Guest Log: HITS file was successfully produced
2018-11-09 17:04:13 (4556): Powering off VM.
2018-11-09 17:04:14 (4556): Successfully stopped VM.


Not so good are these lines:
2018-11-03 16:30:12 (5468): VM state change detected. (old = 'running', new = 'paused')
2018-11-03 16:31:25 (5468): VM state change detected. (old = 'paused', new = 'running')
2018-11-03 16:31:36 (5468): VM state change detected. (old = 'running', new = 'paused')
2018-11-03 16:31:57 (5468): VM state change detected. (old = 'paused', new = 'running')
2018-11-03 16:32:18 (5468): VM state change detected. (old = 'running', new = 'paused')
etc. etc. etc.

Frequent interruptions can cause a "Postponed ..." error (see other threads).
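
To see how often a VM was interrupted, the "VM state change detected" lines can be counted in a task log. This is a minimal Python sketch: the regular expression is derived only from the log lines quoted above, and the function names are illustrative, not part of BOINC.

```python
import re
from datetime import datetime

# Pattern derived from the log lines quoted in this post.
STATE_CHANGE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)

def parse_state_changes(log_text):
    """Return (timestamp, old_state, new_state) for every state-change line."""
    changes = []
    for line in log_text.splitlines():
        match = STATE_CHANGE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            changes.append((stamp, match.group(2), match.group(3)))
    return changes

def count_pauses(changes):
    """Count the transitions that put the VM into the 'paused' state."""
    return sum(1 for _, _, new_state in changes if new_state == "paused")

SAMPLE_LOG = """\
2018-11-03 16:30:12 (5468): VM state change detected. (old = 'running', new = 'paused')
2018-11-03 16:31:25 (5468): VM state change detected. (old = 'paused', new = 'running')
2018-11-03 16:31:36 (5468): VM state change detected. (old = 'running', new = 'paused')
2018-11-03 16:31:57 (5468): VM state change detected. (old = 'paused', new = 'running')
2018-11-03 16:32:18 (5468): VM state change detected. (old = 'running', new = 'paused')
"""

changes = parse_state_changes(SAMPLE_LOG)
elapsed = (changes[-1][0] - changes[0][0]).total_seconds()
print(f"{count_pauses(changes)} pauses within {elapsed:.0f} seconds")
```

A healthy task log should show few or no such transitions.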

They may also lead to errors like the following, where an interruption may have occurred just as the result was about to be copied back.

2018-11-05 02:30:49 (2260): Guest Log: Copying the results back to the shared directory!
2018-11-05 02:31:10 (2260): VM state change detected. (old = 'paused', new = 'running')
2018-11-05 02:31:20 (2260): Guest Log: Copied the result file back to the shared directory and created atlas_done file!



Try to interrupt the VMs as infrequently as possible.

Last but not least, please ask further questions in the specific project threads or in "Number crunching".
ID: 37313
Stephane Gagnon

Joined: 17 Nov 16
Posts: 6
Credit: 752,169
RAC: 0
Message 37314 - Posted: 11 Nov 2018, 22:31:07 UTC - in response to Message 37313.  

Thanks for the information.
Can you move my request to the "Number crunching" board? (Or can I do it myself?)
Regards,
ID: 37314
Stephane Gagnon

Joined: 17 Nov 16
Posts: 6
Credit: 752,169
RAC: 0
Message 37356 - Posted: 15 Nov 2018, 22:48:51 UTC - in response to Message 37314.  

Message has been moved. Thanks :)
Please do not close the thread, I'm trying to reproduce some problems found in the past. It may take some time.
ID: 37356
Stephane Gagnon

Joined: 17 Nov 16
Posts: 6
Credit: 752,169
RAC: 0
Message 37368 - Posted: 17 Nov 2018, 14:49:03 UTC - in response to Message 37356.  

Hi there, thanks for watching.

I just found the cause of the error: "VM job unmanageable":
2018-11-16 10:57:20 PM | LHC@home | Task JW8MDm3I3htnyYickojUe11pABFKDmABFKDm2oAMDmABFKDm50CVOm_0 postponed for 86400 seconds: VM job unmanageable, restarting later.

That's why I got the state switching that seems to cause problems:
2018-11-03 16:30:12 (5468): VM state change detected. (old = 'running', new = 'paused')
2018-11-03 16:31:25 (5468): VM state change detected. (old = 'paused', new = 'running')
...

But this time, I configured BOINC to have only one LHC@home task, and there are no more "VM state change detected" lines. We'll see whether the task still ends in error.

Stay tuned :)
ID: 37368
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,697,403
RAC: 235,135
Message 37369 - Posted: 17 Nov 2018, 21:52:25 UTC

I would set "Use at most" to 100% in the usage limits. If you don't want to use 100%, you can limit the number of jobs in the project configuration.
ID: 37369
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 37370 - Posted: 18 Nov 2018, 9:03:36 UTC

Maybe you use the default setting (25%) in BOINC Manager - Computing Tab - When to suspend

Suspend when non-BOINC CPU usage is above .... %

If so: Untick this or set it to 100%
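
The same two settings can also be written to the client's local override file, `global_prefs_override.xml`, which BOINC Manager normally maintains itself. The Python sketch below only generates such a fragment. The tag names `suspend_cpu_usage` and `cpu_usage_limit`, and the convention that 0 disables the "suspend when non-BOINC CPU usage is above" rule, reflect my understanding of the BOINC client and should be checked against the BOINC documentation before use.

```python
import xml.etree.ElementTree as ET

def build_override(suspend_cpu_usage=0.0, cpu_usage_limit=100.0):
    """Build a global_prefs_override.xml fragment as a string.

    suspend_cpu_usage: non-BOINC CPU usage (%) above which BOINC suspends;
                       0 is assumed to mean "never suspend for this reason".
    cpu_usage_limit:   "Use at most N % of CPU time".
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.1f}"
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_usage_limit:.1f}"
    return ET.tostring(root, encoding="unicode")

override_xml = build_override()
print(override_xml)
```

Changing the values in BOINC Manager, as described above, remains the simpler route; the sketch only shows what those checkboxes correspond to.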
ID: 37370
grumpy

Joined: 1 Sep 04
Posts: 57
Credit: 2,831,592
RAC: 53
Message 37390 - Posted: 20 Nov 2018, 21:37:02 UTC
Last modified: 20 Nov 2018, 21:37:32 UTC

!RunTime Errors!
R6025
-pure virtual function call
ID: 37390
Stephane Gagnon

Joined: 17 Nov 16
Posts: 6
Credit: 752,169
RAC: 0
Message 37449 - Posted: 29 Nov 2018, 0:19:35 UTC - in response to Message 37390.  
Last modified: 29 Nov 2018, 0:29:04 UTC

Thanks for your suggestions.

About the Task JW8MDm3I3htnyYickojUe11pABFKDmABFKDm2oAMDmABFKDm50CVOm_0:
- The task finished after multiple BOINC client restarts (File / Exit BOINC + wait 60 secs + start BOINC client)
and
- The task was reported "Completed and validated"
This is cool :)

Later, I got a task (no 208883557) with a bunch of "VM state change detected"
After four days (and 222,971 secs of computing), the task reported "Error while computing"...
...Even if I configured BOINC to have only one LHC@Home task (store at least 0 days of work / Store up to an additional 0 days of work)

I configured BOINC to take 75% of the CPU power because:
- It leaves Windows room to process its own stuff
- It is a noise-related "sweet spot": working without being audible or annoying

I "Use at most 100% of CPU time"
I only suspend when the computer "is on battery"

(Grumpy, I don't understand your observation and/or what to do with it. Sorry)

The main problem here is that LHC@home needs a lot of supervision and a lot of patience, only to end with failed work. :(

Is there a way to configure BOINC to do this:
- If you detect "VM job unmanageable", just restart yourself (and the task will continue by itself).
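
Nothing in this thread points to a built-in BOINC option for this. As a partial workaround, the detection half can be scripted. The Python sketch below only finds "VM job unmanageable" postponements in event-log text; the pattern is taken from the single log line quoted earlier in the thread, the function name is hypothetical, and the restart step is left out because it depends on how BOINC is installed and started.

```python
import re

# Pattern based on the one event-log line quoted in this thread.
POSTPONED = re.compile(
    r"Task (\S+) postponed for (\d+) seconds: VM job unmanageable"
)

def find_unmanageable(event_log_text):
    """Return (task_name, postponed_seconds) for each matching log entry."""
    return [(m.group(1), int(m.group(2))) for m in POSTPONED.finditer(event_log_text)]

SAMPLE_LOG = (
    "2018-11-16 10:57:20 PM | LHC@home | "
    "Task JW8MDm3I3htnyYickojUe11pABFKDmABFKDm2oAMDmABFKDm50CVOm_0 "
    "postponed for 86400 seconds: VM job unmanageable, restarting later."
)

hits = find_unmanageable(SAMPLE_LOG)
for task, seconds in hits:
    print(f"{task}: postponed for {seconds / 3600:.0f} hours")
```

A script like this could run periodically and notify the user, who then restarts the client manually as described above.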

Have a nice day everybody
ID: 37449
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,697,403
RAC: 235,135
Message 37452 - Posted: 29 Nov 2018, 19:44:38 UTC

This works well for me:



BTW, I use VirtualBox 5.1.x to avoid the "unmanageable" errors.
ID: 37452

©2024 CERN