1) Message boards : Number crunching : Hight error rate - more than 50% (Message 37449)
Posted 29 Nov 2018 by Stephane Gagnon
Post:
Thanks for your suggestions.

About the Task JW8MDm3I3htnyYickojUe11pABFKDmABFKDm2oAMDmABFKDm50CVOm_0:
- The task finished after multiple BOINC client restarts (File / Exit BOINC, wait 60 seconds, start the BOINC client)
and
- The task was reported "Completed and validated"
This is cool :)

Later, I got a task (no. 208883557) with a bunch of "VM state change detected" messages.
After four days (and 222,971 seconds of computing), the task reported "Error while computing"...
...even though I had configured BOINC to hold only one LHC@home task ("Store at least 0 days of work" / "Store up to an additional 0 days of work").

I configured BOINC to take 75% of CPU power because:
- It leaves Windows room to run its own processes
- It is a noise-related "sweet spot": the computer works without being audible or annoying

I "Use at most 100% of CPU time"
I only suspend when the computer "is on battery"

(Grumpy, I don't understand your observation and/or what to do with it. Sorry)

The main problem here is that LHC@home needs a lot of supervision and a lot of patience, only to end with failed work. :(

Is there a way to configure BOINC to do this:
- If you detect "VM job unmanageable", just restart yourself (and the task will continue by itself).

Have a nice day everybody
2) Message boards : Number crunching : Hight error rate - more than 50% (Message 37368)
Posted 17 Nov 2018 by Stephane Gagnon
Post:
Hi there, thanks for watching.

I just found the cause of the error, "VM job unmanageable":
2018-11-16 10:57:20 PM | LHC@home | Task JW8MDm3I3htnyYickojUe11pABFKDmABFKDm2oAMDmABFKDm50CVOm_0 postponed for 86400 seconds: VM job unmanageable, restarting later.

That's why I saw the swapping, which seems to cause the problem:
2018-11-03 16:30:12 (5468): VM state change detected. (old = 'running', new = 'paused')
2018-11-03 16:31:25 (5468): VM state change detected. (old = 'paused', new = 'running')
...
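
To see how often the VM flips between states, the log lines above can be counted with a short script. This is only a sketch: the regular expression is written against the two lines quoted here, `count_transitions` is a hypothetical helper name, and how the lines are read from the task's log is left open.

```python
import re
from collections import Counter

# Pattern written against the "VM state change detected" lines quoted above.
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"VM state change detected\. "
    r"\(old = '(?P<old>\w+)', new = '(?P<new>\w+)'\)"
)

def count_transitions(lines):
    """Count (old, new) VM state transitions in an iterable of log lines."""
    transitions = Counter()
    for line in lines:
        match = STATE_CHANGE.match(line.strip())
        if match:
            transitions[(match["old"], match["new"])] += 1
    return transitions

# The two sample lines from this post; a real log would have many more.
sample = [
    "2018-11-03 16:30:12 (5468): VM state change detected. (old = 'running', new = 'paused')",
    "2018-11-03 16:31:25 (5468): VM state change detected. (old = 'paused', new = 'running')",
]
for (old, new), n in count_transitions(sample).items():
    print(f"{old} -> {new}: {n}")
```

A large count of running -> paused transitions over a short period would support the idea that the VM is being paused and resumed repeatedly.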

But this time, I configured BOINC to hold only one LHC@home task, and there are no more "VM state change detected" messages. We'll see whether the task still ends in error.

Stay tuned :)
3) Message boards : Number crunching : Hight error rate - more than 50% (Message 37356)
Posted 15 Nov 2018 by Stephane Gagnon
Post:
Message has been moved. Thanks :)
Please do not close the thread, I'm trying to reproduce some problems found in the past. It may take some time.
4) Message boards : Number crunching : Hight error rate - more than 50% (Message 37314)
Posted 11 Nov 2018 by Stephane Gagnon
Post:
Thanks for the information.
Can you move my request to the "Number crunching" thread? (Can I do it myself?)
Regards,
5) Message boards : Number crunching : Hight error rate - more than 50% (Message 37311)
Posted 11 Nov 2018 by Stephane Gagnon
Post:
Thanks for the reply. Noted. I'm looking at how to unhide my computers...
Done. I also ran the "Update" command from the BOINC Manager.
See Computer Id: 10407976
6) Message boards : Number crunching : Hight error rate - more than 50% (Message 37309)
Posted 11 Nov 2018 by Stephane Gagnon
Post:
Hi there (sorry for my English)

First, thanks for all the time and effort of all the team (the project is exclusively run by volunteer effort, as far as I know)

The problem:
I have been running LHC@home for some time now, but I'm concerned about the recent high error rate:
State: All (32): In progress (7) · Valid (11) · Invalid (2) · Error (12)
Application: All (32): ATLAS Simulation (8) · LHCb Simulation (8) · Theory Simulation (16)
LHCb Simulation (8): Error while computing (7) · Completed and validated (1)
ATLAS Simulation (8): Completed and validated (5) · Validate error (2) · Cancelled by server (1)
Theory Simulation (16): Completed and validated (5) · Error while computing (4) · In progress (7)

So... 14 failed tasks out of 25 finished tasks (56%). I'm worried.
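
For reference, the tally can be reproduced with a short calculation. This is only a sketch: the counts are copied from the "State" line above, and counting "Invalid" plus "Error" as failures while leaving out "In progress" tasks is my own assumption about how the rate should be computed.

```python
# Task counts copied from the "State" line above.
counts = {"In progress": 7, "Valid": 11, "Invalid": 2, "Error": 12}

def error_rate(counts):
    """Return (failed, finished, rate), ignoring tasks still in progress."""
    finished = sum(n for state, n in counts.items() if state != "In progress")
    failed = counts.get("Invalid", 0) + counts.get("Error", 0)
    rate = failed / finished if finished else 0.0
    return failed, finished, rate

failed, finished, rate = error_rate(counts)
print(f"{failed} failed / {finished} finished = {rate:.0%}")
# prints "14 failed / 25 finished = 56%"
```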

The context:
Hardware: Plain Dell laptop, no overclock, no hack
Host OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1
Client: BOINC client version 7.14.2 for windows_x86_64, plain install, not installed as a service, no GPU usage as far as I know.

Hypotheses
- Windows reboot:
The computer reboots often (Windows Update and... well... this is Windows). I try to close BOINC manually before rebooting, but VirtualBox (VB) is often unable to shut down properly ("A program need to close [...]"). Waiting for VB to close by itself seems useless. After the reboot, BOINC generally restarts and resumes the interrupted task with no error and no warning. But in the end, the task seems to be declared in error by LHC@home.
- Statistics concern:
25 tasks is a very small sample, with a high risk of seeing false patterns. Maybe I should wait a month or two, but I'm not sure the tasks will stay that long in my LHC@home web account.

So...
- Is "Windows reboot" a problem?
- If so, how can I reboot properly (please give a precise answer, I'm not at ease with VB)?
- If a task is declared in error at LHC@home, is it useless to them? If there is no way to shut down properly and the task's work is useless to LHC@home, I will probably cancel the task in BOINC after a restart.
- Is there a way to save the LHC@home task list or access the LHC@home task list archives? It may help to find a more solid pattern.

Suggestion
(BOINC is not your software, so you probably have little or no control over it, but...) If I could tell BOINC:
"Process the current tasks then STOP - DO NOT START ANOTHER ONE"
or
"Forget the job done, just restart from the beginning"
it may help. Even if the task takes 30 hours to process, I could wait and save the task this way. Or I could ask BOINC to start over (if the task has just begun).

Thanks for your time and effort. Regards,

Note: I don't know where to post this request:
- The problem affects three projects
- The host is Windows but the problem is probably related to Unix / VirtualBox
- I don't think it's Cafe related either (is it Java related? ;)


