High error rate - more than 50%
Author | Message |
---|---|
Send message Joined: 17 Nov 16 Posts: 6 Credit: 752,169 RAC: 0 |
Hi there (sorry for my English). First, thanks to the whole team for their time and effort (as far as I know, the project is run exclusively by volunteers). The problem: I have been running LHC@home for some time now, but I'm concerned about the recent high error rate. State: All (32): In progress (7) · Valid (11) · Invalid (2) · Error (12). Application: All (32): ATLAS Simulation (8) · LHCb Simulation (8) · Theory Simulation (16). LHCb Simulation (8): Error while computing (7), Completed and validated (1). ATLAS Simulation (8): Completed and validated (5), Validate error (2), Cancelled by server (1). Theory Simulation (16): Completed and validated (5), Error while computing (4), In progress (7). So... 14 tasks in error / 25 tasks total. I'm worried. The context: Hardware: plain Dell laptop, no overclocking, no hacks. Host OS: Microsoft Windows 7 Professional x64 Edition, Service Pack 1. Client: BOINC client version 7.14.2 for windows_x86_64, plain install, not installed as a service, no GPU usage as far as I know. Hypotheses: - Windows reboot: The computer reboots often (Windows Update and... well... this is Windows). I try to close BOINC manually before rebooting, but VirtualBox (VB) is often unable to shut down properly ("A program needs to close [...]"). Waiting for VB to close by itself seems useless. After the reboot, BOINC generally restarts and resumes the interrupted task with no error and no warning. But in the end, the task seems to be declared in error by LHC@home. - Statistics concern: 25 tasks is a very small sample, with a high risk of seeing false patterns. Maybe I should wait a month or two, but I'm not sure the tasks will stay that long in my LHC@home web account. So... - Is a Windows reboot a problem? - If so, how can I do it properly (please give a precise answer, I'm not at ease with VB)? - If a task is declared in error at LHC@home, is it useless to the project?
If there is no way to shut down properly and the task's work is useless to LHC@home, I will probably cancel the task in BOINC after a restart. - Is there a way to save the LHC@home task list or access the LHC@home task list archives? It may help to find a more solid pattern. Suggestion (BOINC is not your software, so you probably have little or no influence over it, but...): If I could tell BOINC "Process the current tasks then STOP - DO NOT START ANOTHER ONE" or "Forget the work done, just restart from the beginning", it might help. Even if a task takes 30 hours to process, I could try to wait and save the task this way. Or ask BOINC to start over (if the task has only just begun). Thanks for your time and effort. Regards, Note: I didn't know where to post this request: - The problem affects three projects - The host is Windows but the problem is probably related to Unix / VirtualBox - I don't think it's Cafe related either (Is it Java related? ;) |
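The "more than 50%" figure in the thread title can be checked from the counts in the post above. The following is a minimal sketch: the dictionary keys and the function name are made up for illustration, and it assumes that "in progress" tasks are excluded while both "Invalid" and "Error" count as failures.

```python
# Task counts copied from the post above (State: All (32)).
counts = {
    "in_progress": 7,
    "valid": 11,
    "invalid": 2,
    "error": 12,
}


def failure_rate(counts):
    """Return (failed, finished, ratio), ignoring tasks still in progress."""
    finished = counts["valid"] + counts["invalid"] + counts["error"]
    failed = counts["invalid"] + counts["error"]
    return failed, finished, failed / finished


failed, finished, rate = failure_rate(counts)
print(f"{failed} failed of {finished} finished tasks: {rate:.0%}")
# prints "14 failed of 25 finished tasks: 56%"
```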
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,936,497 RAC: 137,523 |
It may help to examine the logs your computer sent back to the project server. To allow this, please make your computers visible to other volunteers here: https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project |
Send message Joined: 17 Nov 16 Posts: 6 Credit: 752,169 RAC: 0 |
Thanks for the reply. Noted. I'm looking at how to unhide computers... Done. I also ran "Update" from the BOINC Manager. See computer ID: 10407976 |
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,936,497 RAC: 137,523 |
Most of your tasks are running fine and produce valid results. Most of the errors are caused by the project and not by your setup. Nonetheless, here are a few comments. 1. LHCb has been very problematic for a few weeks. It is stopped at the moment, and I suggest deselecting it until it delivers a sustainable stream of tasks (and jobs inside the tasks). https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4871&postid=37283 2. Theory occasionally has "hiccups" but seems to be stable enough to stay active. 3. ATLAS: Good log entries are: 2018-11-09 17:04:13 (4556): Guest Log: HITS file was successfully produced 2018-11-09 17:04:13 (4556): Powering off VM. 2018-11-09 17:04:14 (4556): Successfully stopped VM. Not so good are these lines: 2018-11-03 16:30:12 (5468): VM state change detected. (old = 'running', new = 'paused') 2018-11-03 16:31:25 (5468): VM state change detected. (old = 'paused', new = 'running') 2018-11-03 16:31:36 (5468): VM state change detected. (old = 'running', new = 'paused') 2018-11-03 16:31:57 (5468): VM state change detected. (old = 'paused', new = 'running') 2018-11-03 16:32:18 (5468): VM state change detected. (old = 'running', new = 'paused') etc. etc. etc. Frequent interruptions can cause a "Postponed ..." error (see other threads). They may also lead to errors like the following, where an interruption may have occurred just as the result was about to be copied back. 2018-11-05 02:30:49 (2260): Guest Log: Copying the results back to the shared directory! 2018-11-05 02:31:10 (2260): VM state change detected. (old = 'paused', new = 'running') 2018-11-05 02:31:20 (2260): Guest Log: Copied the result file back to the shared directory and created atlas_done file! Try to interrupt the VMs as infrequently as possible. Last but not least: please ask further questions in the specific project threads or in "Number crunching". |
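To spot the "not so good" pattern without reading a task log by eye, the pause events can be counted with a short script. This is only a sketch: the regular expression is derived solely from the log lines quoted in the post above, the number in parentheses is assumed to be a process ID, and `count_pauses` is a hypothetical helper name, not part of BOINC.

```python
import re
from collections import Counter

# Pattern built from the lines quoted above; the field in parentheses
# is assumed to identify the wrapper process (assumption).
STATE_CHANGE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \((\d+)\): "
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)


def count_pauses(log_lines):
    """Count transitions into the 'paused' state, grouped by process ID."""
    pauses = Counter()
    for line in log_lines:
        match = STATE_CHANGE.match(line.strip())
        if match and match.group(3) == "paused":
            pauses[match.group(1)] += 1
    return pauses


# Sample lines taken from the post above.
sample = [
    "2018-11-09 17:04:13 (4556): Powering off VM.",
    "2018-11-03 16:30:12 (5468): VM state change detected. (old = 'running', new = 'paused')",
    "2018-11-03 16:31:25 (5468): VM state change detected. (old = 'paused', new = 'running')",
    "2018-11-03 16:31:36 (5468): VM state change detected. (old = 'running', new = 'paused')",
    "2018-11-03 16:31:57 (5468): VM state change detected. (old = 'paused', new = 'running')",
    "2018-11-03 16:32:18 (5468): VM state change detected. (old = 'running', new = 'paused')",
]

for pid, n in count_pauses(sample).items():
    print(f"process {pid}: paused {n} times")
```

Three pauses within about two minutes, as in this sample, is the kind of frequent interruption the post warns about.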
Send message Joined: 17 Nov 16 Posts: 6 Credit: 752,169 RAC: 0 |
Thanks for the information. Can you move my request to the "Number crunching" board? (Can I do it myself?) Regards, |
Send message Joined: 17 Nov 16 Posts: 6 Credit: 752,169 RAC: 0 |
Message has been moved. Thanks :) Please do not close the thread; I'm trying to reproduce some problems I encountered in the past. It may take some time. |
Send message Joined: 17 Nov 16 Posts: 6 Credit: 752,169 RAC: 0 |
Hi there, thanks for watching. I just found the cause of the error, "VM job unmanageable": 2018-11-16 10:57:20 PM | LHC@home | Task JW8MDm3I3htnyYickojUe11pABFKDmABFKDm2oAMDmABFKDm50CVOm_0 postponed for 86400 seconds: VM job unmanageable, restarting later. That's why I got the switching between states that seems to cause the problem: 2018-11-03 16:30:12 (5468): VM state change detected. (old = 'running', new = 'paused') 2018-11-03 16:31:25 (5468): VM state change detected. (old = 'paused', new = 'running') ... But this time, I configured BOINC to hold only one LHC@home task, and there are no more "VM state change detected" messages. We will see whether the task finishes in error. Stay tuned :) |
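The event log line quoted above carries the task name, the delay, and the reason in one string. A minimal sketch for pulling them apart follows; the pattern is inferred from that single quoted line only, and `parse_postponed` is a hypothetical helper, not a BOINC function.

```python
import re

# Pattern inferred from the one event log line quoted in the post above.
POSTPONED = re.compile(r"Task (\S+) postponed for (\d+) seconds: (.+)$")


def parse_postponed(line):
    """Return task name, delay in hours and reason, or None if no match."""
    match = POSTPONED.search(line)
    if match is None:
        return None
    task, seconds, reason = match.groups()
    return {"task": task, "hours": int(seconds) / 3600, "reason": reason.strip()}


line = (
    "2018-11-16 10:57:20 PM | LHC@home | Task "
    "JW8MDm3I3htnyYickojUe11pABFKDmABFKDm2oAMDmABFKDm50CVOm_0 "
    "postponed for 86400 seconds: VM job unmanageable, restarting later."
)

info = parse_postponed(line)
print(f"{info['task']} postponed for {info['hours']:.0f} hours: {info['reason']}")
```

The 86400 seconds in the message correspond to 24 hours, which is why restarting the client by hand gets the task going again sooner than waiting.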
Send message Joined: 27 Sep 08 Posts: 798 Credit: 644,744,234 RAC: 233,516 |
I would set "Use at most" to 100% in the usage limits. If you don't want to use 100%, you can limit the number of jobs in the project configuration. |
Send message Joined: 14 Jan 10 Posts: 1268 Credit: 8,421,616 RAC: 2,139 |
Maybe you are using the default setting (25%) in BOINC Manager - Computing tab - When to suspend: "Suspend when non-BOINC CPU usage is above ... %". If so, untick it or set it to 100%. |
Send message Joined: 1 Sep 04 Posts: 57 Credit: 2,831,592 RAC: 53 |
!RunTime Errors! R6025 -pure virtual function call |
Send message Joined: 17 Nov 16 Posts: 6 Credit: 752,169 RAC: 0 |
Thanks for your suggestions. About the task JW8MDm3I3htnyYickojUe11pABFKDmABFKDm2oAMDmABFKDm50CVOm_0: - The task finished after multiple BOINC client restarts (File / Exit BOINC + wait 60 seconds + start the BOINC client), and - the task was reported as "Completed and validated". This is cool :) Later, I got a task (no. 208883557) with a bunch of "VM state change detected" messages. After four days (and 222,971 seconds of computing), the task was reported as "Error while computing"... even though I had configured BOINC to hold only one LHC@home task (Store at least 0 days of work / Store up to an additional 0 days of work). I configured BOINC to take 75% of CPU power because: - It helps Windows do its own work - It is a noise-related "sweet spot": working without being audible / annoying. I have "Use at most 100% of CPU time" set. I only suspend when the computer is on battery. (Grumpy, I don't understand your observation and/or what to do with it. Sorry.) The main problem here is that LHC needs a lot of supervision and a lot of patience, only to end with failed work. :( Is there a way to configure BOINC to do this: - If it detects "VM job unmanageable", it just restarts itself (and the task then continues by itself). Have a nice day, everybody |
Send message Joined: 27 Sep 08 Posts: 798 Credit: 644,744,234 RAC: 233,516 |
|