Message boards : Theory Application : Tasks run 4 days and finish with error
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 3
Message 43178 - Posted: 3 Aug 2020, 23:38:03 UTC - in response to Message 43177.  

I'm getting very good at aborting defunct jobs on BOINC manager and deleting the corresponding machines (which don't clean up by themselves) on VirtualBox.


What do you mean by don't clean up? I've just been aborting the Boinc tasks. I just had a look at a machine I think I've done this on, and there are some Virtualbox tasks in Windows 10 task manager doing nothing, but they're only using 1MB out of 36GB of RAM. So no big deal?

I wonder if I am the only one being afflicted by ~80% of the jobs I receive stalling in an error?


Well, I'm only seeing problems with about 1% of mine. They always sit at the beginning and never start - the wall time and CPU time stay at "-".
ID: 43178 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2099
Credit: 159,812,775
RAC: 143,794
Message 43180 - Posted: 4 Aug 2020, 0:30:11 UTC - in response to Message 43177.  

This seems to be a network problem.
When your Theory task is not able to connect to the CERN servers (sft, grid, ...) when it starts up inside VirtualBox,
then you get these problems.
ID: 43180 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2411
Credit: 226,312,211
RAC: 131,995
Message 43183 - Posted: 4 Aug 2020, 6:00:52 UTC - in response to Message 43177.  

You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10551677
https://lhcathome.cern.ch/lhcathome/result.php?resultid=280249201
VBoxManage: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)
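For context: SVM is AMD's hardware virtualisation extension (AMD-V), and VirtualBox refuses to start the LHC@home VMs without it. If you want to check a Linux host yourself, here is a minimal, hypothetical Python sketch - it is not part of BOINC or vboxwrapper - that looks for the CPU flag and for /dev/kvm. The definitive test is still what VBoxManage itself reports.

# check_virt.py - rough check whether hardware virtualisation looks usable.
# Assumptions: x86 Linux with a normal /proc/cpuinfo layout; /dev/kvm is
# usually only created when the kvm module could actually enable the feature.
import os
import re

def cpu_has_virt_flag():
    # 'svm' = AMD-V, 'vmx' = Intel VT-x. Note: the flag may still be listed
    # even when the feature is locked off in the BIOS/UEFI setup.
    with open("/proc/cpuinfo") as f:
        flags = " ".join(line for line in f if line.startswith("flags"))
    return bool(re.search(r"\b(svm|vmx)\b", flags))

def kvm_device_present():
    # A missing /dev/kvm on a host that shows the flag above is a hint that
    # virtualisation is disabled in the firmware.
    return os.path.exists("/dev/kvm")

if __name__ == "__main__":
    print("CPU advertises SVM/VT-x:", cpu_has_virt_flag())
    print("/dev/kvm present:", kvm_device_present())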




https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10643884
Most of the tasks from your Windows computer show this behavior:
2020-08-03 13:59:05 (28916): Detected: vboxwrapper 26197
2020-08-03 14:21:51 (28916): Stopping VM.
2020-08-03 14:25:33 (6100): Detected: vboxwrapper 26197  # This is a restart

The VM stops and restarts after just a few minutes.
This is a very (disk) expensive behavior and may also cause network issues since the connections have to be reestablished after each restart.

The solution would be to avoid task switching and to reduce the number of concurrently running LHC@home tasks.

In addition you may check your network:
- attaching a computer via wi-fi is not recommended. At least it requires an extremely stable wi-fi channel.
- is your internet connection powerful enough to handle all the traffic?
ID: 43183 · Report as offensive     Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 3
Message 43184 - Posted: 4 Aug 2020, 9:54:00 UTC - in response to Message 43183.  
Last modified: 4 Aug 2020, 9:56:44 UTC

You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.


Yes I know, it's probably a shortcoming of Boinc and nothing can be done. But is there no way of informing people they have important stuff like this disabled? If it can't come through Boinc, an automated email from the server?

Most of the tasks from your Windows computer show this behavior:
2020-08-03 13:59:05 (28916): Detected: vboxwrapper 26197
2020-08-03 14:21:51 (28916): Stopping VM.
2020-08-03 14:25:33 (6100): Detected: vboxwrapper 26197  # This is a restart

The VM stops and restarts after just a few minutes.
This is a very (disk) expensive behavior and may also cause network issues since the connections have to be reestablished after each restart.

The solution would be to avoid task switching and to reduce the number of concurrently running LHC@home tasks.


I think the Boinc default for switching tasks is pretty short (60 minutes). I changed mine to infinity (or whatever the largest number it would accept is). I see no point in stopping a running task to do a bit of another one.
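For reference, this is the "Switch between tasks every X minutes" setting in the computing preferences. A minimal sketch of a local override, assuming a stock BOINC client that reads global_prefs_override.xml from its data directory; 1200 is just an example value:

<global_preferences>
  <cpu_scheduling_period_minutes>1200</cpu_scheduling_period_minutes>
</global_preferences>

After saving the file, tell the client to re-read the local preferences from the Manager, or simply restart it.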
ID: 43184 · Report as offensive     Reply Quote
Henry Nebrensky

Send message
Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 511
Message 43185 - Posted: 4 Aug 2020, 14:45:50 UTC - in response to Message 43184.  
Last modified: 4 Aug 2020, 14:46:33 UTC

You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.

Yes I know, it's probably a shortcoming of Boinc and nothing can be done. But is there no way of informing people they have important stuff like this disabled? If it can't come through Boinc, an automated email from the server?

I'm not sure it's any of BOINC's business - it's entirely valid for users to run tasks (Sixtrack) and other projects that don't use virtualisation, so why should they be lumbered with a mass of irrelevant emails?
Surely it's for the VirtualBox installer to make it clear that the process is incomplete? This crops up often enough that something's clearly missing. (Obligatory reminder of Yeti's checklist !)

I think the Boinc default for switching tasks is pretty short. I changed mine to infinity (or whatever the largest number it would accept is). I see no point in stopping a running task to do a bit of another one.
I'm still on 1200min (20 hours) - long enough that sane tasks should have finished, but leaves the client able to swap out never-ending Sherpas et al. if it wants.
On a brighter note, Theory tasks do seem to have been much better behaved over the past couple of months!
ID: 43185 · Report as offensive     Reply Quote
Chris Jenks

Send message
Joined: 16 Jun 06
Posts: 10
Credit: 3,245,056
RAC: 0
Message 43186 - Posted: 4 Aug 2020, 14:50:14 UTC - in response to Message 43178.  

I'm getting very good at aborting defunct jobs on BOINC manager and deleting the corresponding machines (which don't clean up by themselves) on VirtualBox.


What do you mean by don't clean up? I've just been aborting the Boinc tasks. I just had a look at a machine I think I've done this on, and there are some Virtualbox tasks in Windows 10 task manager doing nothing, but they're only using 1MB out of 36GB of RAM. So no big deal?

What I mean by VirtualBox not cleaning itself up is that after I abort a crashed LHC process in BOINC Manager, the corresponding machine in VirtualBox immediately changes status to Powered Off, but it stays that way indefinitely. I have to manually delete each machine to keep them from accumulating. Fortunately, both BOINC Manager and VirtualBox allow me to select a range of jobs/machines to abort/delete, making this less tedious than selecting each one at a time.
ID: 43186 · Report as offensive     Reply Quote
Chris Jenks

Send message
Joined: 16 Jun 06
Posts: 10
Credit: 3,245,056
RAC: 0
Message 43187 - Posted: 4 Aug 2020, 15:10:22 UTC - in response to Message 43183.  

You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10551677
https://lhcathome.cern.ch/lhcathome/result.php?resultid=280249201
VBoxManage: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)


I know. I tried getting VirtualBox running on an old laptop to run the LHC@home jobs besides SixTrack but got stuck and decided it wasn't worth it.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10643884
Most of the tasks from your Windows computer show this behavior:
2020-08-03 13:59:05 (28916): Detected: vboxwrapper 26197
2020-08-03 14:21:51 (28916): Stopping VM.
2020-08-03 14:25:33 (6100): Detected: vboxwrapper 26197  # This is a restart

The VM stops and restarts after just a few minutes.
This is a very (disk) expensive behavior and may also cause network issues since the connections have to be reestablished after each restart.

The solution would be to avoid task switching and to reduce the number of concurrently running LHC@home tasks.

In addition you may check your network:
- attaching a computer via wi-fi is not recommended. At least it requires an extremely stable wi-fi channel.
- is your internet connection powerful enough to handle all the traffic?

My local configuration had tasks switching every 60 minutes. I changed this to 1,200 minutes. I noticed that if I reduce the resource share for LHC@Home so that only a few jobs download at a time, they seem to work. I also notice that when a bunch of jobs, say ten, start at the same time, the ones that are spared from crashing tend to be the first and/or the last in the list.

I've been mostly happy with the hardwired DSL my PC is using. While I agree that this does look like a network error, my download rate is about 400 kilobytes/second, which is the fastest DSL I've had from several locations in the Sacramento, California area. Once in a while the network will be slow or fail, but this is such a small minority of the time that it can't account for the constant stream of failed jobs. Maybe the error is a timeout due to a bunch of jobs trying to download something simultaneously. It would be nice if this timeout error could be handled more gracefully than wasting a CPU thread for 10 days.
ID: 43187 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2411
Credit: 226,312,211
RAC: 131,995
Message 43188 - Posted: 4 Aug 2020, 16:01:45 UTC - in response to Message 43187.  

400 kilobytes/second maximum rate.
This could indeed cause trouble if you start or restart too many VMs concurrently.
The advice would be to limit the number of concurrently running tasks (VMs) until they run stably.

You may start at a low number and increase it step by step until you find the number at which the trouble returns.
ID: 43188 · Report as offensive     Reply Quote
Chris Jenks

Send message
Joined: 16 Jun 06
Posts: 10
Credit: 3,245,056
RAC: 0
Message 43189 - Posted: 4 Aug 2020, 17:37:38 UTC - in response to Message 43188.  

400 kilobytes/second maximum rate.
This could indeed cause trouble if you start or restart too many VMs concurrently.
The advice would be to limit the number of concurrently running tasks (VMs) until they run stably.

You may start at a low number and increase it step by step until you find the number at which the trouble returns.

I will scale my resource share back and hope that these bugs get fixed someday so I can return to full participation:

1. The timeout should be increased to account for multiple jobs being launched simultaneously on a DSL-connected PC.
2. When a network operation fails, it should either be re-tried or the job should close, not sit allocating a slot in BOINC's active job listing.
3. When a job is aborted in BOINC Manager, the corresponding virtual machine should eventually be removed automatically.

Thanks for your ideas and suggestions, everyone.
ID: 43189 · Report as offensive     Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 3
Message 43191 - Posted: 4 Aug 2020, 18:55:28 UTC - in response to Message 43185.  
Last modified: 4 Aug 2020, 18:59:30 UTC

Henry Nebrensky wrote:

I'm not sure it's any of BOINC's business - it's entirely valid for users to run tasks (Sixtrack) and other projects that don't use virtualisation, so why should they be lumbered with a mass of irrelevant emails?
Surely it's for the VirtualBox installer to make it clear that the process is incomplete? This crops up often enough that something's clearly missing. (Obligatory reminder of Yeti's checklist !)


That's not what I meant. I meant an email sent only to anyone returning invalid tasks.

And I mentioned Boinc in case there was a way Boinc could put up a notification saying there was a problem.

Henry Nebrensky wrote:
I'm still on 1200min (20 hours) - long enough that sane tasks should have finished, but leaves the client able to swap out never-ending Sherpas et al. if it wants.
On a brighter note, Theory tasks do seem to have been much better behaved over the past couple of months!


I just let them run for the 10 day limit. It's only 1 core, and swapping them out means they're going to be returned late.

Chris Jenks wrote:
What I mean by VirtualBox not cleaning itself up is that after I abort a crashed LHC process in BOINC Manager, the corresponding machine in VirtualBox immediately changes status to Powered Off, but it stays that way indefinitely. I have to manually delete each machine to keep them from accumulating. Fortunately, both BOINC Manager and VirtualBox allow me to select a range of jobs/machines to abort/delete, making this less tedious than selecting each one at a time.


Do they use up resources somewhere? I just leave mine sitting idle.

Chris Jenks wrote:
I know. I tried getting VirtualBox running on an old laptop to run the LHC@home jobs besides SixTrack but got stuck and decided it wasn't worth it.


I got Virtualbox running on a few old machines, but got a lot of crashes with Atlas/Theory/CMS, so decided to leave them to run Sixtrack and other projects only.

These machines were not happy:
Core 2 Q8400 quad core, 8GB DDR2 RAM
I3 M350 quad core laptop, 8GB DDR3 RAM
Pentium N3700 quad core, 8GB DDR3 RAM

These machines are ok:
i5-8600K six core, 16GB DDR4 RAM
Dual Xeon X5650 12 core x 2, 36GB DDR3 RAM
Another dual Xeon X5650 12 core x 2, 36GB DDR3 RAM

Chris Jenks wrote:
I've been mostly happy with the hardwired DSL my PC is using. While I agree that this does look like a network error, my download rate is about 400 kilobytes/second, which is the fastest DSL I've had from several locations in the Sacramento, California area. Once in a while the network will be slow or fail, but this is such a small minority of the time that it can't account for the constant stream of failed jobs. Maybe the error is a timeout due to a bunch of jobs trying to download something simultaneously. It would be nice if this timeout error could be handled more gracefully than wasting a CPU thread for 10 days.


Ouch, DSL. The UK is all fibre, 6.75MBytes download rate here. Not sure how impatient LHC tasks/servers are, but I've seen a single Theory max out my line for a few minutes, so there must be a lot of data transfer. If you're starting 10 at once on a slower line, that could be a problem. With normal Boinc tasks, Boinc only allows two connections per project at a time, but I think with Theory doing its own downloading, you might get loads of files competing. But if only one or two are started at once it should be ok. If you manually allow one at a time until all the cores are running, they should all finish at random times and new ones will be started one at a time.
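For reference, that "two connections per project" behaviour matches the BOINC client's default transfer limits, which can be changed in cc_config.xml in the data directory. A minimal sketch with example values only; note this only covers transfers the BOINC client itself does, not downloads the Theory VM performs internally:

<cc_config>
  <options>
    <max_file_xfers>4</max_file_xfers>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>

The client picks this up after "Options -> Read config files" in the Manager, or after a restart.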
ID: 43191 · Report as offensive     Reply Quote
Chris Jenks

Send message
Joined: 16 Jun 06
Posts: 10
Credit: 3,245,056
RAC: 0
Message 43196 - Posted: 6 Aug 2020, 13:02:50 UTC - in response to Message 43191.  

I had been sharing my 16 threads equally between LHC and Rosetta. When I changed resource share to 50 for LHC and 100 for Rosetta, I seem to be having fewer crashed jobs. This morning I found seven Theory jobs running, none crashed, which is more than I remember seeing in a long time.
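(For reference, BOINC hands out work roughly in proportion to resource share, so 50 versus 100 gives LHC@home about 50/150 = one third of the machine over time - on a 16-thread box that is roughly 5 threads on average instead of 8.)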
ID: 43196 · Report as offensive     Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 3
Message 43197 - Posted: 6 Aug 2020, 15:52:23 UTC - in response to Message 43196.  

I had been sharing my 16 threads equally between LHC and Rosetta. When I changed resource share to 50 for LHC and 100 for Rosetta, I seem to be having fewer crashed jobs. This morning I found seven Theory jobs running, none crashed, which is more than I remember seeing in a long time.


You can also use "max_concurrent" in an app_config.xml file to limit precisely how many Theory tasks run.
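A minimal sketch of such a file, placed in the LHC@home project folder; it assumes the application name is "Theory" (check the exact name in the client log or on the project's Applications page), and the numbers are only examples:

<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <project_max_concurrent>4</project_max_concurrent>
</app_config>

The client reads it after "Options -> Read config files" or a restart.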
ID: 43197 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2099
Credit: 159,812,775
RAC: 143,794
Message 43919 - Posted: 15 Dec 2020, 10:29:51 UTC

Four days with success, Pythia8 - [boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 48000 158]
https://lhcathome.cern.ch/lhcathome/result.php?resultid=290783284
Run time: 3 days 23 hours 57 min 39 sec
CPU time: 3 days 21 hours 49 min 44 sec
ID: 43919 · Report as offensive     Reply Quote
[VENETO] boboviz
Avatar

Send message
Joined: 7 May 08
Posts: 195
Credit: 1,504,161
RAC: 113
Message 43991 - Posted: 26 Dec 2020, 9:01:43 UTC

This is my error with loooong WUs (on the VM console):
Probing /cvmfs/grid.cern.ch....failed!!
cranky: [ERROR] 'cvmfs_config probe grid.cern.ch' failed
ERROR Could not source logging functions from /cvmfs/grid.cern.ch/vc/bin/logging_functions.
ID: 43991 · Report as offensive     Reply Quote
Sesson

Send message
Joined: 4 Apr 19
Posts: 31
Credit: 3,871,297
RAC: 8,962
Message 43993 - Posted: 26 Dec 2020, 9:47:39 UTC - in response to Message 43919.  

Four days with success, Pythia8 - [boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 48000 158]


Yes, some time ago these jobs contained 100k events and could take a week to complete; now they contain only 48k events. Some scientist must have changed that.
ID: 43993 · Report as offensive     Reply Quote
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 120
Credit: 52,013,821
RAC: 26,027
Message 44293 - Posted: 11 Feb 2021, 11:04:53 UTC - in response to Message 43993.  
Last modified: 11 Feb 2021, 11:05:11 UTC

Can a Theory task succeed if, in the VM console (Alt+F2), I see only the string "Running jobs output should appear here"?
ID: 44293 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1280
Credit: 8,488,906
RAC: 1,868
Message 44300 - Posted: 11 Feb 2021, 17:28:58 UTC - in response to Message 44293.  

Can a Theory task succeed if, in the VM console (Alt+F2), I see only the string "Running jobs output should appear here"?

No
ID: 44300 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 1127
Credit: 49,749,782
RAC: 9,696
Message 44302 - Posted: 11 Feb 2021, 23:54:27 UTC - in response to Message 44293.  

Can a Theory task succeed if, in the VM console (Alt+F2), I see only the string "Running jobs output should appear here"?


When you check a running task and the log page tells you that, it means the job inside the VM is finished. Tasks tend to continue running for several minutes before they are officially done in the Boinc Manager, so let that particular task keep running. You can also check the task in the VirtualBox Manager; the log there will tell you that it is finished in VirtualBox but not ready to complete and send back yet.

(I have watched thousands of these run and seen them say "Running jobs output should appear here" when I check how close to finished a task is.)
ID: 44302 · Report as offensive     Reply Quote
[VENETO] boboviz
Avatar

Send message
Joined: 7 May 08
Posts: 195
Credit: 1,504,161
RAC: 113
Message 44508 - Posted: 18 Mar 2021, 6:28:20 UTC
Last modified: 18 Mar 2021, 6:33:14 UTC

My errors after 20 hours of run time:
305482630
305481267
ID: 44508 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2411
Credit: 226,312,211
RAC: 131,995
Message 44511 - Posted: 18 Mar 2021, 7:54:42 UTC - in response to Message 44508.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=305482630

Your logfiles show this a couple of times.
In most cases the VM can recover at the next restart but not always:
2021-03-17 07:14:50 (6428): Error in stop VM for VM: -108



In this case the BOINC client died before vboxwrapper and the VM were properly shut down:
2021-03-17 18:36:46 (17480): Stopping VM.
18:37:03 (17480): BOINC client no longer exists - exiting
18:37:03 (17480): timer handler: client dead, exiting
18:37:14 (17480): BOINC client no longer exists - exiting
18:37:14 (17480): timer handler: client dead, exiting



Unclean shutdowns can result in an unusable vm_image.vdi in the slots folder.
At the very end the VM hangs and can't update its heartbeat file.
Vboxwrapper then terminates the VM and reports it as lost.
2021-03-18 02:21:03 (7856): VM Heartbeat file specified, but missing heartbeat.
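The idea behind that check is simple: the VM is expected to touch a heartbeat file at a regular interval, and the wrapper treats a stale or missing file as a hung VM. A rough, illustrative Python sketch of the concept (not vboxwrapper's actual code; the file name and interval are made up):

import os
import time

HEARTBEAT_FILE = "heartbeat"   # hypothetical file the VM is expected to touch
MAX_AGE_SECONDS = 1200         # hypothetical allowed silence before giving up

def vm_seems_alive():
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except OSError:
        return False           # file missing: no heartbeat at all
    return age <= MAX_AGE_SECONDS

# A wrapper loop would poll this periodically and power off / report the VM
# once it keeps returning False.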



Stopping/restarting a VM causes a high peak load on the I/O system for a short time.
Timings are influenced by the disk speed but also by the amount of RAM available for the OS disk cache and its internal write delay timers.
ID: 44511 · Report as offensive     Reply Quote