Message boards :
Theory Application :
Tasks run 4 days and finish with error
Joined: 12 Aug 06 · Posts: 418 · Credit: 5,667,249 · RAC: 3
> I'm getting very good at aborting defunct jobs in BOINC Manager and deleting the corresponding machines (which don't clean up by themselves) in VirtualBox.

What do you mean by "don't clean up"? I've just been aborting the BOINC tasks. I just had a look at a machine I think I've done this on, and there are some VirtualBox processes in the Windows 10 Task Manager doing nothing, but they're only using 1 MB out of 36 GB of RAM. So no big deal?

> I wonder if I am the only one being afflicted by ~80% of the jobs I receive stalling in an error?

Well, I'm only getting about 1% with problems. They always sit at the beginning, never starting: the wall time and CPU time stay at "-".
Joined: 2 May 07 · Posts: 2100 · Credit: 159,816,975 · RAC: 134,993
This seems to be a network problem. When your Theory task is not able to connect to the CERN servers (sft, grid, ...) at the start of the task in VirtualBox, you get these problems.
Joined: 15 Jun 08 · Posts: 2411 · Credit: 226,401,929 · RAC: 131,707
You are running a Linux computer where SVM is disabled. This causes all vbox tasks from LHC@home to fail:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10551677
https://lhcathome.cern.ch/lhcathome/result.php?resultid=280249201
VBoxManage: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)

Most of the tasks from your Windows computer (https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10643884) show this behavior:
2020-08-03 13:59:05 (28916): Detected: vboxwrapper 26197
2020-08-03 14:21:51 (28916): Stopping VM.
2020-08-03 14:25:33 (6100): Detected: vboxwrapper 26197  # This is a restart

The VM stops and restarts after just a few minutes. This is very expensive in disk I/O and may also cause network issues, since the connections have to be re-established after each restart. The solution would be to avoid task switching and to reduce the number of concurrently running LHC@home tasks.

In addition, you may check your network:
- Attaching a computer via Wi-Fi is not recommended; at the very least it requires an extremely stable Wi-Fi channel.
- Is your internet connection powerful enough to handle all the traffic?
Joined: 12 Aug 06 · Posts: 418 · Credit: 5,667,249 · RAC: 3
> You are running a linux computer where SVM is disabled.

Yes, I know; it's probably a shortcoming of BOINC and nothing can be done. But is there no way of informing people that they have important stuff like this disabled? If it can't come through BOINC, how about an automated email from the server?

> Most of the tasks from your Windows computer show this behavior:

I think the BOINC default for switching tasks is pretty short. I changed mine to infinity (or whatever the largest number it would accept is). I see no point in stopping a running task to do a bit of another one.
Joined: 13 Jul 05 · Posts: 167 · Credit: 14,945,019 · RAC: 511
> You are running a linux computer where SVM is disabled.

I'm not sure it's any of BOINC's business: it's entirely valid for users to run tasks (SixTrack) and other projects that don't use virtualisation, so why should they be lumbered with a mass of irrelevant emails? Surely it's for the VirtualBox installer to make it clear that the process is incomplete? This crops up often enough that something's clearly missing. (Obligatory reminder of Yeti's checklist!)

> I think the Boinc default for switching tasks is pretty short. I changed mine to infinity (or whatever the largest number it would accept is). I see no point in stopping a running task to do a bit of another one.

I'm still on 1200 min (20 hours): long enough that sane tasks should have finished, but it leaves the client able to swap out never-ending Sherpas et al. if it wants.

On a brighter note, Theory tasks do seem to have been much better behaved over the past couple of months!
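For anyone who prefers to script it, the "switch between tasks" preference discussed above can also be set in global_prefs_override.xml in the BOINC data directory (the same setting is reachable via Options > Computing preferences in BOINC Manager). A minimal sketch, assuming a Linux client with the data directory at /var/lib/boinc-client (adjust for your platform); note that writing this file replaces any other local preference overrides you already have:

```shell
# Sketch: raise the task-switch interval to 1200 minutes so running
# VMs are not swapped out mid-job. BOINC_DIR is an assumed location;
# on Windows the data directory is typically C:\ProgramData\BOINC.
BOINC_DIR=/var/lib/boinc-client
cat > "$BOINC_DIR/global_prefs_override.xml" <<'EOF'
<global_preferences>
  <cpu_scheduling_period_minutes>1200</cpu_scheduling_period_minutes>
</global_preferences>
EOF
# Tell the running client to re-read the override file:
boinccmd --read_global_prefs_override
```
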
Joined: 16 Jun 06 · Posts: 10 · Credit: 3,245,056 · RAC: 0
> I'm getting very good at aborting defunct jobs on BOINC manager and deleting the corresponding machines (which don't clean up by themselves) on VirtualBox.

What I mean by VirtualBox not cleaning itself up is that after I abort a crashed LHC process in BOINC Manager, the corresponding machine in VirtualBox immediately changes status to Powered Off, but it stays that way indefinitely. I have to manually delete each machine to keep them from accumulating. Fortunately, both BOINC Manager and VirtualBox let me select a range of jobs/machines to abort/delete, making this less tedious than selecting them one at a time.
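The manual deletion described above can also be scripted. A hedged sketch using VBoxManage (it assumes the leftover machines follow vboxwrapper's usual "boinc_..." naming; check with `VBoxManage list vms` first, and run it only when no LHC@home vbox task is active, otherwise you could delete a machine BOINC still owns):

```shell
# Sketch: unregister and delete every powered-off VirtualBox machine
# whose name starts with "boinc_". Assumes VBoxManage is on PATH.
VBoxManage list vms | grep '"boinc_' | grep -o '{[0-9a-f-]*}' | tr -d '{}' |
while read -r uuid; do
    # --machinereadable output contains a line like VMState="poweroff"
    state=$(VBoxManage showvminfo "$uuid" --machinereadable | grep '^VMState=' | cut -d'"' -f2)
    if [ "$state" = "poweroff" ]; then
        VBoxManage unregistervm "$uuid" --delete
    fi
done
```
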
Joined: 16 Jun 06 · Posts: 10 · Credit: 3,245,056 · RAC: 0
> You are running a linux computer where SVM is disabled.

I know. I tried getting VirtualBox running on an old laptop to run the LHC@home jobs besides SixTrack, but got stuck and decided it wasn't worth it.

> https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10643884

My local configuration had tasks switching every 60 minutes; I changed this to 1,200 minutes. I noticed that if I reduce the resource share for LHC@home so that only a few jobs download at a time, they seem to work. I also notice that when a bunch of jobs, say ten, start at the same time, the ones that are spared from crashing tend to be the first and/or the last in the list.

I've been mostly happy with the hardwired DSL my PC is using. While I agree that this does look like a network error, my download rate is about 400 kilobytes/second, which is the fastest DSL I've had from several locations in the Sacramento, California area. Once in a while the network will be slow or fail, but this is such a small minority of the time that it can't account for the constant stream of failed jobs. Maybe the error is a timeout due to a bunch of jobs trying to download something simultaneously. It would be nice if this timeout could be handled more gracefully than wasting a CPU thread for 10 days.
Joined: 15 Jun 08 · Posts: 2411 · Credit: 226,401,929 · RAC: 131,707
> 400 kilobytes/second maximum rate.

This could indeed cause trouble if you start or restart too many VMs concurrently. The advice would be to limit the number of concurrently running tasks (VMs) until they run stably. You may start at a low number and gradually increase it until you find the point at which the trouble returns.
Joined: 16 Jun 06 · Posts: 10 · Credit: 3,245,056 · RAC: 0
> 400 kilobytes/second maximum rate.

I will scale my resource share back and hope that these bugs get fixed someday so I can return to full participation:
1. The timeout should be increased to account for multiple jobs being launched simultaneously on a DSL-connected PC.
2. When a network operation fails, it should either be retried or the job should close, not sit occupying a slot in BOINC's active job list.
3. When a job is aborted in BOINC Manager, the corresponding virtual machine should eventually be removed automatically.

Thanks for your ideas and suggestions, everyone.
Joined: 12 Aug 06 · Posts: 418 · Credit: 5,667,249 · RAC: 3
Henry Nebrensky wrote:
> I'm not sure it's any of BOINC's business - it's entirely valid for users to run tasks (Sixtrack) and other projects that don't use virtualisation, so why should they be lumbered with a mass of irrelevant emails?

That's not what I meant. I meant an email sent only to anyone returning invalid tasks. And I mentioned BOINC in case there was a way BOINC could put up a notification saying there was a problem.

Henry Nebrensky wrote:
> I'm still on 1200min (20 hours) - long enough that sane tasks should have finished, but leaves the client able to swap out never-ending Sherpas et al. if it wants.

I just let them run to the 10-day limit. It's only one core, and swapping them out means they're going to be returned late.

Chris Jenks wrote:
> What I mean by VirtualBox not cleaning itself up is that after I abort a crashed LHC process in BOINC Manager, the corresponding machine in VirtualBox immediately changes status to Powered Off, but it stays that way indefinitely.

Do they use up resources somewhere? I just leave mine sitting idle.

Chris Jenks wrote:
> I know. I tried getting VirtualBox running on an old laptop to run the LHC@home jobs besides SixTrack but got stuck and decided it wasn't worth it.

I got VirtualBox running on a few old machines, but got a lot of crashes with Atlas/Theory/CMS, so I decided to leave them running SixTrack and other projects only.

These machines were not happy:
- Core 2 Q8400 quad core, 8 GB DDR2 RAM
- i3 M350 quad core laptop, 8 GB DDR3 RAM
- Pentium N3700 quad core, 8 GB DDR3 RAM

These machines are OK:
- i5-8600K six core, 16 GB DDR4 RAM
- Dual Xeon X5650, 12 core x 2, 36 GB DDR3 RAM
- Another dual Xeon X5650, 12 core x 2, 36 GB DDR3 RAM

Chris Jenks wrote:
> I've been mostly happy with the hardwired DSL my PC is using. While I agree that this does look like a network error, my download rate is about 400 kilobytes/second, which is the fastest DSL I've had from several locations in the Sacramento, California area.

Ouch, DSL. The UK is all fibre; 6.75 MB/s download rate here. Not sure how impatient the LHC tasks/servers are, but I've seen a single Theory task max out my line for a few minutes, so there must be a lot of data transfer. If you're starting ten at once on a slower line, that could be a problem. With normal BOINC tasks, BOINC only allows two connections per project at a time, but with Theory doing its own downloading you might get loads of files competing. If only one or two start at once, it should be OK: if you manually allow one at a time until all the cores are running, they should all finish at random times and new ones will start one at a time.
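The "two connections per project" limit mentioned above is a client default that can be tightened on a slow line via cc_config.xml in the BOINC data directory. A sketch, assuming the Linux data directory /var/lib/boinc-client and illustrative values; if you already have a cc_config.xml, merge these options into it rather than overwriting:

```shell
# Sketch: limit simultaneous file transfers so several starting VMs
# don't saturate a slow DSL line. Values are illustrative, not tuned.
BOINC_DIR=/var/lib/boinc-client   # assumed data directory
cat > "$BOINC_DIR/cc_config.xml" <<'EOF'
<cc_config>
  <options>
    <max_file_xfers>2</max_file_xfers>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>
EOF
boinccmd --read_cc_config
```

Note that this only throttles BOINC's own file transfers; downloads the Theory VM performs itself are not governed by these limits.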
Joined: 16 Jun 06 · Posts: 10 · Credit: 3,245,056 · RAC: 0
I had been sharing my 16 threads equally between LHC and Rosetta. When I changed the resource share to 50 for LHC and 100 for Rosetta, I seemed to get fewer crashed jobs. This morning I found seven Theory jobs running, none crashed, which is more than I remember seeing in a long time.
Joined: 12 Aug 06 · Posts: 418 · Credit: 5,667,249 · RAC: 3
> I had been sharing my 16 threads equally between LHC and Rosetta. When I changed resource share to 50 for LHC and 100 for Rosetta, I seem to be having fewer crashed jobs.

You can also use "max_concurrent" in the config file to limit precisely how many Theory tasks run.
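For reference, that "max_concurrent" setting goes in an app_config.xml in the project directory. A sketch, assuming a Linux client, the usual LHC@home project-directory name, and that the Theory application's short name is "Theory" (check client_state.xml if unsure); the value 4 is illustrative:

```shell
# Sketch: cap concurrently running Theory tasks at 4 (illustrative).
# PROJECT_DIR is an assumed path; adjust to your installation.
PROJECT_DIR=/var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome
cat > "$PROJECT_DIR/app_config.xml" <<'EOF'
<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>
EOF
boinccmd --read_cc_config   # also re-reads app_config.xml files
```

No client restart is needed; re-reading the config files picks up the new limit.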
Joined: 2 May 07 · Posts: 2100 · Credit: 159,816,975 · RAC: 134,993
Four days with success: Pythia8 - [boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 48000 158]
https://lhcathome.cern.ch/lhcathome/result.php?resultid=290783284
Run time: 3 days 23 hours 57 min 39 sec
CPU time: 3 days 21 hours 49 min 44 sec
Joined: 7 May 08 · Posts: 195 · Credit: 1,504,161 · RAC: 113
This is my error with very long WUs (in the VM console):
Probing /cvmfs/grid.cern.ch....failed!!
crancky: [ERROR] 'cvmfs_config probe grid.cern.ch' failded
ERROR Could not source logging functions from /cvmfs/grid.cern.ch/vc/bin/logging_functions.
Joined: 4 Apr 19 · Posts: 31 · Credit: 3,882,229 · RAC: 9,411
> Four days with success: Pythia8 - [boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 48000 158]

Yes, some time ago this was a job with 100k events and could take a week to complete; now it contains only 48k events. Some scientist must have changed that.
Joined: 18 Nov 17 · Posts: 121 · Credit: 52,034,142 · RAC: 26,305
Can a Theory task succeed if, in the VM console via Alt+F2, I see only the string "Running jobs output should appear here"?
Joined: 14 Jan 10 · Posts: 1280 · Credit: 8,491,903 · RAC: 2,069
> Can a Theory task succeed if, in the VM console via Alt+F2, I see only the string "Running jobs output should appear here"?

No.
Joined: 24 Oct 04 · Posts: 1127 · Credit: 49,750,905 · RAC: 9,376
> Can a Theory task succeed if, in the VM console via Alt+F2, I see only the string "Running jobs output should appear here"?

When you check a running task and the log page tells you that, it means the job is finished. Tasks tend to keep running for several minutes before they are officially done in the BOINC Manager task list, so let that task continue. You can also check the task in the VirtualBox Manager; the log there will tell you that it is finished in VirtualBox, but not yet ready to complete and report. (I have watched thousands of these run, and I see "Running jobs output should appear here" when I check how close to finished a task is.)
Joined: 7 May 08 · Posts: 195 · Credit: 1,504,161 · RAC: 113
Joined: 15 Jun 08 · Posts: 2411 · Credit: 226,401,929 · RAC: 131,707
https://lhcathome.cern.ch/lhcathome/result.php?resultid=305482630

Your logfiles show this a couple of times. In most cases the VM can recover at the next restart, but not always:
2021-03-17 07:14:50 (6428): Error in stop VM for VM: -108

In this case the BOINC client died before vboxwrapper and the VM were properly shut down:
2021-03-17 18:36:46 (17480): Stopping VM.
18:37:03 (17480): BOINC client no longer exists - exiting
18:37:03 (17480): timer handler: client dead, exiting
18:37:14 (17480): BOINC client no longer exists - exiting
18:37:14 (17480): timer handler: client dead, exiting

Unclean shutdowns like this can result in an unusable vm_image.vdi in the slots folder.

At the very end the VM hangs and can't update its heartbeat file. Vboxwrapper then terminates the VM and reports it as lost:
2021-03-18 02:21:03 (7856): VM Heartbeat file specified, but missing heartbeat.

Stopping/restarting a VM causes a high peak load on the I/O system for a short time. Timings are influenced by the disk speed, but also by the amount of RAM available for the OS disk cache and its internal write-delay timers.
©2024 CERN