Message boards : Theory Application : Tasks run 4 days and finish with error
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 3
Message 43178 - Posted: 3 Aug 2020, 23:38:03 UTC - in response to Message 43177.  

I'm getting very good at aborting defunct jobs on BOINC manager and deleting the corresponding machines (which don't clean up by themselves) on VirtualBox.


What do you mean by don't clean up? I've just been aborting the Boinc tasks. I just had a look at a machine I think I've done this on, and there are some Virtualbox tasks in Windows 10 task manager doing nothing, but they're only using 1MB out of 36GB of RAM. So no big deal?

I wonder if I am the only one being afflicted by ~80% of the jobs I receive stalling in an error?


Well, I'm only seeing problems with about 1% of mine. They always sit at the beginning and never start - the wall time and CPU time stay at "-".
ID: 43178 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2099
Credit: 159,812,775
RAC: 143,794
Message 43180 - Posted: 4 Aug 2020, 0:30:11 UTC - in response to Message 43177.  

This seems to be a network problem.
When your Theory task is not able to connect to the CERN servers (sft, grid, ...) when it starts up inside VirtualBox,
then you get these problems.
ID: 43180 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2411
Credit: 226,312,211
RAC: 131,995
Message 43183 - Posted: 4 Aug 2020, 6:00:52 UTC - in response to Message 43177.  

You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10551677
https://lhcathome.cern.ch/lhcathome/result.php?resultid=280249201
VBoxManage: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)
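For context: SVM is AMD's hardware virtualisation extension (AMD-V), and VirtualBox refuses to start the LHC@home VMs without it. If you want to check a Linux host yourself, here is a minimal, hypothetical Python sketch - it is not part of BOINC or vboxwrapper - that looks for the CPU flag and for /dev/kvm. The definitive test is still what VBoxManage itself reports.

# check_virt.py - rough check whether hardware virtualisation looks usable.
# Assumptions: x86 Linux with a normal /proc/cpuinfo layout; /dev/kvm is
# usually only created when the kvm module could actually enable the feature.
import os
import re

def cpu_has_virt_flag():
    # 'svm' = AMD-V, 'vmx' = Intel VT-x. Note: the flag may still be listed
    # even when the feature is locked off in the BIOS/UEFI setup.
    with open("/proc/cpuinfo") as f:
        flags = " ".join(line for line in f if line.startswith("flags"))
    return bool(re.search(r"\b(svm|vmx)\b", flags))

def kvm_device_present():
    # A missing /dev/kvm on a host that shows the flag above is a hint that
    # virtualisation is disabled in the firmware.
    return os.path.exists("/dev/kvm")

if __name__ == "__main__":
    print("CPU advertises SVM/VT-x:", cpu_has_virt_flag())
    print("/dev/kvm present:", kvm_device_present())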




https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10643884
Most of the tasks from your Windows computer show this behavior:
2020-08-03 13:59:05 (28916): Detected: vboxwrapper 26197
2020-08-03 14:21:51 (28916): Stopping VM.
2020-08-03 14:25:33 (6100): Detected: vboxwrapper 26197  # This is a restart

The VM stops and restarts after just a few minutes.
This is a very (disk) expensive behavior and may also cause network issues since the connections have to be reestablished after each restart.

The solution would be to avoid task switching and to reduce the number of concurrently running LHC@home tasks.

In addition you may check your network:
- attaching a computer via wi-fi is not recommended. At least it requires an extremely stable wi-fi channel.
- is your internet connection powerful enough to handle all the traffic?
ID: 43183 · Report as offensive     Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 3
Message 43184 - Posted: 4 Aug 2020, 9:54:00 UTC - in response to Message 43183.  
Last modified: 4 Aug 2020, 9:56:44 UTC

You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.


Yes I know, it's probably a shortcoming of Boinc and nothing can be done. But is there no way of informing people they have important stuff like this disabled? If it can't come through Boinc, an automated email from the server?

Most of the tasks from your Windows computer show this behavior:
2020-08-03 13:59:05 (28916): Detected: vboxwrapper 26197
2020-08-03 14:21:51 (28916): Stopping VM.
2020-08-03 14:25:33 (6100): Detected: vboxwrapper 26197  # This is a restart

The VM stops and restarts after just a few minutes.
This is a very (disk) expensive behavior and may also cause network issues since the connections have to be reestablished after each restart.

The solution would be to avoid task switching and to reduce the number of concurrently running LHC@home tasks.


I think the Boinc default for switching tasks is pretty short (60 minutes). I changed mine to infinity (or whatever the largest number it would accept is). I see no point in stopping a running task to do a bit of another one.
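For reference, this is the "Switch between tasks every X minutes" setting in the computing preferences. A minimal sketch of a local override, assuming a stock BOINC client that reads global_prefs_override.xml from its data directory; 1200 is just an example value:

<global_preferences>
  <cpu_scheduling_period_minutes>1200</cpu_scheduling_period_minutes>
</global_preferences>

After saving the file, tell the client to re-read the local preferences from the Manager, or simply restart it.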
ID: 43184 · Report as offensive     Reply Quote
Henry Nebrensky

Send message
Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 511
Message 43185 - Posted: 4 Aug 2020, 14:45:50 UTC - in response to Message 43184.  
Last modified: 4 Aug 2020, 14:46:33 UTC

You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.

Yes I know, it's probably a shortcoming of Boinc and nothing can be done. But is there no way of informing people they have important stuff like this disabled? If it can't come through Boinc, an automated email from the server?

I'm not sure it's any of BOINC's business - it's entirely valid for users to run tasks (Sixtrack) and other projects that don't use virtualisation, so why should they be lumbered with a mass of irrelevant emails?
Surely it's for the VirtualBox installer to make it clear that the process is incomplete? This crops up often enough that something's clearly missing. (Obligatory reminder of Yeti's checklist !)

I think the Boinc default for switching tasks is pretty short. I changed mine to infinity (or whatever the largest number it would accept is). I see no point in stopping a running task to do a bit of another one.
I'm still on 1200min (20 hours) - long enough that sane tasks should have finished, but leaves the client able to swap out never-ending Sherpas et al. if it wants.
On a brighter note, Theory tasks do seem to have been much better behaved over the past couple of months!
ID: 43185 · Report as offensive     Reply Quote
Chris Jenks

Send message
Joined: 16 Jun 06
Posts: 10
Credit: 3,245,056
RAC: 0
Message 43186 - Posted: 4 Aug 2020, 14:50:14 UTC - in response to Message 43178.  

I'm getting very good at aborting defunct jobs on BOINC manager and deleting the corresponding machines (which don't clean up by themselves) on VirtualBox.


What do you mean by don't clean up? I've just been aborting the Boinc tasks. I just had a look at a machine I think I've done this on, and there are some Virtualbox tasks in Windows 10 task manager doing nothing, but they're only using 1MB out of 36GB of RAM. So no big deal?

What I mean by VirtualBox not cleaning itself up is that after I abort a crashed LHC process in BOINC Manager, the corresponding machine in VirtualBox immediately changes status to Powered Off, but it stays that way indefinitely. I have to manually delete each machine to keep them from accumulating. Fortunately, both BOINC Manager and VirtualBox allow me to select a range of jobs/machines to abort/delete, making this less tedious than selecting each one at a time.
ID: 43186 · Report as offensive     Reply Quote
Chris Jenks

Send message
Joined: 16 Jun 06
Posts: 10
Credit: 3,245,056
RAC: 0
Message 43187 - Posted: 4 Aug 2020, 15:10:22 UTC - in response to Message 43183.  

You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10551677
https://lhcathome.cern.ch/lhcathome/result.php?resultid=280249201
VBoxManage: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)


I know. I tried getting VirtualBox running on an old laptop to run the LHC@home jobs besides SixTrack but got stuck and decided it wasn't worth it.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10643884
Most of the tasks from your Windows computer show this behavior:
2020-08-03 13:59:05 (28916): Detected: vboxwrapper 26197
2020-08-03 14:21:51 (28916): Stopping VM.
2020-08-03 14:25:33 (6100): Detected: vboxwrapper 26197  # This is a restart

The VM stops and restarts after just a few minutes.
This is a very (disk) expensive behavior and may also cause network issues since the connections have to be reestablished after each restart.

The solution would be to avoid task switching and to reduce the number of concurrently running LHC@home tasks.

In addition you may check your network:
- attaching a computer via wi-fi is not recommended. At least it requires an extremely stable wi-fi channel.
- is your internet connection powerful enough to handle all the traffic?

My local configuration had tasks switching every 60 minutes. I changed this to 1,200 minutes. I noticed that if I reduce the resource share for LHC@Home so that only a few jobs download at a time, they seem to work. I also notice that when a bunch of jobs, say ten, start at the same time, the ones that are spared from crashing tend to be the first and/or the last in the list.

I've been mostly happy with the hardwired DSL my PC is using. While I agree that this does look like a network error, my download rate is about 400 kilobytes/second, which is the fastest DSL I've had from several locations in the Sacramento, California area. Once in a while the network will be slow or fail, but this is such a small minority of the time that it can't account for the constant stream of failed jobs. Maybe the error is a timeout due to a bunch of jobs trying to download something simultaneously. It would be nice if this timeout error could be handled more gracefully than wasting a CPU thread for 10 days.
ID: 43187 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2411
Credit: 226,312,211
RAC: 131,995
Message 43188 - Posted: 4 Aug 2020, 16:01:45 UTC - in response to Message 43187.  

400 kilobytes/second maximum rate.
This could indeed cause trouble if you start or restart too many VMs concurrently.
The advice would be to limit the number of concurrently running tasks (VMs) until they run stably.

You may start at a low number and increase it step by step until you find the number at which the trouble returns.
ID: 43188 · Report as offensive     Reply Quote
Chris Jenks

Send message
Joined: 16 Jun 06
Posts: 10
Credit: 3,245,056
RAC: 0
Message 43189 - Posted: 4 Aug 2020, 17:37:38 UTC - in response to Message 43188.  

400 kilobytes/second maximum rate.
This could indeed cause trouble if you start or restart too many VMs concurrently.
The advice would be to limit the number of concurrently running tasks (VMs) until they run stably.

You may start at a low number and increase it step by step until you find the number at which the trouble returns.

I will scale my resource share back and hope that these bugs get fixed someday so I can return to full participation:

1. The timeout should be increased to account for multiple jobs being launched simultaneously on a DSL-connected PC.
2. When a network operation fails, it should either be re-tried or the job should close, not sit allocating a slot in BOINC's active job listing.
3. When a job is aborted in BOINC Manager, the corresponding virtual machine should eventually be removed automatically.

Thanks for your ideas and suggestions, everyone.
ID: 43189 · Report as offensive     Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 3
Message 43191 - Posted: 4 Aug 2020, 18:55:28 UTC - in response to Message 43185.  
Last modified: 4 Aug 2020, 18:59:30 UTC

Henry Nebrensky wrote:

I'm not sure it's any of BOINC's business - it's entirely valid for users to run tasks (Sixtrack) and other projects that don't use virtualisation, so why should they be lumbered with a mass of irrelevant emails?
Surely it's for the VirtualBox installer to make it clear that the process is incomplete? This crops up often enough that something's clearly missing. (Obligatory reminder of Yeti's checklist !)


That's not what I meant. I meant an email sent only to anyone returning invalid tasks.

And I mentioned Boinc in case there was a way Boinc could put up a notification saying there was a problem.

Henry Nebrensky wrote:
I'm still on 1200min (20 hours) - long enough that sane tasks should have finished, but leaves the client able to swap out never-ending Sherpas et al. if it wants.
On a brighter note, Theory tasks do seem to have been much better behaved over the past couple of months!


I just let them run for the 10 day limit. It's only 1 core, and swapping them out means they're going to be returned late.

Chris Jenks wrote:
What I mean by VirtualBox not cleaning itself up is that after I abort a crashed LHC process in BOINC Manager, the corresponding machine in VirtualBox immediately changes status to Powered Off, but it stays that way indefinitely. I have to manually delete each machine to keep them from accumulating. Fortunately, both BOINC Manager and VirtualBox allow me to select a range of jobs/machines to abort/delete, making this less tedious than selecting each one at a time.


Do they use up resources somewhere? I just leave mine sitting idle.

Chris Jenks wrote:
I know. I tried getting VirtualBox running on an old laptop to run the LHC@home jobs besides SixTrack but got stuck and decided it wasn't worth it.


I got Virtualbox running on a few old machines, but got a lot of crashes with Atlas/Theory/CMS, so decided to leave them to run Sixtrack and other projects only.

These machines were not happy:
Core 2 Q8400 quad core, 8GB DDR2 RAM
I3 M350 quad core laptop, 8GB DDR3 RAM
Pentium N3700 quad core, 8GB DDR3 RAM

These machines are ok:
i5-8600K six core, 16GB DDR4 RAM
Dual Xeon X5650 12 core x 2, 36GB DDR3 RAM
Another dual Xeon X5650 12 core x 2, 36GB DDR3 RAM

Chris Jenks wrote:
I've been mostly happy with the hardwired DSL my PC is using. While I agree that this does look like a network error, my download rate is about 400 kilobytes/second, which is the fastest DSL I've had from several locations in the Sacramento, California area. Once in a while the network will be slow or fail, but this is such a small minority of the time that it can't account for the constant stream of failed jobs. Maybe the error is a timeout due to a bunch of jobs trying to download something simultaneously. It would be nice if this timeout error could be handled more gracefully than wasting a CPU thread for 10 days.


Ouch, DSL. The UK is all fibre, 6.75MBytes download rate here. Not sure how impatient LHC tasks/servers are, but I've seen a single Theory max out my line for a few minutes, so there must be a lot of data transfer. If you're starting 10 at once on a slower line, that could be a problem. With normal Boinc tasks, Boinc only allows two connections per project at a time, but I think with Theory doing its own downloading, you might get loads of files competing. But if only one or two are started at once it should be ok. If you manually allow one at a time until all the cores are running, they should all finish at random times and new ones will be started one at a time.
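For reference, that "two connections per project" behaviour matches the BOINC client's default transfer limits, which can be changed in cc_config.xml in the data directory. A minimal sketch with example values only; note this only covers transfers the BOINC client itself does, not downloads the Theory VM performs internally:

<cc_config>
  <options>
    <max_file_xfers>4</max_file_xfers>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>

The client picks this up after "Options -> Read config files" in the Manager, or after a restart.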
ID: 43191 · Report as offensive     Reply Quote
Chris Jenks

Send message
Joined: 16 Jun 06
Posts: 10
Credit: 3,245,056
RAC: 0
Message 43196 - Posted: 6 Aug 2020, 13:02:50 UTC - in response to Message 43191.  

I had been sharing my 16 threads equally between LHC and Rosetta. When I changed resource share to 50 for LHC and 100 for Rosetta, I seem to be having fewer crashed jobs. This morning I found seven Theory jobs running, none crashed, which is more than I remember seeing in a long time.
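(For reference, BOINC hands out work roughly in proportion to resource share, so 50 versus 100 gives LHC@home about 50/150 = one third of the machine over time - on a 16-thread box that is roughly 5 threads on average instead of 8.)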
ID: 43196 · Report as offensive     Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 3
Message 43197 - Posted: 6 Aug 2020, 15:52:23 UTC - in response to Message 43196.  

I had been sharing my 16 threads equally between LHC and Rosetta. When I changed resource share to 50 for LHC and 100 for Rosetta, I seem to be having fewer crashed jobs. This morning I found seven Theory jobs running, none crashed, which is more than I remember seeing in a long time.


You can also use "max_concurrent" in an app_config.xml file to limit precisely how many Theory tasks run.
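A minimal sketch of such a file, placed in the LHC@home project folder; it assumes the application name is "Theory" (check the exact name in the client log or on the project's Applications page), and the numbers are only examples:

<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <project_max_concurrent>4</project_max_concurrent>
</app_config>

The client reads it after "Options -> Read config files" or a restart.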
ID: 43197 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2099
Credit: 159,812,775
RAC: 143,794
Message 43919 - Posted: 15 Dec 2020, 10:29:51 UTC

Four days with success, Pythia8 - [boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 48000 158]
https://lhcathome.cern.ch/lhcathome/result.php?resultid=290783284
Run time: 3 days 23 hours 57 min 39 sec
CPU time: 3 days 21 hours 49 min 44 sec
ID: 43919 · Report as offensive     Reply Quote
[VENETO] boboviz
Avatar

Send message
Joined: 7 May 08
Posts: 195
Credit: 1,504,161
RAC: 113
Message 43991 - Posted: 26 Dec 2020, 9:01:43 UTC

This is my error with loooong WUs (on the VM console):
Probing /cvmfs/grid.cern.ch....failed!!
cranky: [ERROR] 'cvmfs_config probe grid.cern.ch' failed
ERROR Could not source logging functions from /cvmfs/grid.cern.ch/vc/bin/logging_functions.
ID: 43991 · Report as offensive     Reply Quote
Sesson

Send message
Joined: 4 Apr 19
Posts: 31
Credit: 3,871,297
RAC: 8,962
Message 43993 - Posted: 26 Dec 2020, 9:47:39 UTC - in response to Message 43919.  

Four days with success, Pythia8 - [boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 48000 158]


Yes, some time ago these jobs contained 100k events and could take a week to complete; now they contain only 48k events. Some scientist must have changed that.
ID: 43993 · Report as offensive     Reply Quote
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 120
Credit: 52,013,821
RAC: 26,027
Message 44293 - Posted: 11 Feb 2021, 11:04:53 UTC - in response to Message 43993.  
Last modified: 11 Feb 2021, 11:05:11 UTC

Can a Theory task succeed if, in the VM console (Alt+F2), I see only the string "Running jobs output should appear here"?
ID: 44293 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1280
Credit: 8,488,906
RAC: 1,868
Message 44300 - Posted: 11 Feb 2021, 17:28:58 UTC - in response to Message 44293.  

Can a Theory task succeed if, in the VM console (Alt+F2), I see only the string "Running jobs output should appear here"?

No
ID: 44300 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 1127
Credit: 49,749,782
RAC: 9,696
Message 44302 - Posted: 11 Feb 2021, 23:54:27 UTC - in response to Message 44293.  

Can a Theory task succeed if, in the VM console (Alt+F2), I see only the string "Running jobs output should appear here"?


When you check a running task and the log page tells you that, it means the job inside the VM is finished. Tasks tend to continue running for several minutes before they are officially done in the Boinc Manager, so let that particular task keep running. You can also check the task in the VirtualBox Manager; the log there will tell you that it is finished in VirtualBox but not ready to complete and send back yet.

(I have watched thousands of these run and seen them say "Running jobs output should appear here" when I check how close to finished a task is.)
ID: 44302 · Report as offensive     Reply Quote
[VENETO] boboviz
Avatar

Send message
Joined: 7 May 08
Posts: 195
Credit: 1,504,161
RAC: 113
Message 44508 - Posted: 18 Mar 2021, 6:28:20 UTC
Last modified: 18 Mar 2021, 6:33:14 UTC

My errors after 20 hours of run time:
305482630
305481267
ID: 44508 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2411
Credit: 226,312,211
RAC: 131,995
Message 44511 - Posted: 18 Mar 2021, 7:54:42 UTC - in response to Message 44508.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=305482630

Your logfiles show this a couple of times.
In most cases the VM can recover at the next restart but not always:
2021-03-17 07:14:50 (6428): Error in stop VM for VM: -108



In this case the BOINC client died before vboxwrapper and the VM were properly shut down:
2021-03-17 18:36:46 (17480): Stopping VM.
18:37:03 (17480): BOINC client no longer exists - exiting
18:37:03 (17480): timer handler: client dead, exiting
18:37:14 (17480): BOINC client no longer exists - exiting
18:37:14 (17480): timer handler: client dead, exiting



Unclean shutdowns can result in an unusable vm_image.vdi in the slots folder.
At the very end the VM hangs and can't update its heartbeat file.
Vboxwrapper then terminates the VM and reports it as lost.
2021-03-18 02:21:03 (7856): VM Heartbeat file specified, but missing heartbeat.
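The idea behind that check is simple: the VM is expected to touch a heartbeat file at a regular interval, and the wrapper treats a stale or missing file as a hung VM. A rough, illustrative Python sketch of the concept (not vboxwrapper's actual code; the file name and interval are made up):

import os
import time

HEARTBEAT_FILE = "heartbeat"   # hypothetical file the VM is expected to touch
MAX_AGE_SECONDS = 1200         # hypothetical allowed silence before giving up

def vm_seems_alive():
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except OSError:
        return False           # file missing: no heartbeat at all
    return age <= MAX_AGE_SECONDS

# A wrapper loop would poll this periodically and power off / report the VM
# once it keeps returning False.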



Stopping/restarting a VM causes a high peak load on the I/O system for a short time.
Timings are influenced by the disk speed but also by the amount of RAM available for the OS disk cache and its internal write delay timers.
ID: 44511 · Report as offensive     Reply Quote