1) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 43196)
Posted 6 Aug 2020 by Chris Jenks
Post:
I had been sharing my 16 threads equally between LHC and Rosetta. When I changed resource share to 50 for LHC and 100 for Rosetta, I seem to be having fewer crashed jobs. This morning I found seven Theory jobs running, none crashed, which is more than I remember seeing in a long time.
2) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 43189)
Posted 4 Aug 2020 by Chris Jenks
Post:
400 kilobytes/second maximum rate.
This could indeed cause trouble if you start or restart too many VMs concurrently.
The advice would be to limit the concurrently running tasks (VMs) until they run stable.

You may start at a low number and slightly increase it until you know at which number the trouble returns.
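
For anyone wanting to try the advice above, here is a minimal sketch of one way to cap concurrent tasks, using BOINC's app_config.xml mechanism. The app name "Theory" and the starting value of 2 are assumptions for illustration; check the real app name in your client_state.xml before using it. The script only builds the XML text. The file itself would go in the LHC@home project folder inside the BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks: int, app_name: str = "Theory") -> str:
    """Return app_config.xml text limiting how many tasks of one app run at once.

    app_name is an assumption here; use the name your client reports.
    """
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_tasks)
    if hasattr(ET, "indent"):  # pretty-print when available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Start low, then raise the number until the trouble returns.
print(build_app_config(2))
```

After saving the output as app_config.xml, the client has to re-read its config files (or be restarted) before the limit takes effect.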

I will scale my resource share back and hope that these bugs get fixed someday so I can return to full participation:

1. The timeout should be increased to account for multiple jobs being launched simultaneously on a DSL-connected PC.
2. When a network operation fails, it should either be re-tried or the job should close, not sit allocating a slot in BOINC's active job listing.
3. When a job is aborted in BOINC Manager, the corresponding virtual machine should eventually be removed automatically.
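
To illustrate what I mean in points 1 and 2, here is a rough sketch of retry-then-give-up behavior. It is not how the Theory VM or vboxwrapper actually works; `run_with_retry` and its parameters are made-up names. The idea is simply that a failed network operation is retried with growing pauses, and if it still fails the error is raised so the job can close instead of holding a slot.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on network-style errors with exponential backoff.

    The last failure is re-raised so the caller can end the job cleanly.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:  # covers ConnectionError and TimeoutError
            if attempt == attempts - 1:
                raise  # out of retries: fail instead of hanging
            sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s, ...
    raise AssertionError("unreachable")
```

With the defaults, a download that fails twice and then succeeds would cost two pauses of 2 and 4 seconds rather than a lost job.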

Thanks for your ideas and suggestions, everyone.
3) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 43187)
Posted 4 Aug 2020 by Chris Jenks
Post:
You are running a linux computer where SVM is disabled.
This causes all vbox tasks from LHC@home to fail.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10551677
https://lhcathome.cern.ch/lhcathome/result.php?resultid=280249201
VBoxManage: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)


I know. I tried getting VirtualBox running on an old laptop to run LHC@home jobs other than SixTrack, but got stuck and decided it wasn't worth it.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10643884
Most of the tasks from your Windows computer show this behavior:
2020-08-03 13:59:05 (28916): Detected: vboxwrapper 26197
2020-08-03 14:21:51 (28916): Stopping VM.
2020-08-03 14:25:33 (6100): Detected: vboxwrapper 26197  # This is a restart

The VM stops and restarts after just a few minutes.
This behavior is very expensive in disk activity and may also cause network issues, since the connections have to be reestablished after each restart.

The solution would be to avoid task switching and to reduce the number of concurrently running LHC@home tasks.

In addition you may check your network:
- attaching a computer via wi-fi is not recommended. At least it requires an extremely stable wi-fi channel.
- is your internet connection powerful enough to handle all the traffic?

My local configuration had tasks switching every 60 minutes. I changed this to 1,200 minutes. I noticed that if I reduce the resource share for LHC@Home so that only a few jobs download at a time, they seem to work. I also notice that when a bunch of jobs, say ten, start at the same time, the ones that are spared from crashing tend to be the first and/or the last in the list.

I've been mostly happy with the hardwired DSL my PC is using. While I agree that this does look like a network error, my download rate is about 400 kilobytes/second, which is the fastest DSL I've had from several locations in the Sacramento, California area. Once in a while the network will be slow or fail, but this is such a small minority of the time that it can't account for the constant stream of failed jobs. Maybe the error is a timeout due to a bunch of jobs trying to download something simultaneously. It would be nice if this timeout error could be handled more gracefully than wasting a CPU thread for 10 days.
4) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 43186)
Posted 4 Aug 2020 by Chris Jenks
Post:
I'm getting very good at aborting defunct jobs on BOINC manager and deleting the corresponding machines (which don't clean up by themselves) on VirtualBox.


What do you mean by "don't clean up"? I've just been aborting the BOINC tasks. I just had a look at a machine I think I've done this on, and there are some VirtualBox tasks in Windows 10 Task Manager doing nothing, but they're only using 1 MB out of 36 GB of RAM. So no big deal?

What I mean by VirtualBox not cleaning itself up is that after I abort a crashed LHC process in BOINC Manager, the corresponding machine in VirtualBox immediately changes status to Powered Off, but it stays that way indefinitely. I have to manually delete each machine to keep them from accumulating. Fortunately, both BOINC Manager and VirtualBox allow me to select a range of jobs/machines to abort/delete, making this less tedious than selecting each one at a time.
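
The manual cleanup described above could probably be scripted. This is only a sketch, built on assumptions: that the leftover machines have names starting with "boinc_" (the prefix is my guess, check yours in the VirtualBox window), and that VBoxManage is on the PATH. It uses `VBoxManage list vms`, `showvminfo --machinereadable` and `unregistervm --delete`. It defaults to a dry run because a task that BOINC has merely switched out may also show up as powered off.

```python
import re
import subprocess

# A line of `VBoxManage list vms` looks like: "name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)" \{(?P<uuid>[0-9a-fA-F-]+)\}$')


def parse_vm_list(output: str) -> list:
    """Extract (name, uuid) pairs from `VBoxManage list vms` output."""
    vms = []
    for line in output.splitlines():
        match = VM_LINE.match(line.strip())
        if match:
            vms.append((match.group("name"), match.group("uuid")))
    return vms


def is_powered_off(machine_readable_info: str) -> bool:
    """True if `showvminfo --machinereadable` output reports VMState="poweroff"."""
    return any(
        line.strip() == 'VMState="poweroff"'
        for line in machine_readable_info.splitlines()
    )


def remove_powered_off_vms(prefix: str = "boinc_", dry_run: bool = True) -> list:
    """Unregister and delete powered-off VMs whose name starts with prefix.

    The prefix is an assumption. With dry_run=True nothing is deleted;
    the function only returns the names it would remove.
    """
    listing = subprocess.run(
        ["VBoxManage", "list", "vms"],
        capture_output=True, text=True, check=True,
    ).stdout
    selected = []
    for name, uuid in parse_vm_list(listing):
        if not name.startswith(prefix):
            continue
        info = subprocess.run(
            ["VBoxManage", "showvminfo", uuid, "--machinereadable"],
            capture_output=True, text=True, check=True,
        ).stdout
        if is_powered_off(info):
            if not dry_run:
                subprocess.run(
                    ["VBoxManage", "unregistervm", uuid, "--delete"],
                    check=True,
                )
            selected.append(name)
    return selected
```

Calling `remove_powered_off_vms()` first and reading the returned names, then repeating with `dry_run=False` once the list looks right, seems the safer order. It is best done while no LHC@home tasks are waiting to resume.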
5) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 43177)
Posted 3 Aug 2020 by Chris Jenks
Post:
I'm getting very good at aborting defunct jobs in BOINC Manager and deleting the corresponding machines (which don't clean up by themselves) in VirtualBox. I wonder if I am the only one afflicted by ~80% of the jobs I receive stalling with an error? In this image I am referring to the second job, which claims to be running in BOINC Manager but isn't actually working. It is wasting a thread, and will continue to waste a thread until I manually delete the job. Not only is this tedious to keep doing, it is wasting a good fraction of my computer's processing capacity on an ongoing basis, because I usually have ten such defunct jobs at a time.



Both BOINC Manager (version 7.26.7) and 64-bit VirtualBox (version 6.1.12) are up to date, running on Windows 10 Pro. Everything is stock. I don't know what to fix to get jobs to work, assuming everybody else's jobs are working.
6) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 42600)
Posted 24 May 2020 by Chris Jenks
Post:

It is possible that my network was down at the moment my latest five jobs were issued, but not very likely. Plus it would be nice if the software would try again. So I take it I can abort these five jobs and save three days of imaginary crunching?


Are they using your CPU time in the task manager? If not, abort them. If they are, you could wait and see if a couple are long runners that finish within the time frame, but it's not likely.

Until you asked, I hadn't noticed that I could expand the BOINC tasks in Task Manager (Windows isn't my primary OS), and I see only Rosetta and WCG using my CPU:

So I will abort the LHC jobs.

Thanks for all the help.
7) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 42596)
Posted 23 May 2020 by Chris Jenks
Post:
NG, you might want to update your VB version, since I think you are still running a 2019 version and Oracle tends to do lots of updates to fix the usual problems (VirtualBox 5.2.34, released October 15 2019).


Mine (the one I use here), VirtualBox 6.1.6, isn't the newest on the list, but it works.
And I have also tested lots of them with VirtualBox 6.1.8, with no problems running Theory tasks.

https://www.virtualbox.org/wiki/Download_Old_Builds
https://www.virtualbox.org/wiki/Downloads
I even have had good luck with the Sherpa and the many other event generators.


Peter Hucker
I have always used the latest version and never had any problems.


(damn this thread is faster than the server)

Yes Peter I knew you had been up to date and that was for NOGOOD and I figured he would see that I said that to NG and go from there.

The latest version of VirtualBox available is 6.1.8. Mine is 6.1.6, but to upgrade it I would have to end my LHC jobs.

I've even wondered if the recentness of my VirtualBox is a problem, since BOINC recommends the older version they distribute with the BOINC package.

It is possible that my network was down at the moment my latest five jobs were issued, but not very likely. Plus it would be nice if the software would try again. So I take it I can abort these five jobs and save three days of imaginary crunching?
8) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 42589)
Posted 22 May 2020 by Chris Jenks
Post:
Strange. That computer has returned 25 good theories recently in just over an hour on average. It's odd that you now have 5 that are all taking that long. I usually just get the odd one that's a long runner.

Mind you, you have 11 errors returned in about 13 minutes each. Do you know what happened to them? Was the computer being rebooted at the time, or computation paused for a game, or tasks swapped to run another project?

I must admit I'm rather new to running LHC@Home on virtual machines, having started only a month ago. I have had errors since then, as my logs show. At first it was because this was a new PC needing reboots, and whenever I rebooted I would find a mess of aborted machines in VirtualBox that wouldn't clean themselves up. I seem to remember problems with the jobs that followed until I went in, manually logged the machines out and removed them all.

Even now, LHC@Home thinks I am running three ATLAS jobs I don't have, and when I look at the VM console for the first Theory job I get this:



I assume the rest are the same. Forgive my ignorance, but is there any point letting this apparently crashed process sit on my system, pretending to be using up a hyperthread? Or is the error non-fatal? The error looks like it is due to a network problem, in which case the job could complete successfully if re-run.
9) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 42587)
Posted 22 May 2020 by Chris Jenks
Post:
I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs.

I now have, for the five jobs:

Elapsed: 5d 22:33:00
Remaining: 4d 01:37:00
Deadline: 5/26/20 10:43:19 AM
It is currently 5/22/20 11:06:00 AM


Is that the total of all 5? What is each one at?


10) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 42567)
Posted 22 May 2020 by Chris Jenks
Post:
I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs.

But on top of that, with summer coming I thought I would look at restricting the computing times so my PC doesn't burn me out of the office while I am working. After a few cool hours I noticed that the deadlines seem to be set at exactly ten days, so that I can't suspend the computation without potentially finishing after the deadline.

I now have, for the five jobs:

Elapsed: 5d 22:33:00
Remaining: 4d 01:37:00
Deadline: 5/26/20 10:43:19 AM
It is currently 5/22/20 11:06:00 AM
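
As a quick check on the figures above, here is a small calculation of the slack between the time left before the deadline and the estimated remaining run time. It assumes the "Remaining" value from BOINC Manager is accurate, which it may not be, and `deadline_slack` is just an illustrative helper name.

```python
from datetime import datetime, timedelta


def deadline_slack(now: datetime, deadline: datetime, remaining: timedelta) -> timedelta:
    """Time to spare before the deadline; negative means the job would finish late."""
    return (deadline - now) - remaining


now = datetime(2020, 5, 22, 11, 6, 0)         # "It is currently" time
deadline = datetime(2020, 5, 26, 10, 43, 19)  # reported deadline
remaining = timedelta(days=4, hours=1, minutes=37)

slack = deadline_slack(now, deadline, remaining)
print(f"Slack: {slack.total_seconds() / 3600:.2f} hours")
```

With these numbers the slack comes out at roughly minus two hours, so even running nonstop the jobs would finish slightly after the deadline, and any suspension only makes that worse.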


