Message boards : Theory Application : New version v262.80 for Windows 64bit

Laurence (Project administrator, Project developer)
Message 29978 - Posted: 20 Apr 2017, 13:21:18 UTC

This new version is for 64-bit Windows only. It provides a rebuilt vboxwrapper that uses the VBoxManage command to control the VMs rather than the VirtualBox API. Please let us know if you see any issues with this build.
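
For anyone curious what that change means in practice: instead of talking to VirtualBox through its API, the wrapper now shells out to the VBoxManage command-line tool for each operation. Below is a minimal Python sketch of that pattern; the VM name is hypothetical, and the real vboxwrapper is C++ in the BOINC source tree, so this only illustrates the idea.

```python
# Sketch (not the actual vboxwrapper code): controlling a VirtualBox VM
# through the VBoxManage CLI instead of the API.
import subprocess

VM_NAME = "boinc_theory_example"  # hypothetical name for illustration

def vbox(*args):
    """Run one VBoxManage subcommand and return its stdout."""
    return subprocess.run(
        ["VBoxManage", *args],
        capture_output=True, text=True, check=True,
    ).stdout

vbox("startvm", VM_NAME, "--type", "headless")   # start without a GUI window

# Poll the machine-readable state, e.g. VMState="running"
info = vbox("showvminfo", VM_NAME, "--machinereadable")
print(next(l for l in info.splitlines() if l.startswith("VMState=")))

vbox("controlvm", VM_NAME, "pause")       # suspend CPU execution
vbox("controlvm", VM_NAME, "resume")
vbox("controlvm", VM_NAME, "savestate")   # checkpoint the VM to disk and stop it
```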

Ray Murray (Volunteer moderator)
Message 29994 - Posted: 21 Apr 2017, 18:28:17 UTC
Last modified: 21 Apr 2017, 21:28:35 UTC

My Avast Antivirus (free) didn't like the new wrapper, so I lost 2 tasks to an access-denied error at startup. I already had everything BOINC-related excluded, but with a recent AV update one host kept those exclusions while the other cleared them, so I have had to put them back in again. The errors left ghosts in VBox, so I had to abort a 3rd task to clear them all out.
3 tasks running on 2 hosts now, all running jobs fine. None of them old enough to RETURN a job yet, but looking good so far after the initial (solved) start-up problem.

Later
All 3 running 262.80 VMs have completed and returned jobs and got new ones.
Standard BOINC over-estimate of runtime, which will settle down as it "learns" about the new wrapper. I'm assuming we are sticking to the 12-hour + complete-job-in-progress cut-off.
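
The "learning" here is BOINC's duration correction factor: the client scales the project's raw runtime estimate by a per-project factor that it nudges after every completed task. A simplified sketch of the idea, with assumed constants; the real client's update rule is asymmetric in roughly this way but differs in detail:

```python
# Simplified sketch of the duration-correction-factor idea (assumed
# constants, not the real BOINC rule): under-estimates are corrected
# at once, over-estimates decay gradually.
def update_dcf(dcf, raw_estimate, actual):
    ratio = actual / raw_estimate
    if ratio > dcf:
        return ratio                    # task ran longer than shown: jump up
    return 0.9 * dcf + 0.1 * ratio      # task ran shorter: drift down slowly

dcf = 2.0                               # start with a big over-estimate
raw = 6 * 3600                          # project says ~6 h per task
for actual in [5.5 * 3600] * 5:         # tasks really take ~5.5 h
    print(f"shown estimate: {raw * dcf / 3600:.2f} h")
    dcf = update_dcf(dcf, raw, actual)  # estimates settle toward 5.5 h
```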

Toby Broom (Volunteer moderator)
Message 29998 - Posted: 22 Apr 2017, 6:48:08 UTC
Last modified: 22 Apr 2017, 9:25:51 UTC

I had a ton of tasks that said "VM job unmanageable, restarting later."

Seems like these got stuck and blocked more work from being fetched, so computers were idle.

I find this to be a regression in reliability.

PHILIPPE
Message 30001 - Posted: 22 Apr 2017, 9:51:18 UTC - in response to Message 29998.  

I tried to run a 2-core WU for Theory and it succeeded:
one sherpa process and one pythia process.
I noticed that the time left within the VM for ending the sherpa job,
Event 20500 (15h 23m 21s elapsed / 2h 37m 38s left),
was higher than the time remaining for the WU in the BOINC client.
So not all of the 24,000 events set by default have been processed (only 20,500).
When a work unit in the BOINC client ends while the job inside is not finished, is the result of the partial job saved for the project?
Will the 3,500 events not done be sent to another volunteer?
Why didn't the credits earned take into account that 2 cores were used instead of one?
Are sherpa jobs only possible when several cores are used?
I never saw them with a single core...

Crystal Pellet (Volunteer moderator, Volunteer tester)
Message 30007 - Posted: 22 Apr 2017, 16:17:34 UTC - in response to Message 30001.  

I tried to run a 2-core WU for Theory and it succeeded:
one sherpa process and one pythia process.
I noticed that the time left within the VM for ending the sherpa job,
Event 20500 (15h 23m 21s elapsed / 2h 37m 38s left),
was higher than the time remaining for the WU in the BOINC client.
So not all of the 24,000 events set by default have been processed (only 20,500).
When a work unit in the BOINC client ends while the job inside is not finished, is the result of the partial job saved for the project?
Will the 3,500 events not done be sent to another volunteer?

No, the processed events are lost when the VM is shut down due to the 18-hour limit.

Why didn't the credits earned take into account that 2 cores were used instead of one?

If you're here for the credits, run only single-core tasks.
When you have enough memory, it's even more efficient because there is less idle CPU time.
Only ATLAS will run faster when using more cores.

Are sherpa jobs only possible when several cores are used?
I never saw them with a single core...

No, sherpas will also appear on single-core VMs.

PHILIPPE
Message 30008 - Posted: 22 Apr 2017, 16:25:16 UTC - in response to Message 30007.  
Last modified: 22 Apr 2017, 16:25:34 UTC

Thanks for the answers.

Magic Quantum Mechanic
Message 30012 - Posted: 22 Apr 2017, 23:29:43 UTC - in response to Message 29998.  

I had a ton of tasks that said "VM job unmanageable, restarting later."

Seems like these got stuck and blocked more work from being fetched, so computers were idle.

I find this to be a regression in reliability.


Yeah Toby, I have had a few of these and even tried to force one to restart, but I could tell it wasn't going to restart, so I aborted that one, along with the others that did the same even before starting.
Volunteer Mad Scientist For Life

Magic Quantum Mechanic
Message 30013 - Posted: 23 Apr 2017, 1:10:37 UTC

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10337530

It does it after starting and before starting.

One thing it claims is that there is not enough memory, but I watched as it happened and it had plenty of memory left, and before this it could run all 8 cores with no problems.

Crystal Pellet (Volunteer moderator, Volunteer tester)
Message 30015 - Posted: 23 Apr 2017, 7:10:14 UTC

This version suffers much more from

[LHC@home] task postponed 86400.000000 sec: Communication with VM Hypervisor failed.

probably due to VBoxSVC being too busy to respond immediately to a VBoxManage request.

You have to wait a whole day (or restart BOINC).
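
If that diagnosis is right, the failure is a single VBoxManage invocation failing or timing out while VBoxSVC is busy, after which the wrapper postpones the task for a day. A sketch of the obvious mitigation, retrying with a short backoff before giving up; this is hypothetical, not what vboxwrapper 262.80 actually does:

```python
# Hypothetical mitigation sketch: retry a VBoxManage call a few times with
# a growing delay instead of postponing the task for 86400 s on the first
# failure. Not the actual 262.80 behaviour.
import subprocess, time

def vbox_with_retry(args, attempts=5, base_delay=2.0):
    for attempt in range(attempts):
        try:
            return subprocess.run(
                ["VBoxManage", *args],
                capture_output=True, text=True, check=True, timeout=30,
            ).stdout
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            time.sleep(base_delay * (attempt + 1))   # linear backoff
    raise RuntimeError("VBoxSVC still not responding; giving up")

print(vbox_with_retry(["list", "runningvms"]))
```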

BuckeyeChuck
Message 30018 - Posted: 23 Apr 2017, 12:25:51 UTC

I presently have nine Theory Simulation 262.80 (vbox64) tasks that have postponed themselves with the same message given by others in this thread:

Postponed: VM job unmanageable, restarting later.

The run time prior to postponement varies wildly. The shortest was 14 seconds; the longest was 5 hours and 36 minutes.

All of them have an estimated completion time of at least 10 hours.

Toby Broom (Volunteer moderator)
Message 30019 - Posted: 23 Apr 2017, 12:33:16 UTC

Magic, how did you force the restart? Quit BOINC and restart?

Yeti (Volunteer moderator)
Message 30020 - Posted: 23 Apr 2017, 12:57:03 UTC - in response to Message 30018.  

I presently have nine Theory Simulation 262.80 (vbox64) tasks that have postponed themselves with the same message given by others in this thread:

Postponed: VM job unmanageable, restarting later.

You can find some information about postponed VMs in my Checklist V3 for ATLAS at No. 6 and No. 16d.

The checklist was designed for ATLAS, but regarding postponed tasks it will help for VMs in general.


Supporting BOINC, a great concept !

Magic Quantum Mechanic
Message 30023 - Posted: 23 Apr 2017, 17:54:35 UTC - in response to Message 30019.  
Last modified: 23 Apr 2017, 17:58:53 UTC

Magic, how did you force the restart? Quit BOINC and restart?


Well, I do a few things, like going to the VB manager and telling it to *start*, or I just pause everything and do a reboot.

The one I did during my previous post was the reboot version; that worked, and it restarted and finished a few hours later.

Still not sure why it is saying it is a memory problem now.

It has no problem running four of the X2 LHC-dev Theory tasks.

It still has 2.7 GB of memory available, and the times I watched the tasks here crash, it said I still had 5 GB available.

Toby Broom (Volunteer moderator)
Message 30025 - Posted: 23 Apr 2017, 20:01:12 UTC
Last modified: 23 Apr 2017, 22:25:30 UTC

Thanks, same as me.

I don't see any RAM problems, just lack of communication.

I wrote a script that restarts BOINC if more than 50% of the tasks are stuck pending. I run it every hour to fix the pendings.
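
Toby didn't post the script itself; a rough Python reconstruction of the idea might look like the sketch below. Everything in it is an assumption about his setup: that boinccmd is on the PATH, that stuck tasks can be recognised by the word "postponed" in the --get_tasks output (which varies by client version), and that BOINC runs as the Windows service named "BOINC".

```python
# Hypothetical reconstruction of a watchdog in the spirit of Toby's script:
# restart the BOINC client when more than half of the tasks look postponed.
# Assumptions: boinccmd on PATH, "postponed" appears in --get_tasks output
# for stuck tasks, and BOINC runs as the Windows service "BOINC".
import subprocess

def task_blocks():
    out = subprocess.run(
        ["boinccmd", "--get_tasks"],
        capture_output=True, text=True, check=True,
    ).stdout
    # boinccmd prints one "N) -----------" header per task
    return out.split("-----------")[1:]

tasks = task_blocks()
stuck = [t for t in tasks if "postponed" in t.lower()]

if tasks and len(stuck) / len(tasks) > 0.5:
    subprocess.run(["net", "stop", "BOINC"], check=False)
    subprocess.run(["net", "start", "BOINC"], check=True)
```

Scheduled hourly (e.g. via Windows Task Scheduler), something like this would match the behaviour he describes.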

Toby Broom (Volunteer moderator)
Message 30032 - Posted: 24 Apr 2017, 19:17:35 UTC

I think I will just set No New Work until there is a new version.

Ray Murray (Volunteer moderator)
Message 30050 - Posted: 25 Apr 2017, 19:10:04 UTC
Last modified: 25 Apr 2017, 19:22:48 UTC

Another couple of issues with 262.80:

VMs don't always abide by the 18-hour cut-off; I have 3 VMs at 23, 24 and 25 hours. 2 were reset a short time ago, as they both had loopers (that I forgot to take details of, doh!), and have started new jobs. I'll see if they terminate after those. (nope)
If not, I'll try to end them "gracefully".
The 25hr one finished its last job 9 hrs ago but has given up after a few condor write failures. A reset got it a new job.
3 successful graceful task terminations.

[The shutdown signal doesn't always properly get to VBox, although that might be my fault as I'm running BOINC 7.7.2, which hasn't been fully cleared for release yet.]

Often the progress % bears no relation to elapsed time, e.g.
7 hrs: 39%
24 hrs: 6%
25 hrs: 65%
90 mins: 8.3%
110 mins: 7.4%

[3.01 multis running single-core appear immune to these issues, although it looks like they use the same wrapper.]
Let's see what 262.90 has to offer.
Got ...80s just like others have reported in the ...90 thread.
