1) Message boards : Number crunching : ATLAS/CMS Tasks all dying like flies LHC/LHC-dev both (Message 40044)
Posted 29 Sep 2019 by BelgianEnthousiast
Post:
Hi All,

I'm not sure if my problems are related, but my Atlas WU's keep running forever, never ending.
(counting up to 100 % but just keeps running on and on)
I have to abort after 2 days knowing that before such a WU would take at most 2-6 hours.

I run Theory, CMS, SixTrackx, Atlas, etc. in parallel.

Up until early September all went well, but suddenly Atlas seems to have an issue.

Proc : Intel Core i7 - 6850 K @3.6 GHz, not overclocked
Mobo : Asus X299 Deluxe
RAM : 32 GB
Windows 10 Pro build 1809
BOINC Mgr. : 7.14.2 (x64)
VBox : 5.1.38 + associated extension Pack

I only run single core WU's for any of the applications allowing a total of 9 out of 12 cores.
2 cores out of 9 are reserved for GPU Grid or Einstein if no GPUGrid WU's available.

In terms of memory, I use in general on average 7 GB with a maximum of 11 that I have
seen over the years. Out of 32, that should not be an issue either...

I used Yeti's checklist and all is ok.
LeoMoon CPU-V indicates that VT-x is supported and enabled.

Any suggestions are very welcome !
2) Message boards : ATLAS application : Atlas tasks "Postponed: VM job unmanageable, restarting later." (Message 39125)
Posted 14 Jun 2019 by BelgianEnthousiast
Post:
Hi Toby,

Thanks for the quick response !

I understood from above discussions that apparently v5.1 would work with Atlas, so your
statement confirms that.
Which is a little bit odd if I may say so : when downloading a new version of BOINC + VBOX
you automatically get 5.2 something.

So, if I install 3 new PC's I have to install BOINC + VBOX first, then downgrade VBOX to 5.1
on all 3 PC's afterwards. That's a bit of a hassle...

Are there any plans in the relatively short term to make Atlas compatible with v5.2 ?

By the way, Oracle is already pushing to install v6.something in the meantime...
Or is the purpose to leapfrog v5.2 and support v6 straight away ?

Wish you a nice weekend !

Friendly Greetings,
K.
3) Message boards : ATLAS application : Atlas tasks "Postponed: VM job unmanageable, restarting later." (Message 39093)
Posted 10 Jun 2019 by BelgianEnthousiast
Post:
Hi Everyone,

I'm experiencing exactly the same error with the Atlas project, and with no other project.

It happens quite frequently; pausing, exiting BOINC, rebooting and restarting again
does the job. Not just on one machine, but now on 2 machines.

I'm running VBox 5.2.26 r 128414 (Qt5.6.2)
BOINC Manager 7.14.2, Widgets 3.0.1

Both machines have 32 GB of RAM, swap file of 48 GB (1.5 x RAM size).
Memory usage is peaking at 40 % of 32 GB (or around 13-14 GB of actual RAM usage).
So by far not coming near even 50 % of the allowed maximum (setting of 80 % of RAM max usage for BOINC).
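
As a quick sanity check on the figures above, here is a minimal sketch of the arithmetic. The function name is made up for illustration; the inputs are the 32 GB of RAM, the 40 % observed peak and the 80 % BOINC limit quoted in this post.

```python
def ram_headroom_gb(total_gb, peak_fraction, boinc_limit_fraction):
    """Return (peak usage, BOINC limit, remaining headroom) in GB."""
    peak_gb = total_gb * peak_fraction
    limit_gb = total_gb * boinc_limit_fraction
    return peak_gb, limit_gb, limit_gb - peak_gb


# Values taken from the post: 32 GB RAM, 40 % peak, 80 % allowed for BOINC.
peak, limit, headroom = ram_headroom_gb(32, 0.40, 0.80)
print(f"peak {peak:.1f} GB, BOINC limit {limit:.1f} GB, headroom {headroom:.1f} GB")
```

So the observed peak sits at roughly half of what BOINC is allowed to use, which is why memory pressure looks like an unlikely cause.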

It is quite annoying since it also interferes with other projects such as GPUGrid or WorldComGrid.

When reading through the above discussion, there's mention that wrapper v5.2 seems to be the cause of the issues.
Are there any plans to start to support it ? (Oracle wants me to upgrade to 5.2.30 in the meantime...)

Thanks for any updates :-)

BE.
4) Message boards : ATLAS application : Atlas-native App finished in Seconds (Message 37905)
Posted 1 Feb 2019 by BelgianEnthousiast
Post:
Hi Gunde,

It was a combination of factors in fact.

I upgraded the BIOS to the latest version to patch an Intel security bug, but at the same time, it wreaked havoc in the BIOS settings,
effectively resetting the VT-X.

In parallel, there was something wrong with the interaction between BOINC 7.14.2 and Virtual Box 6.0.4 which got screwed up by
me installing other software. (silly me...)

And in the end, BOINC seems to have a good memory and retained that VT-X was disabled in the .xml file in the BOINC_DATA directory (see details in Yeti's check list).

So, in the end, flashing back to the latest working version of the BIOS, enabling VT-X again;
removing the software I installed;
removing BOINC & Virtual Box 6
reinstalling BOINC 7.14.2, VirtualBox 5.2.26;
modifying BOINC's memory in the xml file

and finally it works again like a charm ! Only pity I lost nearly an entire month to troubleshoot the darn thing...

Which makes me pose the question : does Atlas/LHC really need VM's to run ? Why can WorldComGrid, ClimatePrediction or Rosetta run without it ?
(just asking out of pure ignorance, apologies for that !)

Nice weekend to all !

B.E.
5) Message boards : Number crunching : Checklist Version 3 for Atlas@Home (and other VM-based Projects) on your PC (Message 37903)
Posted 1 Feb 2019 by BelgianEnthousiast
Post:
Hi Yeti,

Many thanks for the check list, you've saved me twice in less than 1 month !
Nasty things BIOS upgrades and issues with the lasting memory of the client_state.xml file...

1) downgraded BIOS back to last known stable (wanted to put in last one which fixed an Intel security bug, but turned out to be a bummer),
lost the Hyper-V settings in the process.
2) downgraded Virtual Box back to version 5.2.26 from version 6.0.4 (including extension pack)
I had all WU's VM's stalled...
3) had to modify the client_state.xml back to enable vm_extensions.

Wouldn't have been able to find all these details if it weren't for your exhaustive list...

LHC seems now to run once again without issues !

Have a nice weekend !

Friendly Greetings,

K.
6) Message boards : ATLAS application : Atlas-native App finished in Seconds (Message 37793)
Posted 20 Jan 2019 by BelgianEnthousiast
Post:
Hi All,

I'm running into trouble with just any kind of LHC tasks these days.
Initially I was running 5-core applications (Theory, Atlas, and if available CMS, LHCb, etc...) without any issues for about a year, I guess, since they became available.

Then I started getting "VM unmanageable" errors on Atlas about 2-3 months ago. Restarting BOINC after a reboot solved this issue, but I had to constantly reboot my system
which is not very handy.

I uninstalled VirtualBox (version 6 at that time) and reinstalled it again to try to solve the "VM unmanageable" error, but to no avail.

I then stopped all LHC WU's, searched for any settings that could provoke an issue.
Just out of curiosity, I then enabled 7-core applications as I thought "maybe 5-core WU's carry a flaw, and 7-core don't have this flaw".

To my horror, all of the tasks errored out with the message : WU aborted, within 30 seconds of starting the WU.
But without any further information. When checking the standard log, I can't even see what the origin of the error is.

I then scaled back to 6-core WU's, still the same issue, after 30 seconds maximum, "WU aborted".

I scaled back to 5-cores. I now even have the same issue on both of them...

FYI. There's nothing wrong with my processor or memory, I tried running WorldCommunityGrid and it runs just fine on 9 cores (out of 12, 2 more cores are assigned to GPUGrid)

I then decided to go drastic and remove VirtualBox and BOINC and reinstall them, but with keeping application settings for now.

I did it, but unfortunately, simply the same errors again on LHC... "WU aborted" after 30 seconds.

Is there a way I can easily enable logging (I see a lot of generic BOINC logging capabilities, but not specific to WU's) ?

Should I remove BOINC & VB completely including existing settings ? and start all over again ?

Would appreciate your help as it's a pity wasting so much time (struggling for nearly 2 months now) ! :-)
7) Message boards : ATLAS application : Atlas tasks "Postponed: VM job unmanageable, restarting later." (Message 37768)
Posted 16 Jan 2019 by BelgianEnthousiast
Post:
Hi Toby, All,

I'm encountering the same issue for about 2 months now. (Postponed: VM job unmanageable, restarting later.)

I run BOINC version 7.14.2 (x64), widgets version 3.0.1.
VB v6.0.0 r127566.

I have 32 GB of RAM
6 core CPU with hyperthreading activated (12 virtual cores)
2 x Asus GTX 1070Ti
Ample disk space, dedicated drive to BOINC of 3 TB with 2.5 TB still unused capacity.
Win 10 Pro (10.0.17134), all latest patches installed. All drivers kept up-to-date (mobo, drives, LAN, bluetooth, WiFi, etc.).
Latest up-to-date Anti-Virus & Firewall software.

I usually see RAM usage at 12-15 GB out of 32, 15-20 GB available as reported by Win 10.

I've been running multiple projects over time simultaneously (not all of them at once of course, but a select number of them at any
one time).
On GPU level I'm mainly running GPUGrid, in absence of GPUGrid, I have MilkyWay as backup (and Einstein, but only if
MilkyWay is out of WU's)
Each GPU card gets one virtual CPU attributed.

On CPU level, I run LHC and as backup I have WorldComGrid, Rosetta, ClimatePrediction. (only one selected as backup at a time)

I allow LHC to run virtually all projects : LHCb, Atlas, Theory, CMS, etc.

The WU's run in 5-core setups and recently in 7-core setups.

Total core usage : 2 for GPU's, 7 for CPU or 9 cores in total. Adding one core for BOINC and VirtualBox, I'm running on average 10
virtual cores out of 12.

Apart from doing some mail or surfing on the internet, nothing else is happening on this system.
Average CPU load is 80 %
The system runs 24/7, 320/365 days for 4 years now.

On LHC in total, I racked up 3.537.241 credits, I modestly think you could say I'm no longer a rookie :-).

The odd thing is that this error also kicks in in the middle of the night, when I'm not active and there should therefore be ample space, processing
power and RAM available.
Unless of course I'm hacked and someone's using my rig to mine on cryptocurrencies... but didn't find any trace just yet.

So far I only spotted Atlas WU's being affected. And it errors not only at the beginning of the WU, but at any possible time (e.g. I have one now at 81.47 %)

The only way to unblock it, is to suspend all CPU & GPU activity, give BOINC the time to stop all crunching. Exit orderly from BOINC, restart BOINC
and then it works again when all WU's are activated once more. Unfortunately, quite soon afterwards, it errors out again. Sometimes still the same
WU, sometimes another one.

LHC is the only project using VB, most likely, there might be some interaction causing an issue. The odd thing is that it seems to occur on both
v6 from VB as well as v5.2.x. Moreover, BOINC being at the very latest version does not seem to offer any alleviation.

Actually, when checking VB, I get 2 instances which give an error "Inaccessible" :
Runtime error opening 'F:\BOINC DATA\slots\2\boinc_a10398caff860c3b\boinc_a10398caff860c3b.vbox' for reading: -103 (Path not found.).
F:\tinderbox\win-rel\src\VBox\Main\src-server\MachineImpl.cpp[745] (long __cdecl Machine::i_registeredInit(void)).
Result Code: E_FAIL (0x80004005)
Component: MachineWrap
Interface: IMachine {5047460a-265d-4538-b23e-ddba5fb84976}

Then I have a third instance which is "Powered Off"

And a fourth one which is Running (the current Atlas-7cores WU).
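
To spot such stale entries without opening the VirtualBox GUI, a minimal sketch could parse the text printed by `VBoxManage list vms`. This assumes that command lists each machine as a quoted name followed by a UUID in braces, and shows unreadable machines under the name `<inaccessible>`. The sample text and its UUIDs below are made up for illustration; they are not from my system.

```python
import re

# One line per VM: "name" {uuid}
LINE_RE = re.compile(r'^"(?P<name>[^"]*)"\s+\{(?P<uuid>[0-9a-fA-F-]+)\}$')


def inaccessible_vm_uuids(list_vms_output):
    """Return the UUIDs of all VMs reported as <inaccessible>."""
    uuids = []
    for line in list_vms_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("name") == "<inaccessible>":
            uuids.append(match.group("uuid"))
    return uuids


# Hypothetical sample output (names and UUIDs invented for this sketch).
SAMPLE_OUTPUT = '''"boinc_example_running" {11111111-2222-3333-4444-555555555555}
"<inaccessible>" {aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee}
'''

print(inaccessible_vm_uuids(SAMPLE_OUTPUT))
```

Each UUID found this way could then presumably be passed to `VBoxManage unregistervm` to clean up the leftover registration, but I would only try that with BOINC stopped.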

Should I reinstall VB ?

Any suggestions ?
8) Message boards : Number crunching : Downloads have stalled (Message 35775)
Posted 5 Jul 2018 by BelgianEnthousiast
Post:
To the LHC administrators :

On my machine (Win 10 Pro - see my previous post) the behaviour I observe is the following :

1. Only LHC (Atlas) WU's downloads stall, Rosetta, WorldComGrid, ClimatePrediction, GPUGrid download just fine.
2. The downloads (usually the big ones > 200 MB) start well at 5-7 Mbps but gradually degrade and at around 50-80 %
of the total filesize, the download speed decreases to zero.
3. At that point, all other downloads from LHC are blocked.
4. When de-activating network and re-activating it again, the downloads resume and if it had progressed far enough to
get the whole file in, it continues to download the smaller files as well. However, if too much of the file was left to download,
I observe exactly the same behaviour again : the download speed decreases over time to zero and stalls the download (again).
5. I also saw the same thing when actually suspending all WU's crunching, exiting BOINC and restarting it again, then resuming
all WU's afterwards.

Can you investigate whether this has to do with the BOINC manager/Windows TCP/IP stack/(T)FTP protocol or with the
file transfer software on your end please ?

I lost a whole night of crunching because of this once again... (4th or 5th time in a week)

Many thanks in advance !

B.
9) Message boards : Number crunching : Downloads have stalled (Message 35671)
Posted 28 Jun 2018 by BelgianEnthousiast
Post:
I'm sorry Nils, but this is by no means resolved.

I'm still experiencing the problem, and the fact that it's reproduced at multiple participants means that it is either related to Boinc or VirtualBox.

For information, BOINC version 7.10.2 / 3.0.1
VirtualBox 5.2.12 r 122591

I'm running Windows 10 Pro version 1709, build 16299.431, patched to the latest updates.
Motherboard Asus A-99 vII
CPU Intel Core i7-6850K @3.6 GHz
Memory Corsair 32 GB
GPU's : Asus Strix 1070Ti 8 GB and 1070 (will have 2 Ti's this weekend)

No overclocking is happening.

I've been running LHC on CPU and GPUGrid on 2 GPU's for years with few issues.
Alternatively I also run ClimatePrediction, Rosetta, WorldCommunityGrid, MilkyWay.

I dedicate 8 out of 12 of the virtual cores to Boinc (reserving 2 of 8 to GPUGrid to manage the GPU WU's),
and rest is shared by LHC and WCG, with priority 2000 to LHC and 150 to WCG to run primarily LHC.
Atlas 5 cores taking .... 5 cores and the last core being shared between LHCb, Theory, CMS and Sixtracks.

As you can see from my stats probably, this has been running quite well, racking up between 5.000 and 10.000 credits
per day.
I only have trouble uploading and downloading LHC WU's.

What I did notice, is that when I stop BOINC (after suspending active WU's of course), I exit it and start the
application again, the down & uploads suddenly work again and reach 3.5 Mbit/s speeds, finishing them in
a matter of a few tens of seconds.

Sorry to push but could you please investigate further as this is quite annoying :-S

Many thanks in advance !

K.

10) Message boards : Number crunching : Downloads have stalled (Message 35642)
Posted 25 Jun 2018 by BelgianEnthousiast
Post:
Same issue here, one Atlas item downloading at 10 Kb/sec, while other Atlas downloads are suspended in parallel....
311,75 MB of 355,60 MB (at 87,66 %) with 19h15:29 elapsed.
Next Atlas one : 211,06 MB of 302,87 (at 69,68 %) downloading at 9,97 Kbps at 06h01:51
Third Atlas one : 208,80 MB of 302,13 MB (69,11 %) at 17,3 Kbps, busy for 03h27:14...

So first unit downloading nearly a full day, then gets suspended for no reason while others download now already to nearly 70 %.
Not really logical...
Why suspend the first one at nearly 90 % ?
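
To put those numbers in perspective, here is a rough sketch that recomputes the progress and estimates the time still needed at the quoted speeds. It assumes the rates are kilobytes per second and that 1 MB = 1000 KB; if the rates are really kilobits per second, the estimates are eight times longer. The function name is made up for illustration.

```python
def download_status(done_mb, total_mb, rate_kb_per_s):
    """Return (percent complete, hours left at the given rate)."""
    percent = 100.0 * done_mb / total_mb
    remaining_kb = (total_mb - done_mb) * 1000.0  # assumption: 1 MB = 1000 KB
    hours_left = remaining_kb / rate_kb_per_s / 3600.0
    return percent, hours_left


# (done MB, total MB, rate) for the three Atlas downloads listed above.
downloads = [(311.75, 355.60, 10.0), (211.06, 302.87, 9.97), (208.80, 302.13, 17.3)]
for done, total, rate in downloads:
    percent, hours_left = download_status(done, total, rate)
    print(f"{percent:.2f} % done, about {hours_left:.1f} h left at {rate} KB/s")
```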
11) Message boards : ATLAS application : Very long tasks in the queue (Message 34784)
Posted 28 Mar 2018 by BelgianEnthousiast
Post:
Hi,

I think it's a good idea indeed, it's similar to ClimatePrediction which I also run.

However, please don't run it on Ubuntu.
I installed it on Windows 10 and it conflicted straight away with LHC and VirtualBox.

I had to uninstall Ubuntu again and my system has since become much more unstable, prompting me to reboot
my system every 3 days or otherwise risk of crashing it. Something which GPUGrid really doesn't like and results
in lost WU's (which last around 6-9 hours) and it's a shame losing those if you're at 95 %...

Thanks for taking that into consideration !
12) Message boards : Number crunching : All-out errors on LHC seemingly due to virtualbox. (Message 34276)
Posted 4 Feb 2018 by BelgianEnthousiast
Post:
Ok, that's clear, thanks for the Sunday evening response guys ! Most appreciated !
13) Message boards : Number crunching : All-out errors on LHC seemingly due to virtualbox. (Message 34271)
Posted 4 Feb 2018 by BelgianEnthousiast
Post:
Hi All, Yeti,

Big thanks for your very quick and indeed very precise support !

I followed your checklist and indeed it turned out that for some reason virtualisation got disabled.
I would suspect, as you mentioned, that it was disabled through a BIOS update which I performed just before year end.

The "<p_vm_extensions_disabled>1</p_vm_extensions_disabled>" was indeed also at "1", so put that back to "0".

Now everything works like a charm !
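
For anyone who wants to check this flag without scrolling through the file, here is a minimal sketch that works on the text of client_state.xml. The two function names are made up for illustration, and the sample fragment is invented rather than copied from a real file. Presumably the file should only be edited while the BOINC client is stopped, so that the client does not overwrite the change.

```python
import re

FLAG_RE = re.compile(
    r"<p_vm_extensions_disabled>\s*(\d)\s*</p_vm_extensions_disabled>"
)


def vm_extensions_disabled(client_state_text):
    """Return True/False for the flag value, or None if the flag is absent."""
    match = FLAG_RE.search(client_state_text)
    if match is None:
        return None
    return match.group(1) == "1"


def reset_vm_extensions_flag(client_state_text):
    """Return the text with the flag set back to 0."""
    return FLAG_RE.sub(
        "<p_vm_extensions_disabled>0</p_vm_extensions_disabled>",
        client_state_text,
    )


# Invented fragment for illustration only.
sample = "<host_info>\n  <p_vm_extensions_disabled>1</p_vm_extensions_disabled>\n</host_info>\n"
print(vm_extensions_disabled(sample))
print(vm_extensions_disabled(reset_vm_extensions_flag(sample)))
```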

Maybe a quick side question : does 17 hours as crunch time seem correct to you for CMS and Theory WU's ?

Traditionally, I saw rather 2.5 to max. 7 hours for LHC or ATLAS WU's...

Again thanks for your support !

Have a nice Sunday afternoon :-)
14) Message boards : Number crunching : All-out errors on LHC seemingly due to virtualbox. (Message 34202)
Posted 1 Feb 2018 by BelgianEnthousiast
Post:
Hi,

I've been crunching LHC for ages now and everything works fine.
Until yesterday, when I enabled not just LHC but all the other sub-projects too.
This morning - to my horror - all tasks were listed as "FAILED".

When checking the log, I got this as error messages. I haven't got a clue what they are referring to.
Can anyone help please ?

Big thanks !

31/01/2018 23:58:57 | LHC@home | Starting task Theory_17686_1517416129.405500_0
31/01/2018 23:59:01 | | Vbox app stderr indicates CPU VM extensions disabled
31/01/2018 23:59:02 | LHC@home | Computation for task Theory_17888_1497842526.551771_0 finished
31/01/2018 23:59:02 | LHC@home | Starting task LHCb_17627_1517416128.197336_0
31/01/2018 23:59:10 | LHC@home | Computation for task w15_ats2017_b1_qp_2_ats2017_b1_QP_2_IOCT_24__52__s__64.27_59.295__11_13__5__7.5_1_sixvf_boinc3983_0 finished
31/01/2018 23:59:10 | LHC@home | Starting task Theory_17820_1497842525.300976_0
31/01/2018 23:59:13 | LHC@home | Started upload of w15_ats2017_b1_qp_2_ats2017_b1_QP_2_IOCT_24__52__s__64.27_59.295__11_13__5__7.5_1_sixvf_boinc3983_0_r1915472003_0
31/01/2018 23:59:16 | | Vbox app stderr indicates CPU VM extensions disabled
31/01/2018 23:59:16 | LHC@home | Computation for task Theory_17685_1517416129.384875_0 finished
31/01/2018 23:59:16 | LHC@home | Starting task CMS_17712_1517416129.756855_0
31/01/2018 23:59:18 | | Vbox app stderr indicates CPU VM extensions disabled
31/01/2018 23:59:18 | LHC@home | Computation for task LHCb_29353_1517417334.615865_0 finished
31/01/2018 23:59:18 | LHC@home | Starting task Theory_17821_1497842525.321976_0
31/01/2018 23:59:20 | LHC@home | Finished upload of w15_ats2017_b1_qp_2_ats2017_b1_QP_2_IOCT_24__52__s__64.27_59.295__11_13__5__7.5_1_sixvf_boinc3983_0_r1915472003_0
31/01/2018 23:59:23 | | Vbox app stderr indicates CPU VM extensions disabled
31/01/2018 23:59:23 | LHC@home | Computation for task Theory_574_1517417637.429044_0 finished
31/01/2018 23:59:23 | LHC@home | Starting task CMS_29450_1517417336.133922_0
31/01/2018 23:59:34 | | Vbox app stderr indicates CPU VM extensions disabled
31/01/2018 23:59:35 | LHC@home | Computation for task Theory_17686_1517416129.405500_0 finished
31/01/2018 23:59:35 | LHC@home | Starting task LHCb_17643_1517416128.494939_0
31/01/2018 23:59:39 | | Vbox app stderr indicates CPU VM extensions disabled
31/01/2018 23:59:40 | LHC@home | Computation for task LHCb_17627_1517416128.197336_0 finished
31/01/2018 23:59:40 | LHC@home | Starting task Theory_17691_1517416129.506237_0
31/01/2018 23:59:44 | | Vbox app stderr indicates CPU VM extensions disabled
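
A log like the one above can be summarised with a short script. This is only a sketch that assumes every line has the `date time | project | message` layout shown here; the function names are made up for illustration. It counts the "CPU VM extensions disabled" warnings and measures how long each task ran between its "Starting task" and "Computation for task ... finished" lines.

```python
from datetime import datetime


def parse_line(line):
    """Split 'dd/mm/yyyy hh:mm:ss | project | message' into its three parts."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def summarize(lines):
    """Return (number of VM-extension warnings, {task name: runtime in seconds})."""
    started, runtimes, vm_warnings = {}, {}, 0
    for line in lines:
        when, _project, message = parse_line(line)
        if "CPU VM extensions disabled" in message:
            vm_warnings += 1
        elif message.startswith("Starting task "):
            started[message[len("Starting task "):]] = when
        elif message.startswith("Computation for task ") and message.endswith(" finished"):
            task = message[len("Computation for task "):-len(" finished")]
            if task in started:
                runtimes[task] = (when - started[task]).total_seconds()
    return vm_warnings, runtimes


# Three lines copied from the log above.
log = [
    "31/01/2018 23:58:57 | LHC@home | Starting task Theory_17686_1517416129.405500_0",
    "31/01/2018 23:59:01 | | Vbox app stderr indicates CPU VM extensions disabled",
    "31/01/2018 23:59:35 | LHC@home | Computation for task Theory_17686_1517416129.405500_0 finished",
]
warnings, runtimes = summarize(log)
print(warnings, runtimes)
```

Tasks that finish well under a minute right after such a warning are a strong hint that the VM never started properly.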
15) Message boards : Number crunching : A sudden huge increase in computation errors (Message 27723)
Posted 22 Mar 2016 by BelgianEnthousiast
Post:
Boinc is only using 10 out of the 12 cores to allow headroom for the VB & other tasks. I usually run 3 Climates, 4 Rosetta's and 3 Atlas'es in parallel, working just perfect for about a year now. I run GPUGrid on the GPU additionally to that.

It's just recently that suddenly my system started hanging...

On the distribution mechanism, this is most of the time the case, just when LHC pops up with a new batch of WU's, then LHC takes all resources for about a day/day and a half. I don't mind it doing that as the workload readiness tends to be intermittent. (couple of days WU's available, then two weeks nothing, etc.)
16) Message boards : Number crunching : A sudden huge increase in computation errors (Message 27720)
Posted 20 Mar 2016 by BelgianEnthousiast
Post:
Hi,

Thanks for the reply.

The BOINC mgr is at version 7.6.22 (x64) - widgets 3.0.1
The VirtualBox is at version 4.3.12 r 93733
Checking the version of VB, I noticed there's a new version. I'll install
that one already and check if it will run properly now.

In the meantime :
All was running happily for about a year now.
LHC successfully completed 16.245 WU's so far with no errors at all until last week.
Atlas successfully completed 8.455 WU's with sporadic units failing due to problems with the VirtualBox mid last year and an occasional need for a hard reboot (my wrongdoing...).

But what happened last week was really drastic. I had no way of letting BOINC
run 24/7 again, it hung repeatedly after a couple of minutes running LHC and Atlas.

I expect LHC to be the culprit as it takes full precedence over Atlas. Whenever
a batch of LHC work comes in, BOINC automatically processes LHC first.

Anyway, any ideas would be most welcome. I'll try the new VB in the meantime.

Kind Regards,

BE.
17) Message boards : Number crunching : A sudden huge increase in computation errors (Message 27716)
Posted 16 Mar 2016 by BelgianEnthousiast
Post:
Is it possible it could also hang my system ?

I have only LHC running on the CPU, in parallel with GPUGrid on the GPU.
GPU seems to work fine, but I have major issues with the LHC WU's.
At least 10 now. Never had any issues with LHC before... only one failed WU out of 15683.

I'll increase the disksize allocated to BOINC to already see if that can
remedy the problem.

Please dig further to see if this could crash a system as well apart from
failed units.

I'm running Win 7 x64 SP 1, all latest updates installed. 32 GB of RAM.
Core I7-5930 @ 3.5 GHz CPU.
I also have Atlas & Oracle Virtualbox installed.

Thanks !

BE.
18) Questions and Answers : Wish list : downloading current statistics in excel/CSV format (Message 27413)
Posted 28 Apr 2015 by BelgianEnthousiast
Post:
To the developers of LHC@home.

When starting to crunch numbers, you find yourself stuck clicking for half an hour to work your way through nearly 10.000 other crunchers, 20 at a time (per page).

This is irritating and quite time consuming (rather a loss of time...).

I understand that maybe the reasoning behind it is to stimulate people to crunch more and faster, but that would be a bit over the top, no ? ;-)

Consequently, would it be possible to have one CSV (Comma Separated Values) or excel format file on the first statistics page which contains the current status of the 10.000 first users (as the current page shows : ranking, user name, average, total and date of entry) ?

One thing I'd like to do myself is to sort e.g. per country, etc.
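
To illustrate what I have in mind, here is a small sketch of such an export, sorted per country. The column names, the country field and the sample rows are all made up for illustration; I don't know how the statistics are actually stored on the server.

```python
import csv
import io

FIELDS = ["rank", "user", "average", "total", "joined", "country"]


def export_sorted_by_country(rows):
    """Return CSV text with a header row, sorted by country and then by rank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: (r["country"], r["rank"])):
        writer.writerow(row)
    return buffer.getvalue()


# Invented sample rows, not real statistics.
rows = [
    {"rank": 1, "user": "user_a", "average": 9000.5, "total": 3000000, "joined": "2014-01-10", "country": "Germany"},
    {"rank": 2, "user": "user_b", "average": 8000.0, "total": 2500000, "joined": "2015-04-28", "country": "Belgium"},
    {"rank": 3, "user": "user_c", "average": 7000.0, "total": 2000000, "joined": "2013-06-01", "country": "Belgium"},
]
print(export_sorted_by_country(rows))
```

A file like this opens directly in Excel, so sorting on any other column afterwards is one click.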

Would be really handy and most appreciated !

Btw, happy to be part of such an amazing project and have the ability to contribute something !!!

Cheers,

K.


