Message boards : Number crunching : ATLAS/CMS Tasks all dying like flies LHC/LHC-dev both
Thund3rb1rd

Joined: 24 Jul 05
Posts: 17
Credit: 2,342,183
RAC: 261
Message 39584 - Posted: 12 Aug 2019, 17:17:02 UTC

All my tasks for ATLAS and CMS are dropping like flies on both LHC and LHC-dev on just one machine.

The machine is Windows 10, BOINC 7.14.2 and VM 6.0.10.

I posted this to the LHC-dev message board and was politely reminded to make sure VT-x was turned on. It wasn't, so I took care of that. According to Task Manager, Virtualization is now enabled.

Obviously, the fault lies in my machine, Dear Horatio, but I'm darned if I know what. I read the log files, but don't understand a good deal of what I'm reading.

Suggestions would be most welcome.

It would be best, I suppose, if I were to stop getting LHC tasks until I can get this fixed.

By the way, everything is running normally on my Win 7 machine which has the same BOINC and VM versions.
ID: 39584
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,144,248
RAC: 105,364
Message 39585 - Posted: 12 Aug 2019, 17:29:28 UTC - in response to Message 39584.  

Do you have Hypervisor from Microsoft enabled?

Waiting for VM "boinc_4c8ff30495f2e512" to power on...
VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) (VERR_NEM_NOT_AVAILABLE).
VBoxManage.exe: error: VT-x is disabled in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole

2019-08-11 17:49:35 (3952): VM failed to start.
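
For anyone who, like the original poster, finds these logs hard to read, here is a minimal sketch that pulls the `VERR_*` codes out of a log excerpt and attaches a plain-language hint. The hint table is hypothetical: the codes are copied from the log above, but the explanations are only this thread's reading of them, not an official VirtualBox mapping.

```python
import re

# Hypothetical hint table. The codes come from the log excerpt above;
# the explanations are an interpretation, not official VirtualBox text.
HINTS = {
    "VERR_NEM_NOT_AVAILABLE": "VirtualBox could not use the Windows hypervisor as a fallback",
    "VERR_VMX_MSR_ALL_VMX_DISABLED": "VT-x is disabled in the BIOS/UEFI setup",
}

# VirtualBox prints its status codes in parentheses, e.g. "(VERR_SOMETHING)".
ERROR_CODE = re.compile(r"\((VERR_[A-Z0-9_]+)\)")


def triage(log_text):
    """Return (code, hint) pairs for each distinct VERR_* code, in order of appearance."""
    found = []
    for code in ERROR_CODE.findall(log_text):
        hint = HINTS.get(code, "unknown code, check the VirtualBox documentation")
        if (code, hint) not in found:
            found.append((code, hint))
    return found


sample = """VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) (VERR_NEM_NOT_AVAILABLE).
VBoxManage.exe: error: VT-x is disabled in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole"""

for code, hint in triage(sample):
    print(code, "->", hint)
```

The second code is the decisive one in this excerpt: it says VT-x was still off in the BIOS when the task ran.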
ID: 39585
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 39586 - Posted: 12 Aug 2019, 17:41:08 UTC

We did have a bit of downtime on CMS jobs (both sites) this afternoon, as a new submission regime was implemented. As far as I can see we are up again now, and my US colleague reports that his changes appear to have worked perfectly. If you still have CMS problems, please report on the dedicated message board.
ID: 39586
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,955,184
RAC: 136,929
Message 39588 - Posted: 12 Aug 2019, 17:43:36 UTC - in response to Message 39584.  

VirtualBox as well as Hyper-V are both virtualization hypervisors.
They must not be activated concurrently.

As LHC@home's vbox apps require VirtualBox, Hyper-V must be switched off (Windows 10 computers).
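
One way to check whether a hypervisor is already active on Windows is to look at the output of the `systeminfo` command. The sketch below assumes an English-language Windows installation, where `systeminfo` reports "A hypervisor has been detected" in its Hyper-V Requirements section when one is running. The helper names are made up for illustration, and the marker text may differ on localized systems.

```python
import subprocess

# Assumption: English Windows prints this phrase under "Hyper-V Requirements"
# when a hypervisor (for example Hyper-V) is already running.
MARKER = "a hypervisor has been detected"


def hypervisor_detected(systeminfo_text):
    """Return True if the systeminfo output says a hypervisor is active."""
    return MARKER in systeminfo_text.lower()


def read_systeminfo():
    """Run the Windows 'systeminfo' command and return its text output."""
    completed = subprocess.run(
        ["systeminfo"], capture_output=True, text=True, check=True
    )
    return completed.stdout
```

On the Windows 10 machine itself, `hypervisor_detected(read_systeminfo())` returning `True` would suggest that Hyper-V (or another hypervisor) is active and may conflict with VirtualBox.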
ID: 39588
Thund3rb1rd

Joined: 24 Jul 05
Posts: 17
Credit: 2,342,183
RAC: 261
Message 39589 - Posted: 12 Aug 2019, 17:48:17 UTC - in response to Message 39585.  

The snapshot shown was taken before I fixed the BIOS.

Even since then, everything has been dying.
ID: 39589
Thund3rb1rd

Joined: 24 Jul 05
Posts: 17
Credit: 2,342,183
RAC: 261
Message 39590 - Posted: 12 Aug 2019, 18:02:55 UTC - in response to Message 39588.  

Hyper-V was not enabled on this machine.
ID: 39590
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,955,184
RAC: 136,929
Message 39591 - Posted: 12 Aug 2019, 18:26:52 UTC

@Thund3rb1rd

According to your logfiles ATLAS did most likely fail because of the missing VT-x in your computer's BIOS.
This seems to be fixed according to your message:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5111&postid=39589

In addition this computer has an ATLAS task "in progress" since 12 Aug 2019, 2:56:14 UTC that has not (yet?) failed.


CMS had a short outage today which caused some of your tasks over at -dev to fail (not only your tasks and not only at -dev).
CMS is now working fine again as Ivan explained and I can confirm this from my own logs.


Are there any errors left on your Windows 10 computer?
ID: 39591
Thund3rb1rd

Joined: 24 Jul 05
Posts: 17
Credit: 2,342,183
RAC: 261
Message 39596 - Posted: 12 Aug 2019, 18:59:35 UTC - in response to Message 39591.  
Last modified: 12 Aug 2019, 19:01:11 UTC

Well, I don't know. I knew about (but had forgotten about) the BIOS; that's fixed.

I didn't know about Hyper-V, but that wasn't enabled anyway... I'm a LOOOONG way from being competent in Windows 10.

I don't know about any other errors in my setup - I simply don't know what to look for. It's likely a Win 10 issue since my Win 7 machine seems to be working okay.

Purely as a philosophical observation, I tend to avoid projects that need a great deal of detailed tinkering with the operating system to be productive. I've already changed the profile for my Win 10 machine to not accept ATLAS or CMS tasks until this gets sorted out, just as I stopped accepting VM tasks from cosmology@home and dropped climateprediction@home altogether. If that makes my equipment less useful, so be it. There are millions of other folks out there in the BOINC universe to choose from.

And before I forget my manners and get a pop on the head from my Mom, thank you for your help.
ID: 39596
BelgianEnthousiast

Joined: 5 Apr 15
Posts: 18
Credit: 5,910,849
RAC: 0
Message 40044 - Posted: 29 Sep 2019, 12:40:46 UTC

Hi All,

I'm not sure if my problems are related, but my Atlas WU's keep running forever, never ending.
(counting up to 100 % but just keeps running on and on)
I have to abort after 2 days knowing that before such a WU would take at most 2-6 hours.

I run Theory, CMS, SixTrack, ATLAS, etc. in parallel.

Up until early September all went well, but suddenly ATLAS seems to have an issue.

Proc : Intel Core i7 - 6850 K @3.6 GHz, not overclocked
Mobo : Asus X299 Deluxe
RAM : 32 GB
Windows 10 Pro build 1809
BOINC Mgr. : 6.14.2 (x64)
VBox : 5.1.38 + associated extension Pack

I only run single core WU's for any of the applications allowing a total of 9 out of 12 cores.
2 cores out of 9 are reserved for GPU Grid or Einstein if no GPUGrid WU's available.

In terms of memory, I use in general on average 7 GB with a maximum of 11 that I have
seen over the years. Out of 32, that should not be an issue either...

I used Yeti's checklist and all is ok.
LeoMoon CPU-V indicates that VT-x is supported and enabled.

Any suggestions are very welcome !
ID: 40044
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 40046 - Posted: 29 Sep 2019, 13:06:43 UTC - in response to Message 40044.  
Last modified: 29 Sep 2019, 13:11:51 UTC

Any suggestions are very welcome !
The tasks from the latest batches are running much longer. It seems you are running your ATLAS-tasks on 1 core.
If that's the case, a task could even last more than 4 days. Did you try following a task (event processing) with "Show Graphics" in BOINC Manager?
Oracle VM VirtualBox Extension Pack should be installed for that to work.

More info: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5135
ID: 40046
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,955,184
RAC: 136,929
Message 40047 - Posted: 29 Sep 2019, 13:09:25 UTC - in response to Message 40044.  

I'm not sure if my problems are related, but my Atlas WU's keep running forever, never ending.
(counting up to 100 % but just keeps running on and on)

BOINC's runtime estimation can't deal with a series of tasks whose runtimes are spread over large intervals.
Hence the progress percentage is extremely unreliable and should be ignored.


I have to abort after 2 days knowing that before such a WU would take at most 2-6 hours.

Each ATLAS task usually simulates 200 collision events.
Depending on the input parameters, each event requires a calculation time between a few seconds and far more than 1200 s.

As long as the top console shows athena.py at nearly 100% CPU usage and the logfile console shows that the number of the last finished event (on a 1-core setup) is below 200, there's no need to cancel the task.
Just be patient and let it run.
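
To put rough numbers on this, here is a small sketch of the arithmetic. It assumes the 200 events per task mentioned above and extrapolates linearly from the events finished so far. The function names are made up for illustration, and because event times vary widely the result is only a ballpark figure.

```python
TOTAL_EVENTS = 200  # typical value quoted in the post above, not guaranteed


def hours_at_rate(seconds_per_event, total_events=TOTAL_EVENTS):
    """Total task runtime in hours if every event took the same time (1 core)."""
    return seconds_per_event * total_events / 3600


def estimate_remaining_seconds(events_done, elapsed_seconds, total_events=TOTAL_EVENTS):
    """Linear extrapolation of the remaining runtime from the progress so far."""
    if events_done <= 0:
        raise ValueError("need at least one finished event to extrapolate")
    per_event = elapsed_seconds / events_done
    remaining_events = max(total_events - events_done, 0)
    return per_event * remaining_events


# 200 events at 1200 s each on a single core:
print(round(hours_at_rate(1200), 1), "hours")

# 50 events finished after 10 hours of runtime:
print(estimate_remaining_seconds(50, 10 * 3600) / 3600, "hours remaining")
```

At 1200 s per event a single-core task already needs almost three days, so a task that has run for two days without finishing is not necessarily stuck.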
ID: 40047
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,770,056
RAC: 231,829
Message 40056 - Posted: 30 Sep 2019, 19:10:07 UTC - in response to Message 40047.  

In theory the project could make a better estimate of the fpops. They could also reset the credit statistics server-side, as recommended when job sizes change by 10x, and/or use multi-size applications so that slower computers would be given small jobs and faster ones bigger jobs.

So BOINC could handle it if the project team fully applied the tools BOINC provides.

It's sort of strange, as SixTrack never has these problems.

I just leave them running until they pass the deadline, which is normally about 7 days, and then abort them. Since the calculations inside the VM are doing some work, that seems best for the project, even though there is no credit for an aborted task.
ID: 40056



©2024 CERN