Thread 'ATLAS/CMS Tasks all dying like flies LHC/LHC-dev both'

Author	Message
Thund3rb1rd Joined: 24 Jul 05 Posts: 17 Credit: 2,463,456 RAC: 132	Message 39584 - Posted: 12 Aug 2019, 17:17:02 UTC
All my tasks for ATLAS and CMS are dropping like flies on both LHC and LHC-dev on just one machine. The machine is Windows 10, BOINC 7.14.2 and VM 6.0.10. I posted this to the LHC-dev message board and was politely reminded to make sure VT-x was turned on, which it wasn't and which I took care of. According to Task Manager, Virtualization is now enabled. Obviously, the fault lies in my machine, Dear Horatio, but I'm darned if I know what. I read the log files, but don't understand a good deal of what I'm reading. Suggestions would be most welcome. It would be best, I suppose, if I were to stop getting LHC tasks until I can get this fixed. By the way, everything is running normally on my Win 7 machine, which has the same BOINC and VM versions.

maeax Joined: 2 May 07 Posts: 2279 Credit: 178,779,667 RAC: 338	Message 39585 - Posted: 12 Aug 2019, 17:29:28 UTC - in response to Message 39584.
Do you have Hypervisor from Microsoft enabled?
Waiting for VM "boinc_4c8ff30495f2e512" to power on...
VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) (VERR_NEM_NOT_AVAILABLE).
VBoxManage.exe: error: VT-x is disabled in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole
2019-08-11 17:49:35 (3952): VM failed to start.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1134 Credit: 11,613,188 RAC: 11,261	Message 39586 - Posted: 12 Aug 2019, 17:41:08 UTC
We did have a bit of downtime on CMS jobs (both sites) this afternoon, as a new submission regime was implemented. As far as I can see we are up again now, and my US colleague reports that his changes appear to have worked perfectly. If you still have CMS problems, please report on the dedicated message board.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 47,790	Message 39588 - Posted: 12 Aug 2019, 17:43:36 UTC - in response to Message 39584.
VirtualBox and Hyper-V are both virtualization hypervisors. They must not be activated concurrently. As LHC@home's vbox apps require VirtualBox, Hyper-V must be switched off on Windows 10 computers.
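
For readers who, like the original poster, find the log files hard to read, the error codes in the VBoxManage output quoted earlier in this thread can be picked out mechanically. The following is a minimal Python sketch, not part of BOINC or VirtualBox: the function name `diagnose` and the hint texts are assumptions made for illustration, and the hints only restate what the quoted log and the replies in this thread say.

```python
import re

# Error codes copied from the VBoxManage output quoted earlier in this thread.
# The hint texts are paraphrases of that output and of the replies here;
# they are NOT taken from official VirtualBox documentation.
KNOWN_ERRORS = {
    "VERR_VMX_MSR_ALL_VMX_DISABLED": (
        "VT-x is reported as disabled in the BIOS; enable it in the firmware setup."
    ),
    "VERR_NEM_NOT_AVAILABLE": (
        "VirtualBox reports it is not in a hypervisor partition; "
        "in the quoted log this appeared together with the VT-x error."
    ),
}

# VBoxManage prints the symbolic code in parentheses, e.g. "(VERR_NEM_NOT_AVAILABLE)".
VERR_PATTERN = re.compile(r"\((VERR_[A-Z0-9_]+)\)")


def diagnose(log_text):
    """Return (code, hint) pairs for each distinct VERR_* code, in order of appearance."""
    seen = []
    for code in VERR_PATTERN.findall(log_text):
        if code not in seen:
            seen.append(code)
    return [(code, KNOWN_ERRORS.get(code, "No hint recorded for this code.")) for code in seen]


if __name__ == "__main__":
    sample = (
        "VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) (VERR_NEM_NOT_AVAILABLE).\n"
        "VBoxManage.exe: error: VT-x is disabled in the BIOS for all CPU modes "
        "(VERR_VMX_MSR_ALL_VMX_DISABLED)\n"
    )
    for code, hint in diagnose(sample):
        print(code + ": " + hint)
```

Paste the stderr section of a failed task into `diagnose` to see which of the two known codes it contains; any other code is listed without a hint rather than guessed at.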

Thund3rb1rd Joined: 24 Jul 05 Posts: 17 Credit: 2,463,456 RAC: 132	Message 39589 - Posted: 12 Aug 2019, 17:48:17 UTC - in response to Message 39585.
The snapshot shown was taken before I fixed the BIOS. Even since then, everything has been dying.

Thund3rb1rd Joined: 24 Jul 05 Posts: 17 Credit: 2,463,456 RAC: 132	Message 39590 - Posted: 12 Aug 2019, 18:02:55 UTC - in response to Message 39588.
Hyper-V was not enabled on this machine.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 47,790	Message 39591 - Posted: 12 Aug 2019, 18:26:52 UTC
@Thund3rb1rd
According to your logfiles, ATLAS most likely failed because VT-x was disabled in your computer's BIOS. This seems to be fixed according to your message: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5111&postid=39589
In addition, this computer has had an ATLAS task "in progress" since 12 Aug 2019, 2:56:14 UTC that has not (yet?) failed.
CMS had a short outage today which caused some of your tasks over at -dev to fail (not only your tasks and not only at -dev). CMS is now working fine again, as Ivan explained, and I can confirm this from my own logs.
Are there any errors left on your Windows 10 computer?

Thund3rb1rd Joined: 24 Jul 05 Posts: 17 Credit: 2,463,456 RAC: 132	Message 39596 - Posted: 12 Aug 2019, 18:59:35 UTC - in response to Message 39591. Last modified: 12 Aug 2019, 19:01:11 UTC
Well, I don't know. I knew about (but had forgotten about) the BIOS; that's fixed. I didn't know about Hyper-V, but that wasn't enabled anyway... I'm a LOOOONG way from being competent in Windows 10. I don't know about any other errors in my setup - I simply don't know what to look for. It's likely a Win 10 issue, since my Win 7 machine seems to be working okay.
Purely as a philosophical observation, I tend to avoid projects that need a great deal of detailed tinkering with the operating system to be productive. I've already changed the profile for my Win 10 machine to not accept ATLAS or CMS tasks until this gets sorted out, just as I stopped accepting VM tasks from Cosmology@Home and dropped climateprediction.net altogether. If that makes my equipment less useful, so be it. There are millions of other folks out there in the BOINC universe to choose from.
And before I forget my manners and get a pop on the head from my Mom, thank you for your help.

BelgianEnthousiast Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0	Message 40044 - Posted: 29 Sep 2019, 12:40:46 UTC
Hi All,
I'm not sure if my problems are related, but my ATLAS WUs keep running forever, never ending (the progress counts up to 100 % but the task just keeps running on and on). I have to abort after 2 days, knowing that previously such a WU would take at most 2-6 hours.
I run Theory, CMS, SixTrack, ATLAS, etc. in parallel. Up until early September all went well, but suddenly ATLAS seems to have an issue.
Proc : Intel Core i7 - 6850 K @3.6 GHz, not overclocked
Mobo : Asus X299 Deluxe
RAM : 32 GB
Windows 10 Pro build 1809
BOINC Mgr. : 6.14.2 (x64)
VBox : 5.1.38 + associated Extension Pack
I only run single-core WUs for any of the applications, allowing a total of 9 out of 12 cores. 2 cores out of 9 are reserved for GPUGrid, or Einstein if no GPUGrid WUs are available. In terms of memory, I generally use 7 GB on average, with a maximum of 11 GB that I have seen over the years. Out of 32 GB, that should not be an issue either...
I used Yeti's checklist and all is ok. LeoMoon CPU-V indicates that VT-x is supported and enabled.
Any suggestions are very welcome !

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1495 Credit: 9,989,299 RAC: 904	Message 40046 - Posted: 29 Sep 2019, 13:06:43 UTC - in response to Message 40044. Last modified: 29 Sep 2019, 13:11:51 UTC
"Any suggestions are very welcome !"
The tasks from the latest batches are running much longer. It seems you are running your ATLAS tasks on 1 core. If that's the case, a task could even last more than 4 days.
Did you try following a task (event processing) with "Show Graphics" in BOINC Manager? The Oracle VM VirtualBox Extension Pack must be installed for that to work.
More info: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5135

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 47,790	Message 40047 - Posted: 29 Sep 2019, 13:09:25 UTC - in response to Message 40044.
"I'm not sure if my problems are related, but my Atlas WU's keep running forever, never ending. (counting up to 100 % but just keeps running on and on)"
BOINC's runtime estimation can't deal with a series of tasks whose runtimes are spread over large intervals. Hence this parameter is extremely unreliable and should be ignored.
"I have to abort after 2 days knowing that before such a WU would take at most 2-6 hours."
Each ATLAS task usually simulates 200 collision events. Depending on the input parameters, each event requires a calculation time between a few seconds and far more than 1200 s. As long as the top console shows athena.py at nearly 100% CPU usage and the logfile console shows that the last finished event (on a 1-core setup) is less than 200, there's no need to cancel the task. Just be patient and let it run.
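
The rule of thumb in the post above can be written down as a small calculation. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the 90% threshold standing in for "nearly 100%" is an arbitrary choice, and the per-event time has to be read off the task's own console because it varies widely between batches.

```python
TOTAL_EVENTS = 200           # events per ATLAS task, as stated in the post above
NEAR_FULL_CPU_PERCENT = 90   # assumed cut-off for "nearly 100%" CPU usage


def should_keep_running(athena_cpu_percent, last_finished_event, total_events=TOTAL_EVENTS):
    """Apply the rule from the post: athena.py is busy and events remain."""
    return (athena_cpu_percent >= NEAR_FULL_CPU_PERCENT
            and last_finished_event < total_events)


def remaining_hours(last_finished_event, seconds_per_event, total_events=TOTAL_EVENTS):
    """Rough remaining runtime on a 1-core setup for a given per-event time."""
    remaining_events = max(total_events - last_finished_event, 0)
    return remaining_events * seconds_per_event / 3600.0


if __name__ == "__main__":
    # Hypothetical reading: athena.py at 98% CPU, event 120 just finished.
    print(should_keep_running(98, 120))
    # Hours left if the remaining events each took 1200 s.
    print(round(remaining_hours(120, 1200), 1))
```

At 1200 s per event, a fresh 200-event task on one core would need 200 × 1200 s = 240,000 s, or roughly 67 hours. Since individual events can take far longer than that, the multi-day runtimes mentioned in this thread are plausible for a healthy task.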

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 899 Credit: 771,512,709 RAC: 179,281	Message 40056 - Posted: 30 Sep 2019, 19:10:07 UTC - in response to Message 40047.
In theory, though, the project could make a better estimate of the fpops. They could also reset the credit statistics server-side, as recommended if the job sizes change by 10x, and/or finally they could use multi-size applications so that slower computers would be given small jobs and faster ones bigger jobs. So BOINC could handle it if the project team applied the tools in BOINC fully. It's sort of strange, as SixTrack never has these problems.
I just leave them running until they pass the deadline, which is normally about 7 days, then abort them. Since the calculations inside the VM are doing some work, this seems best for the project, even though there is no credit for an aborted task.