ATLAS/CMS Tasks all dying like flies LHC/LHC-dev both
Author | Message |
---|---|
Joined: 24 Jul 05 Posts: 15 Credit: 2,077,593 RAC: 1,752 |
All my ATLAS and CMS tasks are dropping like flies on both LHC and LHC-dev, on just one machine. The machine runs Windows 10, BOINC 7.14.2 and VirtualBox 6.0.10. I posted this to the LHC-dev message board and was politely reminded to make sure VT-x was turned on; it wasn't, and I took care of that. According to Task Manager, virtualization is now enabled. Obviously the fault lies in my machine, dear Horatio, but I'm darned if I know what. I read the log files but don't understand a good deal of what I'm reading. Suggestions would be most welcome. I suppose it would be best to stop getting LHC tasks until I can get this fixed. By the way, everything is running normally on my Windows 7 machine, which has the same BOINC and VirtualBox versions. |
Joined: 2 May 07 Posts: 1752 Credit: 136,505,063 RAC: 28,175 |
Do you have Microsoft's hypervisor (Hyper-V) enabled? Your log shows: Waiting for VM "boinc_4c8ff30495f2e512" to power on... VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) (VERR_NEM_NOT_AVAILABLE). VBoxManage.exe: error: VT-x is disabled in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED) VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole 2019-08-11 17:49:35 (3952): VM failed to start. |
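For readers who, like the original poster, find these logs hard to parse: the lines that matter are the ones carrying VirtualBox `VERR_*` status codes. Below is a minimal Python sketch (not part of BOINC or VirtualBox) that pulls those codes out of pasted log text and attaches a one-line hint. The `HINTS` table and the `scan_log` helper are hypothetical, built only from the two messages quoted in this post.

```python
"""Pull VirtualBox VERR_* codes out of a BOINC task log (illustrative sketch)."""
import re

# Hypothetical hint table based only on the messages quoted in this thread;
# it is not an official VirtualBox error reference.
HINTS = {
    "VERR_VMX_MSR_ALL_VMX_DISABLED":
        "VT-x is disabled in the BIOS/UEFI; enable it in the firmware setup.",
    "VERR_NEM_NOT_AVAILABLE":
        "Not in a hypervisor partition; seen here alongside the VT-x error.",
}

# A status code is "VERR_" followed by upper-case letters, digits or underscores.
VERR_CODE = re.compile(r"\bVERR_[A-Z0-9_]+")


def scan_log(text: str) -> list[tuple[str, str]]:
    """Return (code, hint) for each distinct VERR_* code, in first-seen order."""
    codes = dict.fromkeys(VERR_CODE.findall(text))  # dicts keep insertion order
    return [(code, HINTS.get(code, "no hint; check the VirtualBox docs/forums"))
            for code in codes]


if __name__ == "__main__":
    log = (
        "VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) "
        "(VERR_NEM_NOT_AVAILABLE). VBoxManage.exe: error: VT-x is disabled "
        "in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED)"
    )
    for code, hint in scan_log(log):
        print(f"{code}: {hint}")
```

Unknown codes are still listed, just without a hint, so the output doubles as a short list of search terms for the VirtualBox forums.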
Joined: 29 Aug 05 Posts: 942 Credit: 6,161,855 RAC: 1,089 |
We did have a bit of downtime on CMS jobs (on both sites) this afternoon, as a new submission regime was implemented. As far as I can see we are up again now, and my US colleague reports that his changes appear to have worked perfectly. If you still have CMS problems, please report them on the dedicated message board. |
Joined: 15 Jun 08 Posts: 2182 Credit: 185,866,945 RAC: 186,985 |
VirtualBox and Hyper-V are both virtualization hypervisors, and they must not be active at the same time. Since LHC@home's vbox apps require VirtualBox, Hyper-V must be switched off (on Windows 10 computers). |
Joined: 24 Jul 05 Posts: 15 Credit: 2,077,593 RAC: 1,752 |
The log snippet shown was taken before I fixed the BIOS. Even after that, everything has been dying. |
Joined: 24 Jul 05 Posts: 15 Credit: 2,077,593 RAC: 1,752 |
Hyper-V was not enabled on this machine. |
Joined: 15 Jun 08 Posts: 2182 Credit: 185,866,945 RAC: 186,985 |
@Thund3rb1rd According to your log files, ATLAS most likely failed because VT-x was disabled in your computer's BIOS. According to your message, this seems to be fixed: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5111&postid=39589 In addition, this computer has an ATLAS task that has been "in progress" since 12 Aug 2019, 2:56:14 UTC and has not (yet?) failed. CMS had a short outage today, which caused some of your tasks over at -dev to fail (not only your tasks, and not only at -dev). As Ivan explained, CMS is working fine again, and my own logs confirm this. Are there any errors left on your Windows 10 computer? |
Joined: 24 Jul 05 Posts: 15 Credit: 2,077,593 RAC: 1,752 |
Well, I don't know. I knew about (but had forgotten about) the BIOS; that's fixed. I didn't know about Hyper-V, but that wasn't enabled anyway. I'm a LOOOONG way from being competent in Windows 10. I don't know about any other errors in my setup; I simply don't know what to look for. It's likely a Win 10 issue, since my Win 7 machine seems to be working okay. Purely as a philosophical observation, I tend to avoid projects that need a great deal of detailed tinkering with the operating system to be productive. I've already changed the profile for my Win 10 machine to not accept ATLAS or CMS tasks until this gets sorted out, just as I stopped accepting VM tasks from Cosmology@Home and dropped climateprediction.net altogether. If that makes my equipment less useful, so be it. There are millions of other folks out there in the BOINC universe to choose from. And before I forget my manners and get a pop on the head from my Mom, thank you for your help. |
Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0 |
Hi all, I'm not sure if my problems are related, but my ATLAS WUs keep running forever (the progress counts up to 100%, but the task just keeps running on and on). I have to abort them after 2 days, whereas such a WU used to take 2-6 hours at most. I run Theory, CMS, SixTrack, ATLAS, etc. in parallel. Up until early September all went well, but suddenly ATLAS seems to have an issue. CPU: Intel Core i7-6850K @ 3.6 GHz, not overclocked; motherboard: Asus X299 Deluxe; RAM: 32 GB; OS: Windows 10 Pro build 1809; BOINC Manager: 6.14.2 (x64); VirtualBox: 5.1.38 plus the matching Extension Pack. I only run single-core WUs for all applications, allowing a total of 9 of 12 cores. Two of those 9 cores are reserved for GPUGrid, or for Einstein if no GPUGrid WUs are available. Memory use averages about 7 GB, with a maximum of 11 GB that I have seen over the years; out of 32 GB, that should not be an issue either. I went through Yeti's checklist and all is OK. LeoMoon CPU-V indicates that VT-x is supported and enabled. Any suggestions are very welcome! |
Joined: 14 Jan 10 Posts: 1171 Credit: 7,316,931 RAC: 10,655 |
"Any suggestions are very welcome!" The tasks from the latest batches run much longer. It seems you are running your ATLAS tasks on 1 core; if that's the case, a task could even last more than 4 days. Have you tried following a task (event processing) with "Show Graphics" in BOINC Manager? The Oracle VM VirtualBox Extension Pack should be installed for that to work. More info: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5135 |
Joined: 15 Jun 08 Posts: 2182 Credit: 185,866,945 RAC: 186,985 |
"I'm not sure if my problems are related, but my Atlas WU's keep running forever, never ending." BOINC's runtime estimation can't deal with a series of tasks whose runtimes are spread over large intervals, so this estimate is extremely unreliable and should be ignored. "I have to abort after 2 days knowing that before such a WU would take at most 2-6 hours." Each ATLAS task usually simulates 200 collision events. Depending on the input parameters, each event requires between a few seconds and far more than 1200 s of calculation time. As long as the top console shows athena.py at nearly 100% CPU usage, and the log-file console shows that the last finished event (on a 1-core setup) is below 200, there's no need to cancel the task. Just be patient and let it run. |
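These numbers explain why two days is not long enough to call a task stuck: at 1200 s per event, 200 events on a single core already add up to 200 × 1200 s = 240,000 s, roughly 67 hours or 2.8 days. As a minimal Python sketch (not an ATLAS or BOINC tool), the rule of thumb in this reply can be combined with a linear extrapolation of the remaining time. The helper names, the 90% threshold standing in for "nearly 100%", and the assumption that the remaining events take as long as the finished ones are all illustrative assumptions; per-event times vary widely, so treat the estimate as rough.

```python
"""Rough progress check for a 1-core ATLAS task (illustrative sketch only)."""

EVENTS_PER_TASK = 200  # ATLAS tasks "usually" simulate 200 events, per the post


def remaining_hours(elapsed_s: float, events_done: int,
                    total_events: int = EVENTS_PER_TASK) -> float:
    """Linearly extrapolate remaining wall time, assuming the unfinished
    events take as long on average as the finished ones (often untrue)."""
    if events_done <= 0:
        raise ValueError("need at least one finished event to extrapolate")
    per_event_s = elapsed_s / events_done
    return per_event_s * (total_events - events_done) / 3600


def worth_waiting(athena_cpu_pct: float, events_done: int,
                  total_events: int = EVENTS_PER_TASK,
                  busy_pct: float = 90.0) -> bool:
    """Rule of thumb from the post: keep the task while athena.py is busy
    and fewer than total_events events have finished.
    busy_pct is a hypothetical stand-in for "nearly 100%"."""
    return athena_cpu_pct >= busy_pct and events_done < total_events


if __name__ == "__main__":
    # 200 events at 1200 s each on 1 core (per-event times can be far longer).
    print(f"{EVENTS_PER_TASK * 1200 / 3600:.1f} h")  # 66.7 h
    # A task 48 h in with 100 of 200 events done: about 48 h to go.
    print(remaining_hours(48 * 3600, 100))  # 48.0
    print(worth_waiting(athena_cpu_pct=98, events_done=100))  # True
```

Reading the two inputs off the VM consoles (athena.py's CPU share in the top console, the last finished event in the log console) is still a manual step; the functions only do the arithmetic.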
Joined: 27 Sep 08 Posts: 751 Credit: 570,942,136 RAC: 99,465 |
In theory, though, the project could make a better estimate of the fpops. They could also reset the credit statistics server-side, as recommended when job sizes change by 10x, and/or they could use multi-size applications so that slower computers are given smaller jobs and faster ones bigger jobs. So BOINC could handle it if the project team applied BOINC's tools fully. It's sort of strange, though; SixTrack never seems to have these problems. I just leave them running until they pass the deadline, which is normally about 7 days, and then abort them. Since the calculations inside the VM are doing some work, that seems best for the project, even though there is no credit for an aborted task. |
©2023 CERN