Checklist Version 3 for Atlas@Home (and other VM-based Projects) on your PC
Author | Message |
---|---|
![]() ![]() Send message Joined: 2 Sep 04 Posts: 418 Credit: 102,794,105 RAC: 71,145 ![]() ![]() ![]() |
Running CERN VirtualBox tasks on BOINC is not so easy. You have to work out a good balance between your projects on your machine(s).

This checklist is intended to help and was first developed for Atlas. Meanwhile you can also use it for the other VM projects of LHC@Home, but memory usage and screenshots ("hardcopies") are different.

As BOINC doesn't allow us to keep the original checklist up to date, we have to start a new thread from time to time. This version has been updated with all the new information and hints we have received since the first checklist was made.

This checklist was last updated on 06.06.2017.

Because of these checklist updates, the numbering may change or may already have changed. To be sure that you point (or get pointed) to the correct detail, I suggest putting the version number of the checklist in front. So V3.P5 is Checklist 3 (this one here), Point 5.

Please check this list and be sure to check really all details, step by step; all of them are important.

* Do you use a current BOINC x64 client? At the moment 7.6.22, 7.6.33 or 7.8.3 work very well. On 09.08.2018 I started to test 7.12.1 and it seems to be fine.
* VirtualBox
  * Have you installed VirtualBox? At the moment 5.1.30 is doing very well; the Atlas team even recommends using it. Atlas has stopped working on VirtualBox 4.x.
  * Win10 users should use 5.1.16 (or higher), as the upcoming 17xx release is announced not to work with older VirtualBox versions.
  * At the moment (09.08.2018) I'm trying 5.2.16 (together with BOINC 7.12.1) and it also seems to work fine with Atlas (I'm on Win10 1803).
* Do you use Hyper-V or Docker? They will interfere with VirtualBox and cause problems. You should deactivate them or, better, uninstall them.
<p_vm_extensions_disabled>1</p_vm_extensions_disabled>

If this entry is absent or the number is 0 (zero), then all is fine. Otherwise change it to 0 (zero):

<p_vm_extensions_disabled>0</p_vm_extensions_disabled>

and save the file. Be careful to save it as a plain ASCII file. Also make sure that you have shut down your BOINC client completely before you change anything in client_state.xml; otherwise BOINC will overwrite your changes.

* Local Resources
  * Check whether you have enough RAM available for Atlas. Each single-core Atlas task needs 2.1 GB of free RAM; multi-core WUs need 3.0 GB + 0.9 GB * number of cores (last update from 01.08.2018), so 7.5 GB for a 5-core WU.
  * [Update 18.09.2018] Nowadays Atlas runs only multi-core WUs; even if you run it with 1 core only, it will need up to 3.9 GB.
  * If you have an 8-core processor but only 8 GB RAM, BOINC will try to keep all 8 cores busy. This leads to a point where one, several or all VMs get stalled with "postponed: waiting for memory ...". If you get messages like these, first try to run only 1 WU and see if this works well. If so, enable a second one and see how it works, and so on.
  * If you have "postponed: waiting for memory ..." WUs sitting in your BOINC client, you could exit BOINC and restart it after a short pause.
  * Meanwhile Atlas focuses on multi-core WUs, so one WU can use more than 1 core to crunch. Atlas can use from 1 to 8 cores per WU. You can set the number of cores in the project preferences: set "Number of Cores" as you wish. Note that this only works for newly downloaded WUs, so consider aborting already downloaded WUs.
  * Check whether you have enough disk space and have allowed BOINC to use it. You will find this in your Preferences.
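The memory rule above is easy to get wrong in your head, so here is a minimal Python sketch of it. It only encodes the numbers given in this checklist (3.0 GB + 0.9 GB per core, 1 to 8 cores per WU); the function names are made up for illustration and are not part of BOINC or Atlas, and the result is only a rough estimate of what your machine can stand.

```python
def atlas_wu_ram_gb(cores: int) -> float:
    """RAM estimate for one multi-core Atlas WU, per the checklist rule:
    3.0 GB base + 0.9 GB for each core."""
    if not 1 <= cores <= 8:
        raise ValueError("Atlas WUs use between 1 and 8 cores")
    return round(3.0 + 0.9 * cores, 1)


def max_concurrent_wus(usable_ram_gb: float, cores_per_wu: int) -> int:
    """How many WUs of this size fit into the RAM that BOINC is allowed to use
    (hypothetical helper; 'usable_ram_gb' is whatever your preferences permit)."""
    return int(usable_ram_gb // atlas_wu_ram_gb(cores_per_wu))


if __name__ == "__main__":
    for cores in (1, 5, 8):
        print(f"{cores}-core WU: about {atlas_wu_ram_gb(cores)} GB")
    # The 8-core / 8 GB machine from the checklist:
    print("1-core WUs that fit in 8 GB:", max_concurrent_wus(8.0, 1))
```

For the 8 GB example this gives 2 single-core WUs at most, which is why BOINC trying to fill all 8 cores ends in "postponed: waiting for memory ...".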
* Scenario A: Your WUs end after 10 or 20 minutes. Then something is probably still wrong, most likely on your PC or with your firewall.
* Scenario B: Your WUs run for more than 20 to 30 minutes but your CPU-Time is only 10 or 20 seconds. In this case we do not know the exact reason; in one case we could identify a faulty DNS server as the cause. You could help us to find the reason. First try a project reset of Atlas (LHC@Home). If this helped: fine, let us know! If it didn't help, consider cleaning up the installation as described in the last point.
* Scenario C: Your WUs end after several seconds, and in the logs you find something like "Error Code: ERR_CPU_VM_EXTENSIONS_DISABLED". Then you should go back to Points 4 (V3.P4) and 5 (V3.P5) above.
* Scenario D: Your WUs get stalled with "postponed: waiting for memory ...". Most of the time you have tried to start more WUs than the memory of your machine can handle. Suspend several of these WUs, exit BOINC and make sure all tasks have ended, then start BOINC again. Try to run 1 task only to see if that works, then 2, and so on. You may also want to check your memory settings at https://lhcathome.cern.ch/lhcathome/prefs.php?subset=global&cols=1. Check "memory when computer is in use".
* Scenario E: Your WU runs and runs and runs, and you are afraid you have a dead long-runner. Then you should go into the VM console (see below), click with the mouse into the console and enter a username at the login prompt. Try Atlas as username and press Enter. If you get the password prompt, all seems to be fine and the VM seems to be still alive. If you don't get the password prompt within 5 to 10 seconds, then the WU seems to have crashed and you should abort it.

------------------------------------------------------------------------

Another way to check your WU is to select the running WU in TASKS and then click the PROPERTIES button on the left side.
You will get a window similar to this:

[screenshot]

The example is a running 3-core WU. You should check:

* CPU-Time at last checkpoint
* CPU-Time
* Elapsed Time

CPU-Time should be roughly "Elapsed Time" * NumberOfCores - 15 minutes.

If CPU-Time is only something like 1 or 2 hours but your Elapsed Time is already much higher, then the WU is dead and you should abort it.
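The rule of thumb above can be written down as a small Python sketch. The formula (Elapsed Time * NumberOfCores - 15 minutes) is the one from this checklist; the 50% threshold for calling a WU "dead" is purely my assumption to make "much higher" concrete, and the function names are hypothetical.

```python
def expected_cpu_seconds(elapsed_s: float, cores: int,
                         startup_s: float = 15 * 60) -> float:
    """Checklist rule of thumb: CPU-Time ~ Elapsed Time * cores - 15 minutes."""
    return max(0.0, elapsed_s * cores - startup_s)


def looks_dead(cpu_s: float, elapsed_s: float, cores: int,
               tolerance: float = 0.5) -> bool:
    """True if CPU-Time is far below what the rule of thumb predicts.

    'tolerance' is an assumed threshold (not from the checklist): the WU is
    flagged when it has used less than this fraction of the expected CPU-Time.
    """
    return cpu_s < tolerance * expected_cpu_seconds(elapsed_s, cores)


if __name__ == "__main__":
    hour = 3600
    # 3-core WU, 10 h elapsed, only 2 h of CPU-Time: suspicious
    print(looks_dead(cpu_s=2 * hour, elapsed_s=10 * hour, cores=3))
    # Same WU with roughly 28 h of CPU-Time: healthy
    print(looks_dead(cpu_s=100_000, elapsed_s=10 * hour, cores=3))
```

Treat a positive result only as a hint to look into the VM console as described in Scenario E before you abort anything.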
* ATLAS ALT/F2:
* ATLAS ALT/F3 (~ top screen): This screen shows a running 3-core WU. Look at CPU%. [screenshot]
* ATLAS ALT/F3: example of a DEAD WU (it should run as a 1-core WU; you can see that it is really running as an 8-core WU). [screenshot]
* Theory: hardcopy follows
* CMS ALT/F1: [screenshot]
* CMS ALT/F2: [screenshot]
* CMS ALT/F3 (~ top screen): [screenshot]
* LHCb: hardcopy follows
* Alice: hardcopy follows
* Set the Atlas project / LHC@Home to "No New Tasks"
* Abort all Atlas/LHC@Home tasks in BOINC Manager
* Force BOINC to communicate with the Atlas/LHC@Home server until all tasks are gone from your task list
* Exit BOINC
* Open VirtualBox Manager and delete all VMs that are listed (be careful not to delete VMs of vLHC or CMS)
* Exit VirtualBox Manager
* Reboot your PC
![]() Supporting BOINC, a great concept ! |
![]() Send message Joined: 19 Mar 17 Posts: 1 Credit: 56,625 RAC: 0 ![]() ![]() |
Yeti, Thank you so much! I followed the steps as directed, and was quickly able to find what wasn't working for me. |
![]() ![]() Send message Joined: 2 Sep 04 Posts: 418 Credit: 102,794,105 RAC: 71,145 ![]() ![]() ![]() |
|
Send message Joined: 23 Feb 09 Posts: 2 Credit: 1,909,339 RAC: 5,668 ![]() ![]() |
@Yeti, one thing to add: if your Windows machine has Docker installed, it will break VirtualBox. |
![]() ![]() Send message Joined: 2 Sep 04 Posts: 418 Credit: 102,794,105 RAC: 71,145 ![]() ![]() ![]() |
|
Send message Joined: 27 Sep 08 Posts: 618 Credit: 386,378,705 RAC: 130,669 ![]() ![]() ![]() |
I never install the extension pack on my computers and it works fine. I agree it's useful for understanding issues if there are any. |
Send message Joined: 18 Mar 17 Posts: 1 Credit: 3,085,311 RAC: 3,151 ![]() ![]() ![]() |
Yeti, thank you very much for the detailed instructions. |
Send message Joined: 25 Mar 17 Posts: 1 Credit: 76,617 RAC: 0 ![]() ![]() |
Awesome instructions Yeti. Definitely got me up and running. One suggestion though, in step 6, I think it's important to specifically mention Computing Preferences and the memory and disk limitations configured there. These tripped me up on my dedicated folder with 32GB of RAM. The 50% limits there prevented multiple WU's and caused seemingly random behavior, when BOINC was really enforcing those limits. |
![]() ![]() Send message Joined: 2 Sep 04 Posts: 418 Credit: 102,794,105 RAC: 71,145 ![]() ![]() ![]() |
> One suggestion though, in step 6, I think it's important to specifically mention Computing Preferences and the memory and disk limitations configured there. These tripped me up on my dedicated folder with 32GB of RAM. The 50% limits there prevented multiple WU's and caused seemingly random behavior, when BOINC was really enforcing those limits.

Unfortunately I can't edit checklist V3, so I have to wait until it is time for V3.5 (or V4, who knows?). But your comments will make their way into the next version.

Thanks, Yeti

Supporting BOINC, a great concept ! |
![]() ![]() Send message Joined: 2 Sep 04 Posts: 418 Credit: 102,794,105 RAC: 71,145 ![]() ![]() ![]() |
|
Send message Joined: 22 Jun 16 Posts: 4 Credit: 986,111 RAC: 0 ![]() ![]() |
I'm trying to run Atlas on a Proliant DL580 G7 with quad E7-4870's. HT is enabled and I'm running 10 concurrent tasks. Everything seems fine. I've verified everything with the checklist. The problem is that tasks are taking 12+ hours to complete. Is this normal? Or is there a scaling issue I'm not aware of? Any advice would be appreciated. Thanks! PS: I am only running 128gb RAM. But RAM usage never exceeds 65%. |
Send message Joined: 18 Dec 15 Posts: 1322 Credit: 24,466,934 RAC: 11,050 ![]() ![]() |
> The problem is that tasks are taking 12+ hours to complete. Is this normal?

With a 2.4 GHz processor (I guess you did NOT overclock?), this seems to be the "normal" crunching time for the current ATLAS tasks. |
Send message Joined: 22 Jun 16 Posts: 4 Credit: 986,111 RAC: 0 ![]() ![]() |
One thing I didn't mention is that CPU usage never goes over 55%. I'm running Linux Mint 18.1 Cinnamon (standard desktop). On my 2P Windows 7 machines CPU usage is normally at 90% plus. I've also got RAM in eight cartridges. I'm wondering if dropping back to "single channel" four cartridges might help. Can also add 64gb additional RAM. |
Send message Joined: 18 Dec 15 Posts: 1322 Credit: 24,466,934 RAC: 11,050 ![]() ![]() |
> One thing I didn't mention is that CPU usage never goes over 55%.

This is logical if you run (only) 10 tasks with 20 CPU threads available. You could easily run more than 10 tasks. |
Send message Joined: 22 Jun 16 Posts: 4 Credit: 986,111 RAC: 0 ![]() ![]() |
I'm assuming I need to run an app_config to run more than 10. Any idea what that might be? |
Send message Joined: 18 Dec 15 Posts: 1322 Credit: 24,466,934 RAC: 11,050 ![]() ![]() |
> I'm assuming I need to run an app_config to run more than 10. Any idea what that might be?

No app_config is needed. You can set the number of tasks on your settings page (on the homepage), up to 24 tasks.

Further, if you'd like to save some RAM, you could switch to multi-core tasks, which in total need less RAM than single-core tasks. The crunching time would also decrease, of course (so a 2-core task would need about 6 hours, a 3-core task about 4 hours, ...). |
Send message Joined: 22 Jun 16 Posts: 4 Credit: 986,111 RAC: 0 ![]() ![]() |
> I'm assuming I need to run an app_config to run more than 10. Any idea what that might be?

I'm afraid changing the site settings will mess up my 2P rigs. It took me forever to get them to run correctly as it is. Which tasks are multi-core? It's apparent I need to learn a whole lot more about this. |
Send message Joined: 14 Jan 10 Posts: 996 Credit: 6,430,977 RAC: 521 ![]() ![]() |
> Which tasks are multi-core? It's apparent I need to learn a whole lot more about this.

All vbox tasks (CMS, LHCb, Theory and ATLAS) can run multi-core, but only ATLAS will use the cores for one single job, which shortens the task. The other three sub-projects load one job for every core, so when you set Max # of CPUs to 4, the created VM will run 4 jobs inside the VM. Towards the end of the task the jobs end one after another, not at the same time, so the VM will have idle cores until the last job has finished.

With ATLAS the single job uses all defined cores and the task runs faster. Be aware that the credits for multi-core are much lower than for single-core due to BOINC's credit mechanism. |
Send message Joined: 18 Dec 15 Posts: 1322 Credit: 24,466,934 RAC: 11,050 ![]() ![]() |
In other words - presently, it makes sense only to run ATLAS on multi-core. |
![]() Send message Joined: 29 Aug 05 Posts: 743 Credit: 5,688,626 RAC: 4 |
> In other words - presently, it makes sense only to run ATLAS on multi-core.

We can run CMS multi-core in -dev, but in my experience you lose efficiency with more than two jobs. This is partly because the jobs end at different times, as mentioned above, and partly because they have a staggered start, in pairs, so as not to overload the system (disk, network). When I was looking at it, the staging was at twenty-minute intervals; I don't know if this has been changed lately. |
©2021 CERN