Message boards :
ATLAS application :
Non-zero return code from EVNTtoHITS (65) (Error code 65)
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Did you try the app_info.xml as suggested by computezrmle? That would be the first thing to implement. Also, how many other projects are you crunching simultaneously in addition to ATLAS? How many cores are occupied with BOINC tasks, and what tasks are they? Maybe try suspending all of those (if any) and see if you can get ATLAS working all by itself: just a single 2-core ATLAS task. If you can complete 4 of those in a row, each with a HITS file, then slowly, 1 core at a time, allow other tasks on the other cores. It's not just a matter of having enough RAM; there are also timing issues involved. So start with crawling, then walking, and eventually running. You might find that you cannot work all 8 cores while an ATLAS is running.
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,994,413 RAC: 136,379 |
Did you try the app_info.xml ... Needs to be clarified: app_info.xml can also be used by BOINC, but for a completely different use case. What you need here is an app_config.xml.
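[To make the distinction concrete, a minimal app_config.xml for the single 2-core ATLAS test suggested above might look like the sketch below. The tag values mirror the example posted elsewhere in this thread; the 4800 MB RAM figure is the tuning suggested there, not a hard requirement.]

```xml
<!-- app_config.xml: goes in the lhcathome project folder of the BOINC data directory.
     Sketch only: 2 cores per ATLAS VM, 4800 MB RAM, at most one LHC task at a time. -->
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>
```

After saving the file, have the client re-read it via BOINC Manager (Options -> Read config files) and check the event log for "Found app_config.xml".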
Send message Joined: 2 Sep 04 Posts: 453 Credit: 193,369,412 RAC: 10,065 |
A while ago Crystal Pellet pointed out a method for checking the success of an ATLAS job using the PandaID. When I try to use the Panda link, it always asks me for a login. Can someone provide a "how to"? Thanks in advance. Supporting BOINC, a great concept!
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,994,413 RAC: 136,379 |
You may use one of the accounts given on the login page, e.g. your google account that you use for your android smartphone. |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
When I try to use the Panda link, it always asks me for a login. Can someone provide a "how to"? It asked me to log in too, but there was a choice to log in with Facebook or Google. I don't do Facebook so I clicked Google. Then I got a box with my gmail address in the email input field and a blank password field. I entered my password, clicked "sign in" and it let me in, no verification required. I have never registered an account at that site, so I was surprised it was that easy. Hmmm, maybe it was that easy because I used the same email address I used to attach to the LHC project?
Send message Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0 |
What have you tried so far? Creating the app_config.xml as shown above, reading the config files and rebooting the system should clear up any issues. ATLAS generally doesn't like interruptions (stopping and starting), the safe bet is to set a fixed use limit instead of a flexible "when computer is in use" limit. It might be easier (at least for troubleshooting) to crunch one project and only one subproject e.g. LHC ATLAS. |
Send message Joined: 2 May 07 Posts: 2071 Credit: 156,179,585 RAC: 105,376 |
A while ago Crystal Pellet pointed out a method for checking the success of an ATLAS job using the PandaID. When you click "Panda" at the top left, you see ATLAS-Panda; then scroll down through the graphs to the end of the page.
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Did you try the app_info.xml ... My bad. Thanks for clarification. |
Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,154,428 RAC: 3,311 |
202778167 seems to have worked. Of course, I've thought that before. Panda says "finished", though. The only change I made was to allow 8 CPUs. I'm very much over my head with this, but I'm trying, and I appreciate everyone's patience. - Dick
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,994,413 RAC: 136,379 |
The final result looks good. https://lhcathome.cern.ch/lhcathome/result.php?resultid=202778167

Nonetheless, there are a few remarks:

1. The more cores you configure per VM, the less efficient each VM will be. This is a design issue. 1-core VMs work most efficiently in most cases, but if you run a couple of them concurrently they need lots of RAM. To optimise resource usage it is necessary to find your individual balance between RAM and CPU usage. 1-core and 2-core setups usually need local RAM tuning via app_config.xml (>=4800 MB).

2. "Setting CPU throttle for VM. (80%)"
Under certain circumstances (not in this case) a CPU throttle of less than 100% may cause timing errors. If possible, set a higher value. The disadvantages may be a higher system temperature and more fan noise.

3.
2018-07-22 19:04:15 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:04:26 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:05:08 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:05:19 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:12:54 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:13:05 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:14:15 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:14:26 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:15:09 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:15:20 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:15:30 (23900): Stopping VM.
2018-07-22 19:16:16 (23900): Error in stop VM for VM: -182
Command: VBoxManage -q controlvm "boinc_2c75cee3b312eaa0" savestate
Output: 0%...10%...20%...30%...40%...50%...60%...70%...
2018-07-22 19:16:16 (23900): VM did not stop when requested.
2018-07-22 19:16:16 (23900): VM was successfully terminated.
This looks suspect and close to a crash, but your VM managed to recover.

4. 2018-07-23 02:54:17 (10616): Guest Log: -rw------- 1 atlas01 atlas01 141775081 Jul 23 02:49 HITS.14568781._061245.pool.root.1
Here it is, the famous HITS file.

5.
2018-07-23 02:54:17 (10616): VM Completion File Detected.
2018-07-23 02:54:17 (10616): Powering off VM.
2018-07-23 02:54:21 (10616): Successfully stopped VM.
<rest of the log> looks good.
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
202778167 seems to have worked. Of course I've thought that before. Panda says "finished" though.
If Panda says "finished" then it worked. I want to say something like... Well done, Dick :) Perseverance pays off. ...but with the "2018-07-22 19:16:16 (23900): Error in stop VM for VM: -182" I fear you might have just got lucky with that one. It started wobbling and recovered, but next time you might not be so lucky. The 80% CPU throttle for the VM is not a good idea, and I will add that it may have caused this VM: -182 error.

The only change I made was to allow 8 CPUs
Not a good idea from an efficiency perspective, though I doubt it caused the VM: -182 error. 4 CPUs would be better; 2 should be your goal.

I'm very much over my head with this but I'm trying and I appreciate everyone's patience.
The only questions I have for you are:
1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both?
2) Did you create the app_info.xml? And if you did create it, did you verify that BOINC can find it and that it doesn't contain errors? Verify by opening BOINC Manager and clicking Options -> Read config files, then open the Event log (Tools -> Event log), scroll to the bottom and see if it says "LHC@home <date> Found app_config.xml". If it doesn't say that, then you either did not create it or you created it in the wrong folder. If it does say "Found", then see if any red text follows. If it found errors, it will show them in red. If there are no errors, then the file is syntactically correct.
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,994,413 RAC: 136,379 |
1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both? @bronco Again: Please, don't use "app_info.xml" in this context. It's "app_config.xml". |
Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,154,428 RAC: 3,311 |
1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both?
Yup, figured out about the name. I was unsure where to create the file but went back to the other threads and found it: C:\Users\All Users\BOINC\projects\lhcathome.cern.ch_lhcathome. (Windows 10 doesn't make it easy to find that sucker. "All Users" is considered a system folder and by default doesn't show up in Explorer or Command Prompt until you change the default view, although you can get to it and its children via the prompt if you know the exact path. At that point it shows up in Explorer but still not in Command Prompt. I think I'll call Bill Gates.)

So, once I did that it was found by having BOINC read the config files. The file is as shown in another thread:

<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>

I've changed the # of CPUs on the website back to 2, which you suggested as the goal.

BTW, I noticed another oddity on the LHC website that the admins there might want to know about. When I gave up on this earlier I told it no more ATLAS and forgot I'd done it. Didn't matter, I kept getting tasks. Just allowed it again. - Dick
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,994,413 RAC: 136,379 |
BTW, I noticed another oddity on the LHC website that the admins there might want to know. When I gave up on this earlier I told it no more ATLAS and forgot I'd done it. Didn't matter, I kept getting tasks. Just allowed it again. Did you set "If no work for selected applications is available, accept work from other applications?" on your web preferences page? https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project |
Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,154,428 RAC: 3,311 |
Oops. Good point. |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Again: Please, don't use "app_info.xml" in this context. It's "app_config.xml". Sorry, I did it again. Bad habits die hard. |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I was unsure where to create the file
Glad you found it. An easier way to find the path to BOINC's data directory is to look in the event log (BOINC Manager -> Tools -> Show event log). It's in the first 10 lines. If it's not, that's because the log grew long and BOINC trimmed it. In that case, just shut down the BOINC client, wait a few minutes for VBoxHeadless to shut down, restart the client and look at the log again.
Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,154,428 RAC: 3,311 |
Questions re # of CPUs: I've got the number of CPUs set to 2 both at the website and in the app_config.xml.
1) Are these settings redundant? If not, what's the difference in effect?
2) Do they limit the number of CPUs for one task or the total number of CPUs in use for all LHC tasks?
3) Do they have any effect on what else can run (i.e. tasks from other projects)?
Here's the reason I'm asking. Right at this moment I have 3 running tasks (not including non-intensive and GPU): 1 LHC and 2 from another project, all single-CPU. I have 4 ready-to-start tasks. All are LHC 2-CPU tasks: 2 ATLAS and 2 Theory. What I expected was for 2 of those 2-CPU tasks to be running, bringing the total CPUs in use to 7. I don't understand why they don't start and why they seem to be blocking other projects from running. If I suspend LHC then BOINC immediately starts polling other projects looking for work and starts those tasks up to the max of 8. Thanks, - Dick
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,994,413 RAC: 136,379 |
1) Are these settings redundant?
Not exactly. The website setting only affects multicore apps, in this case ATLAS and Theory. It configures the number of cores used in the BOINC client's reports/calculations, the working set size (which your local client needs in order to estimate whether an additional task can be started) and the RAM setting for your vbox VM. In addition, it is used to calculate the computer's GFLOPS value. Some values sent by the server (avg_ncpus, nthreads, memory_size_mb) can be overridden by an app_config.xml; others, e.g. the working set size, cannot. It's recommended to keep the website setting in sync with the app_config.xml.

2) Do they limit the number of CPUs for one task...
The setting is per task => 3 2-core tasks use a total of 6 cores.

3) Do they have any effect on what else can run...
Yes, but it's your BOINC client that keeps track of your total resources (cores, RAM, network access, ...).

I don't understand why they don't start...
Did you set any limits in your client GUI or via an app_config.xml?

If I suspend LHC then BOINC immediately starts polling other projects...
This is normal client behaviour and independent of LHC.
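[As a concrete illustration of the override behaviour described above, the three server-sent values that a local app_config.xml can replace are sketched below with comments. The numbers are placeholders taken from the example earlier in this thread; the working set size has no corresponding tag and cannot be overridden this way.]

```xml
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <!-- overrides the server-sent avg_ncpus used for the client's scheduling -->
    <avg_ncpus>2.0</avg_ncpus>
    <!-- overrides nthreads (cores inside the VM) and memory_size_mb (VM RAM) -->
    <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
  </app_version>
</app_config>
```

Keeping the website core-count preference in sync with these values avoids the client and server working from different assumptions.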
Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,154,428 RAC: 3,311 |
Thanks for the quick response, computezrmle. I've been through several cycles of reading and rereading your response, writing and rewriting mine, and experimenting a bit more. Here's where I am.

1) Are these settings redundant?
That makes sense and it's what I'd assumed. I do have them in sync.

2) Do they limit the number of CPUs for one task...
OK, if I understand correctly, that means it ought to be running however many tasks it can as long as my total number of cores (8) is not exceeded, right? In the case I described, 2 more 2-core tasks should have started, bringing the total to 7. Another single-core task could have started if I'd had one waiting.

3) Do they have any effect on what else can run...
The only app_config.xml file I have is in the LHC folder (c:\users\all users\BOINC\projects\lhcathome.cern.ch_lhcathome). I initially put it into the BOINC folder by mistake but then moved it and had the manager reread the config files. As for the GUI, the preferences are set to 100% of CPUs.

If I suspend LHC then BOINC immediately starts polling other projects...
Yup. I just included that to show it worked normally if I suspended LHC. With LHC unsuspended, BOINC not only didn't start the additional LHC tasks, it wouldn't ask for tasks from any other project even though I had 5 more cores available for work.

At this point, since I fiddled a bit to see what happens, I've got 8 tasks running from other projects: 6 NFS and 2 WCG. I suspended NFS and WCG, and my expectation was that 7 cores' worth of LHC tasks (3 2-core and 1 single-core were waiting) would start up. What happens is that 1 single-core LHC task starts up and nothing else, leaving 7 cores sitting on their hands. That's the behavior I don't understand. My little machine is not being fully utilized by BOINC in general or LHC in particular, and the latter seems to be the bottleneck for some reason. I understand that this can happen if, for example, I have a task needing 8 CPUs at the top of the priority list and waiting. BOINC will hold off on starting anything else until the high-priority task can get what it needs. Same with a 2-core task if 7 of my 8 cores are already in use. As non-technical as I am, I get that. That's not what's happening here, though. I've got idle cores crying for work! Well, actually they're not crying, just idle. I'm doing the cry^^^whining :-)
©2024 CERN