Thread 'Non-zero return code from EVNTtoHITS (65) (Error code 65)'

Author	Message
bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 35993 - Posted: 22 Jul 2018, 2:51:07 UTC - in response to Message 35992. Did you try the app_info.xml as suggested by computezrmle? That would be the first thing to implement. Also, how many other projects are you crunching simultaneously in addition to ATLAS? How many cores are simultaneously occupied with BOINC tasks and what tasks are they? Maybe try suspending all of those (if any) and see if you can get just ATLAS working all by itself, just a single 2-core ATLAS task all by itself. If you can complete 4 of those in a row, each with HITS files, then slowly, 1 core at a time, allow other tasks on the other cores. It's not just a matter of having enough RAM; there are also timing issues involved. So start with just crawling, then walking and eventually running. You might find that you cannot work all 8 cores when an ATLAS is running. ID: 35993 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,020,504 RAC: 4,919	Message 35994 - Posted: 22 Jul 2018, 5:48:18 UTC - in response to Message 35993. Did you try the app_info.xml ... Needs to be clarified: app_info.xml can also be used by BOINC but for a completely different use case. What you need here is an app_config.xml. ID: 35994 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,468,526 RAC: 133,200	Message 35995 - Posted: 22 Jul 2018, 6:41:49 UTC - in response to Message 35983. A while ago Crystal Pellet pointed out a method how you can check the success of an ATLAS job using the PandaID. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200236979 PandaID (from stderr.txt): 3994761077 Now check: https://bigpanda.cern.ch/job?pandaid=3994761077 Although some of Yeti's stderr logs were incomplete, that method shows a successful job. When I try to use the Panda link, it always asks me for a login. Can someone provide a "how to"? Thanks in advance Supporting BOINC, a great concept ! ID: 35995 · Reply Quote
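
The PandaID check described above can be sketched in a few lines of Python. This is only an illustration: the job URL pattern is the one quoted in the post, but the exact wording of the PandaID line in stderr.txt is an assumption (here: the word "PandaID" followed by digits on the same line), and the helper names `find_panda_id` and `panda_job_url` are hypothetical.

```python
import re

# URL pattern taken from the example link quoted in the post.
BIGPANDA_JOB_URL = "https://bigpanda.cern.ch/job?pandaid={panda_id}"


def find_panda_id(stderr_text):
    """Return the first PandaID found in the text as a string, or None.

    Assumption: the ID is a run of digits that follows the word "PandaID"
    on the same line; the real stderr.txt wording may differ.
    """
    match = re.search(r"PandaID[^\d\n]*(\d+)", stderr_text)
    return match.group(1) if match else None


def panda_job_url(panda_id):
    """Build the BigPanDA job page URL for a given PandaID."""
    return BIGPANDA_JOB_URL.format(panda_id=panda_id)


if __name__ == "__main__":
    sample = "PandaID (from stderr.txt): 3994761077"
    panda_id = find_panda_id(sample)
    if panda_id is not None:
        print(panda_job_url(panda_id))
```

The resulting page still requires a login, as discussed in the replies that follow.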

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,020,504 RAC: 4,919	Message 35996 - Posted: 22 Jul 2018, 7:14:15 UTC - in response to Message 35995. You may use one of the accounts given on the login page, e.g. your google account that you use for your android smartphone. ID: 35996 · Reply Quote

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 35997 - Posted: 22 Jul 2018, 7:22:36 UTC - in response to Message 35995. When I try to use the Panda link, it always asks me for a login. Can someone provide a "how to"? Thanks in advance It asked me to log in too but there was a choice to log in with Facebook or Google. I don't do Facebook so I clicked Google. Then I got a box with my gmail address in the email input field and a blank password field. I entered my password, clicked "sign in" and it let me in, no verification required. I have never registered an account at that site so I was surprised it was that easy. Hmmm, maybe it was that easy because I used the same email address I used to attach to the LHC project? ID: 35997 · Reply Quote

AuxRx Send message Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 35999 - Posted: 22 Jul 2018, 9:12:08 UTC - in response to Message 35992. What have you tried so far? Creating the app_config.xml as shown above, reading the config files and rebooting the system should clear up any issues. ATLAS generally doesn't like interruptions (stopping and starting), so the safe bet is to set a fixed usage limit instead of a flexible "when computer is in use" limit. It might be easier (at least for troubleshooting) to crunch one project and only one subproject, e.g. LHC ATLAS. ID: 35999 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2285 Credit: 178,823,324 RAC: 942	Message 36000 - Posted: 22 Jul 2018, 11:00:52 UTC - in response to Message 35995. A while ago Crystal Pellet pointed out a method how you can check the success of an ATLAS job using the PandaID. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200236979 PandaID (from stderr.txt): 3994761077 Now check: https://bigpanda.cern.ch/job?pandaid=3994761077 Although some of Yeti's stderr logs were incomplete, that method shows a successful job. When I try to use the Panda link, it always asks me for a login. Can someone provide a "how to"? Thanks in advance When you click on Panda at the top left, you then see Atlas-Panda; scroll down through the graphs to the end of the page. ID: 36000 · Reply Quote

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36001 - Posted: 22 Jul 2018, 14:40:08 UTC - in response to Message 35994. Did you try the app_info.xml ... Needs to be clarified: app_info.xml can also be used by BOINC but for a completely different use case. What you need here is an app_config.xml. My bad. Thanks for the clarification. ID: 36001 · Reply Quote

dduggan47 Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,767,629 RAC: 0	Message 36003 - Posted: 23 Jul 2018, 13:10:09 UTC - in response to Message 35992. 202778167 seems to have worked. Of course I've thought that before. Panda says "finished" though. The only change I made was to allow 8 CPU's. I'm very much over my head with this but I'm trying and I appreciate everyone's patience. - Dick ID: 36003 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,020,504 RAC: 4,919	Message 36004 - Posted: 23 Jul 2018, 13:52:01 UTC - in response to Message 36003. Final result looks good. https://lhcathome.cern.ch/lhcathome/result.php?resultid=202778167 Nonetheless there are a few remarks:
1. The more cores you configure per VM the less efficient each VM will be. This is a design issue. 1-core VMs are the most efficient in most cases but if you run a couple of them concurrently they need lots of RAM. To optimise the resource usage it is necessary to find the individual balance between RAM and CPU usage. 1-core and 2-core setups usually need a local RAM tuning via app_config.xml (>=4800MB).
2. "Setting CPU throttle for VM. (80%)" Under certain circumstances (not in this case) a CPU throttle less than 100% may cause timing errors. If possible, you may set a higher value. Disadvantages may be a higher system temperature and more fan noise.
3. [pre]2018-07-22 19:04:15 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:04:26 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:05:08 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:05:19 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:12:54 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:13:05 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:14:15 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:14:26 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:15:09 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:15:20 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:15:30 (23900): Stopping VM.
2018-07-22 19:16:16 (23900): Error in stop VM for VM: -182
    Command: VBoxManage -q controlvm "boinc_2c75cee3b312eaa0" savestate
    Output: 0%...10%...20%...30%...40%...50%...60%...70%...
2018-07-22 19:16:16 (23900): VM did not stop when requested.
2018-07-22 19:16:16 (23900): VM was successfully terminated.[/pre]
Looks suspect and close to a crash but your VM managed to recover.
4. [pre]2018-07-23 02:54:17 (10616): Guest Log: -rw------- 1 atlas01 atlas01 141775081 Jul 23 02:49 HITS.14568781._061245.pool.root.1[/pre]
Here it is, the famous HITS file.
5. [pre]2018-07-23 02:54:17 (10616): VM Completion File Detected.
2018-07-23 02:54:17 (10616): Powering off VM.
2018-07-23 02:54:21 (10616): Successfully stopped VM.
<rest of the log>[/pre]
Looks good.
ID: 36004 · Reply Quote
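
The manual reading of the log in remarks 3 to 5 can be approximated with a small Python scan. This is a rough sketch, not a tool from the project: it assumes the log lines look exactly like the excerpts quoted above, and the function name `summarize_vm_log` and the keys of the returned dictionary are made up for illustration.

```python
import re

# Patterns modelled on the log excerpts quoted in the post (assumed format).
STATE_CHANGE = re.compile(
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)
STOP_ERROR = re.compile(r"Error in stop VM for VM: (-?\d+)")
HITS_FILE = re.compile(r"\bHITS\.\S+")


def summarize_vm_log(log_text):
    """Count pauses, collect stop errors and HITS files, note completion."""
    summary = {"pauses": 0, "stop_errors": [], "hits_files": [], "completed": False}
    for line in log_text.splitlines():
        change = STATE_CHANGE.search(line)
        if change and change.group(2) == "paused":
            summary["pauses"] += 1
        error = STOP_ERROR.search(line)
        if error:
            summary["stop_errors"].append(int(error.group(1)))
        hits = HITS_FILE.search(line)
        if hits:
            summary["hits_files"].append(hits.group(0))
        if "VM Completion File Detected" in line:
            summary["completed"] = True
    return summary


if __name__ == "__main__":
    sample = (
        "2018-07-22 19:04:15 (23900): VM state change detected. "
        "(old = 'running', new = 'paused')\n"
        "2018-07-22 19:16:16 (23900): Error in stop VM for VM: -182\n"
        "2018-07-23 02:54:17 (10616): VM Completion File Detected.\n"
    )
    print(summarize_vm_log(sample))
```

A non-empty `hits_files` list together with `completed` being true corresponds to the "looks good" case; many pauses or any stop error is the "suspect" case from remark 3.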

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36005 - Posted: 23 Jul 2018, 15:40:04 UTC - in response to Message 36003. 202778167 seems to have worked. Of course I've thought that before. Panda says "finished" though. If Panda says "finished" then it worked. I want to say something like... Well done, Dick :) Perseverance pays off. ...but with the "2018-07-22 19:16:16 (23900): Error in stop VM for VM: -182" I fear you might have just got lucky with that one. It started wobbling and recovered but next time you might not be so lucky. The 80% CPU throttle for the VM is not a good idea and I will add that it may have caused this VM: -182 error. The only change I made was to allow 8 CPU's Not a good idea from an efficiency perspective though I doubt it caused the VM: -182 error. 4 CPUs would be better, 2 should be your goal. I'm very much over my head with this but I'm trying and I appreciate everyone's patience. The only question(s) I have for you are: 1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both? 2) Did you create the app_info.xml? And if you did create it then did you verify that BOINC can find it and that it doesn't contain errors? Verify by opening BOINC Manager and clicking Options -> Read config files, then open the Event log (Tools -> Event log), scroll to the bottom and see if it says "LHC@home <date> Found app_config.xml". If it doesn't say that then you either did not create it or you created it in the wrong folder. If it does say "Found" then see if any red text follows. If it found errors then it will show those errors in red. If no errors then it is syntactically correct. ID: 36005 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,020,504 RAC: 4,919	Message 36006 - Posted: 23 Jul 2018, 15:55:53 UTC - in response to Message 36005. 1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both? 2) Did you create the app_info.xml? ... @bronco Again: Please, don't use "app_info.xml" in this context. It's "app_config.xml". ID: 36006 · Reply Quote

dduggan47 Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,767,629 RAC: 0	Message 36009 - Posted: 23 Jul 2018, 17:12:17 UTC - in response to Message 36006. 1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both? 2) Did you create the app_info.xml? ... @bronco Again: Please, don't use "app_info.xml" in this context. It's "app_config.xml". Yup, figured out about the name. I was unsure where to create the file but went back to the other threads and found it, C:\Users\All Users\BOINC\projects\lhcathome.cern.ch_lhcathome. (Windows 10 doesn't make it easy to find that sucker. All Users is considered a system file and by default doesn't show up in Explorer or Command Prompt until you change the default view (although you can get to it and its children via the prompt if you know the exact path). At that point it shows up in Explorer but still not in Command Prompt. I think I'll call Bill Gates.) So, once I did that it was found by having BOINC read the config files. The file is as shown in another thread: <app_config> <app_version> <app_name>ATLAS</app_name> <plan_class>vbox64_mt_mcore_atlas</plan_class> <avg_ncpus>2.0</avg_ncpus> <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline> </app_version> <project_max_concurrent>1</project_max_concurrent> </app_config> I've changed the # of CPUs on the website back to 2 which you suggested as the goal. BTW, I noticed another oddity on the LHC website that the admins there might want to know. When I gave up on this earlier I told it no more ATLAS and forgot I'd done it. Didn't matter, I kept getting tasks. Just allowed it again. - Dick ID: 36009 · Reply Quote
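
Besides the "Read config files" check in BOINC Manager, the app_config.xml quoted above can be sanity-checked offline. The Python sketch below is an illustration under assumptions: it only parses the XML with the standard library and compares a few values, it is not how BOINC itself validates the file, and `check_app_config` is a hypothetical helper name. The XML is copied from the post, laid out one element per line.

```python
import re
import xml.etree.ElementTree as ET

# The file content quoted in the post, one element per line.
APP_CONFIG = """<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>"""


def check_app_config(xml_text):
    """Return a list of findings; raises ET.ParseError if the XML is malformed."""
    root = ET.fromstring(xml_text)
    findings = []
    for app_version in root.findall("app_version"):
        name = app_version.findtext("app_name", default="?")
        avg_ncpus = app_version.findtext("avg_ncpus")
        cmdline = app_version.findtext("cmdline", default="")
        threads = re.search(r"--nthreads\s+(\d+)", cmdline)
        # Flag a mismatch between the scheduling value and the VM thread count.
        if avg_ncpus is not None and threads is not None:
            if float(avg_ncpus) != float(threads.group(1)):
                findings.append(
                    f"{name}: avg_ncpus={avg_ncpus} but --nthreads {threads.group(1)}"
                )
    limit = root.findtext("project_max_concurrent")
    if limit is not None:
        # project_max_concurrent caps how many tasks of this project run at once.
        findings.append(
            f"project_max_concurrent={limit}: at most {limit} task(s) "
            "from this project run at the same time"
        )
    return findings


if __name__ == "__main__":
    for finding in check_app_config(APP_CONFIG):
        print(finding)
```

For the file as posted, the only finding is the `project_max_concurrent` note, which is worth keeping in mind when counting how many LHC tasks can run side by side.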

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,020,504 RAC: 4,919	Message 36010 - Posted: 23 Jul 2018, 17:37:13 UTC - in response to Message 36009. BTW, I noticed another oddity on the LHC website that the admins there might want to know. When I gave up on this earlier I told it no more ATLAS and forgot I'd done it. Didn't matter, I kept getting tasks. Just allowed it again. Did you set "If no work for selected applications is available, accept work from other applications?" on your web preferences page? https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project ID: 36010 · Reply Quote

dduggan47 Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,767,629 RAC: 0	Message 36011 - Posted: 23 Jul 2018, 17:48:53 UTC - in response to Message 36010. Oops. Good point. ID: 36011 · Reply Quote

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36013 - Posted: 23 Jul 2018, 18:14:39 UTC - in response to Message 36006. Again: Please, don't use "app_info.xml" in this context. It's "app_config.xml". Sorry, I did it again. Bad habits die hard. ID: 36013 · Reply Quote

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36014 - Posted: 23 Jul 2018, 18:29:08 UTC - in response to Message 36009. I was unsure where to create the file Glad you found it. An easier way to find the path to BOINC's data directory is to look in the event log (BOINC Manager -> Tools -> Show event log). It's in the first 10 lines. If it's not then it's because the log grew long and BOINC trimmed it. In that case just shut down the BOINC client, wait a few minutes for VBoxHeadless to shut down, restart the client and look at the log again. ID: 36014 · Reply Quote

dduggan47 Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,767,629 RAC: 0	Message 36058 - Posted: 26 Jul 2018, 13:02:41 UTC - in response to Message 36014. Questions re # CPUs: I've got the number of CPUs set to 2 at the website and in the app_config.xml. 1) Are these settings redundant? If not, what's the difference in the effect? 2) Do they limit the number of CPU's for one task or the total number of CPU's in use for all LHC tasks? 3) Do they have any effect on what else can run (i.e. tasks from other projects)? Here's the reason I'm asking. Right at this moment I have 3 running tasks (not including non-intensive and GPU), 1 LHC and 2 from another project, all single CPU. I have 4 ready to start tasks. All are LHC 2 CPU tasks, 2 ATLAS and 2 Theory. What I expected is for 2 of those 2 CPU tasks to be running to bring the total CPUs to 7. I don't understand why they don't start and why they seem to be blocking other projects from running. If I suspend LHC then BOINC immediately starts polling other projects looking for work and starts those tasks up to the max of 8. Thanks, - Dick ID: 36058 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,020,504 RAC: 4,919	Message 36059 - Posted: 26 Jul 2018, 13:53:12 UTC - in response to Message 36058. 1) Are these settings redundant? Not exactly. The website setting only affects multicore apps, in this case ATLAS and Theory. It configures the #cores to be used by the BOINC client's reports/calculations, the working set size which your local client needs to estimate if an additional task can be started and the RAM setting for your vbox VM. In addition it calculates the computer's GFLOPS value. Some values (avg_ncpus, nthreads, memory_size_mb) sent by the server can be overwritten by an app_config.xml, others not, e.g. the working set size. It's recommended to keep the website setting in sync with the app_config.xml. 2) Do they limit the number of CPU's for one task The setting is for 1 task => 3 2-core tasks use a total of 6 cores 3) Do they have any effect on what else can run Yes but it's your BOINC client that keeps track of your total resources (cores, RAM, network access, ...) I don't understand why they don't start... Did you set any limits in your client GUI or via app_config.xml? If I suspend LHC then BOINC immediately starts polling other projects ... This is normal client behaviour and independent from LHC. ID: 36059 · Reply Quote

dduggan47 Send message Joined: 1 Sep 04 Posts: 52 Credit: 11,767,629 RAC: 0	Message 36060 - Posted: 26 Jul 2018, 14:58:55 UTC - in response to Message 36059. Thanks for the quick response, computezrmle. I've been through several cycles of reading and rereading your response, writing and rewriting mine, and experimenting a bit more. Here's where I am. 1) Are these settings redundant? Not exactly. The website setting only affects multicore apps, in this case ATLAS and Theory. It configures the #cores to be used by the BOINC client's reports/calculations, the working set size which your local client needs to estimate if an additional task can be started and the RAM setting for your vbox VM. In addition it calculates the computer's GFLOPS value. Some values (avg_ncpus, nthreads, memory_size_mb) sent by the server can be overwritten by an app_config.xml, others not, e.g. the working set size. It's recommended to keep the website setting in sync with the app_config.xml. That makes sense and it's what I'd assumed. I do have them in sync. 2) Do they limit the number of CPU's for one task The setting is for 1 task => 3 2-core tasks use a total of 6 cores OK, IIRC that means it ought to be running however many tasks it can as long as my total number of cores (8) is not exceeded, right? In the case I described, 2 more 2 core tasks should have started bringing the total to 7. Another single core task could have started if I'd had one waiting. 3) Do they have any effect on what else can run Yes but it's your BOINC client that keeps track of your total resources (cores, RAM, network access, ...) I don't understand why they don't start... Did you set any limits in your client GUI or via app_config.xml? The only app_config.xml file I have is in the LHC folder (c:\users\all users\BOINC\projects\lhcathome.cern.ch_lhcathome). I initially put it into the BOINC folder by mistake but then moved it and have had the manager reread the config files. 
As for the GUI, the preferences are set to 100% of CPUs. If I suspend LHC then BOINC immediately starts polling other projects ... This is normal client behaviour and independent from LHC. Yup. I just included that to show it worked normally if I suspended LHC. With LHC unsuspended BOINC not only didn't start the additional LHC tasks, it wouldn't ask for tasks from any other project even though I had 5 more cores available for work. At this point, since I fiddled a bit to see what happens, I've got 8 tasks running from other projects, 6 NFS and 2 WCG. I suspended NFS and WCG and my expectation was that 7 cores' worth of LHC tasks (3 2 core and 1 1 core were waiting) would start up. What happens is that 1 single core LHC task starts up and nothing else, leaving 7 cores sitting on their hands. That's the behavior I don't understand. My little machine is not being fully utilized by BOINC in general or LHC in particular and the latter seems to be the bottleneck for some reason. I understand that this can happen if, for example, I have a task needing 8 CPUs on top of the priority list and waiting. BOINC will hold off on starting anything else until the high priority task can get what it needs. Same with a 2 core task if 7 of my 8 are already in use. As non-technical as I am, I get that. That's not what's happening here though. I've got idle cores crying for work! Well, actually they're not crying, just idle. I'm doing the cry^^^whining :-) ID: 36060 · Reply Quote
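
The core arithmetic discussed here ("3 2-core tasks use a total of 6 cores") can be put into a toy model. The Python sketch below is a deliberately simplified assumption, not BOINC's actual scheduler, which also weighs RAM, priorities and deadlines; `tasks_that_can_start` is a hypothetical function. It does show one thing worth checking rather than a confirmed diagnosis: the app_config.xml posted earlier in the thread contains `<project_max_concurrent>1</project_max_concurrent>`, and a limit of 1 running task per project would produce exactly the "one task starts, seven cores idle" picture.

```python
def tasks_that_can_start(total_cores, busy_cores, waiting,
                         running_from_project=0, project_max_concurrent=None):
    """Greedy toy model: start waiting tasks in order while cores are free.

    waiting is a list of per-task core counts for one project.
    project_max_concurrent mimics the app_config.xml limit of the same name;
    None means no limit. Returns the core counts of the tasks started.
    """
    free = total_cores - busy_cores
    running = running_from_project
    started = []
    for cores in waiting:
        if project_max_concurrent is not None and running >= project_max_concurrent:
            break  # the per-project limit is reached, nothing else may start
        if cores <= free:
            started.append(cores)
            free -= cores
            running += 1
    return started


if __name__ == "__main__":
    waiting = [1, 2, 2, 2]  # one 1-core and three 2-core LHC tasks
    # No limit: all four tasks fit into 8 free cores (7 cores used).
    print(tasks_that_can_start(8, 0, waiting))
    # With project_max_concurrent=1 only the first task starts.
    print(tasks_that_can_start(8, 0, waiting, project_max_concurrent=1))
```

If the limit turns out to be the cause, raising or removing `project_max_concurrent` in app_config.xml and rereading the config files would be the thing to try.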