Message boards : ATLAS application : Non-zero return code from EVNTtoHITS (65) (Error code 65)

bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35993 - Posted: 22 Jul 2018, 2:51:07 UTC - in response to Message 35992.  

Did you try the app_info.xml as suggested by computezrmle? That would be the first thing to implement.

Also, how many other projects are you crunching simultaneously in addition to ATLAS? How many cores are simultaneously occupied with BOINC tasks and what tasks are they? Maybe try suspending all of those (if any) and see if you can get just ATLAS working all by itself, just a single 2-core ATLAS task all by itself. If you can complete 4 of those in a row each with HITS files then slowly, 1 core at a time, allow other tasks on the other cores. It's not just a matter of having enough RAM, there are also timing issues involved. So start with just crawling, then walking and eventually running. You might find that you cannot work all 8 cores when an ATLAS is running.
ID: 35993
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,994,413
RAC: 136,379
Message 35994 - Posted: 22 Jul 2018, 5:48:18 UTC - in response to Message 35993.  

Did you try the app_info.xml ...

Needs to be clarified:
app_info.xml can also be used by BOINC but for a completely different use case.
What you need here is an app_config.xml.
ID: 35994
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 35995 - Posted: 22 Jul 2018, 6:41:49 UTC - in response to Message 35983.  

A while ago Crystal Pellet pointed out a method to check the success of an ATLAS job using the PandaID.

Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200236979
PandaID (from stderr.txt): 3994761077
Now check: https://bigpanda.cern.ch/job?pandaid=3994761077

Although some of Yeti's stderr logs were incomplete, that method shows a successful job.
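
For anyone who wants to script the check quoted above, here is a minimal Python sketch. It assumes the PandaID shows up in stderr.txt as the word "PandaID" followed, on the same line, by a run of digits; the exact wording of that log line is not shown in this thread, so the pattern and the function names are illustrative only. The URL format is the one from the example. Note that the bigpanda page still asks for a login when opened.

```python
import re

# URL format taken from the example in this thread.
BIGPANDA_JOB_URL = "https://bigpanda.cern.ch/job?pandaid={panda_id}"

# Assumption: the log mentions the ID as "PandaID" followed by digits
# on the same line. Adjust the pattern to match your actual stderr.txt.
PANDA_ID_PATTERN = re.compile(r"PandaID[^\d\n]*(\d+)")


def find_panda_id(stderr_text):
    """Return the first PandaID found in the text, or None if there is none."""
    match = PANDA_ID_PATTERN.search(stderr_text)
    return match.group(1) if match else None


def panda_job_url(panda_id):
    """Build the bigpanda job page URL for a given PandaID."""
    return BIGPANDA_JOB_URL.format(panda_id=panda_id)


if __name__ == "__main__":
    sample = "PandaID (from stderr.txt): 3994761077"
    panda_id = find_panda_id(sample)
    if panda_id is not None:
        print(panda_job_url(panda_id))
```
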

When I try to use the Panda link, it always asks me for a login. Can someone provide a "how to"?

Thanks in advance


Supporting BOINC, a great concept !
ID: 35995
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,994,413
RAC: 136,379
Message 35996 - Posted: 22 Jul 2018, 7:14:15 UTC - in response to Message 35995.  

You may use one of the accounts given on the login page, e.g. the Google account that you use for your Android smartphone.
ID: 35996
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35997 - Posted: 22 Jul 2018, 7:22:36 UTC - in response to Message 35995.  

When I try to use the Panda link, it always asks me for a login. Can someone provide a "how to"?

Thanks in advance

It asked me to log in too, but there was a choice to log in with Facebook or Google. I don't do Facebook so I clicked Google. Then I got a box with my Gmail address in the email input field and a blank password field. I entered my password, clicked "sign in" and it let me in, no verification required. I have never registered an account at that site so I was surprised it was that easy. Hmmm, maybe it was that easy because I used the same email address I used to attach to the LHC project?
ID: 35997
AuxRx
Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 35999 - Posted: 22 Jul 2018, 9:12:08 UTC - in response to Message 35992.  

What have you tried so far?

Creating the app_config.xml as shown above, reading the config files and rebooting the system should clear up any issues. ATLAS generally doesn't like interruptions (stopping and starting), so the safe bet is to set a fixed use limit instead of a flexible "when computer is in use" limit. It might be easier (at least for troubleshooting) to crunch one project and only one subproject, e.g. LHC ATLAS.
ID: 35999
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,179,585
RAC: 105,376
Message 36000 - Posted: 22 Jul 2018, 11:00:52 UTC - in response to Message 35995.  

A while ago Crystal Pellet pointed out a method to check the success of an ATLAS job using the PandaID.

Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200236979
PandaID (from stderr.txt): 3994761077
Now check: https://bigpanda.cern.ch/job?pandaid=3994761077

Although some of Yeti's stderr logs were incomplete, that method shows a successful job.

When I try to use the Panda link, it always asks me for a login. Can someone provide a "how to"?

Thanks in advance

When you click on Panda at the top left, you then see Atlas-Panda; scroll down through the graphs to the end of the page.
ID: 36000
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36001 - Posted: 22 Jul 2018, 14:40:08 UTC - in response to Message 35994.  

Did you try the app_info.xml ...

Needs to be clarified:
app_info.xml can also be used by BOINC but for a completely different use case.
What you need here is an app_config.xml.


My bad. Thanks for clarification.
ID: 36001
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,154,428
RAC: 3,311
Message 36003 - Posted: 23 Jul 2018, 13:10:09 UTC - in response to Message 35992.  

202778167 seems to have worked. Of course I've thought that before. Panda says "finished" though.

The only change I made was to allow 8 CPU's.

I'm very much over my head with this but I'm trying and I appreciate everyone's patience.

- Dick
ID: 36003
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,994,413
RAC: 136,379
Message 36004 - Posted: 23 Jul 2018, 13:52:01 UTC - in response to Message 36003.  

The final result looks good.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=202778167


Nonetheless there are a few remarks:

1. The more cores you configure per VM the less efficient each VM will be.
This is a design issue.
1-core VMs are the most efficient in most cases, but if you run several of them concurrently they need lots of RAM.
To optimise the resource usage it is necessary to find the individual balance between RAM and CPU usage.
1-core and 2-core setups usually need a local RAM tuning via app_config.xml (>=4800MB).

2. "Setting CPU throttle for VM. (80%)"
Under certain circumstances (not in this case) a CPU throttle less than 100% may cause timing errors.
If possible, you may set a higher value.
Disadvantages may be a higher system temperature and more fan noise.

3.
2018-07-22 19:04:15 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:04:26 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:05:08 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:05:19 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:12:54 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:13:05 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:14:15 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:14:26 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:15:09 (23900): VM state change detected. (old = 'running', new = 'paused')
2018-07-22 19:15:20 (23900): VM state change detected. (old = 'paused', new = 'running')
2018-07-22 19:15:30 (23900): Stopping VM.
2018-07-22 19:16:16 (23900): Error in stop VM for VM: -182
Command:
VBoxManage -q controlvm "boinc_2c75cee3b312eaa0" savestate
Output:
0%...10%...20%...30%...40%...50%...60%...70%...
2018-07-22 19:16:16 (23900): VM did not stop when requested.
2018-07-22 19:16:16 (23900): VM was successfully terminated.

Looks suspect and close to a crash but your VM managed to recover.



4.
2018-07-23 02:54:17 (10616): Guest Log: -rw------- 1 atlas01 atlas01 141775081 Jul 23 02:49 HITS.14568781._061245.pool.root.1

Here it is, the famous HITS file.

5.
2018-07-23 02:54:17 (10616): VM Completion File Detected.
2018-07-23 02:54:17 (10616): Powering off VM.
2018-07-23 02:54:21 (10616): Successfully stopped VM.
<rest of the log>

looks good.
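
The pause/resume pattern and the stop error from remark 3 can be pulled out of a long stderr.txt with a short script. This is only a sketch: the two regular expressions are written against the log lines quoted above, and the function name and the shape of the returned summary are arbitrary choices, not anything provided by BOINC or VirtualBox.

```python
import re

# Written against lines such as:
# 2018-07-22 19:04:15 (23900): VM state change detected. (old = 'running', new = 'paused')
STATE_CHANGE = re.compile(
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)
# 2018-07-22 19:16:16 (23900): Error in stop VM for VM: -182
STOP_ERROR = re.compile(r"Error in stop VM for VM: (-?\d+)")


def summarize_vm_log(log_text):
    """Count transitions into 'paused' and collect stop-error codes."""
    pauses = 0
    stop_errors = []
    for line in log_text.splitlines():
        change = STATE_CHANGE.search(line)
        if change and change.group(2) == "paused":
            pauses += 1
        error = STOP_ERROR.search(line)
        if error:
            stop_errors.append(int(error.group(1)))
    return {"pauses": pauses, "stop_errors": stop_errors}
```

Many pauses within a few minutes, followed by a stop error, would match the "close to a crash" situation described in remark 3.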
ID: 36004
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36005 - Posted: 23 Jul 2018, 15:40:04 UTC - in response to Message 36003.  

202778167 seems to have worked. Of course I've thought that before. Panda says "finished" though.

If Panda says "finished" then it worked. I want to say something like...

Well done, Dick :) Perseverance pays off.

...but with the "2018-07-22 19:16:16 (23900): Error in stop VM for VM: -182" I fear you might have just got lucky with that one. It started wobbling and recovered but next time you might not be so lucky. The 80% CPU throttle for the VM is not a good idea and I will add that it may have caused this VM: -182 error.

The only change I made was to allow 8 CPU's

Not a good idea from an efficiency perspective, though I doubt it caused the VM: -182 error. 4 CPUs would be better; 2 should be your goal.

I'm very much over my head with this but I'm trying and I appreciate everyone's patience.

The only question(s) I have for you are:
1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both?
2) Did you create the app_info.xml? And if you did create it then did you verify that BOINC can find it and that it doesn't contain errors? Verify by opening BOINC Manager and clicking Options -> Read config files, then open the Event log (Tools -> Event log), scroll to the bottom and see if it says "LHC@home <date> Found app_config.xml". If it doesn't say that then you either did not create it or you created it in the wrong folder. If it does say "Found" then see if any red text follows. If it found errors then it will show those errors in red. If there are no errors then it is syntactically correct.
ID: 36005
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,994,413
RAC: 136,379
Message 36006 - Posted: 23 Jul 2018, 15:55:53 UTC - in response to Message 36005.  

1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both?
2) Did you create the app_info.xml? ...

@bronco
Again: Please, don't use "app_info.xml" in this context. It's "app_config.xml".
ID: 36006
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,154,428
RAC: 3,311
Message 36009 - Posted: 23 Jul 2018, 17:12:17 UTC - in response to Message 36006.  

1) What did you change to allow 8 CPUs? Did you change it in your website settings? In the app_info.xml? Or both?
2) Did you create the app_info.xml? ...

@bronco
Again: Please, don't use "app_info.xml" in this context. It's "app_config.xml".


Yup, figured out about the name.

I was unsure where to create the file but went back to the other threads and found it: C:\Users\All Users\BOINC\projects\lhcathome.cern.ch_lhcathome. (Windows 10 doesn't make it easy to find that sucker. All Users is considered a system file and by default doesn't show up in Explorer or Command Prompt until you change the default view, although you can get to it and its children via the prompt if you know the exact path. At that point it shows up in Explorer but still not in Command Prompt. I think I'll call Bill Gates.)

So, once I did that it was found by having BOINC read the config files. The file is as shown in another thread:

<app_config>
<app_version>
<app_name>ATLAS</app_name>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<avg_ncpus>2.0</avg_ncpus>
<cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
</app_version>
<project_max_concurrent>1</project_max_concurrent>
</app_config>
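
A file like the one above can also be sanity-checked outside BOINC before asking the client to read it. The following is a rough Python sketch, not BOINC's own validation: it only tests that the XML is well formed and that `avg_ncpus` agrees with the `--nthreads` value in `cmdline`. Treating those two as a pair that should match is an assumption based on both being set to 2 in this config; the function name is made up for the example.

```python
import re
import xml.etree.ElementTree as ET


def check_app_config(xml_text):
    """Return a list of problems found in the text of an app_config.xml."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "app_config":
        problems.append(f"root element is <{root.tag}>, expected <app_config>")

    for version in root.findall("app_version"):
        avg_ncpus = version.findtext("avg_ncpus")
        cmdline = version.findtext("cmdline") or ""
        threads = re.search(r"--nthreads\s+(\d+)", cmdline)
        if avg_ncpus is None or threads is None:
            continue  # nothing to compare in this <app_version>
        try:
            ncpus_value = float(avg_ncpus)
        except ValueError:
            problems.append(f"avg_ncpus is not a number: {avg_ncpus!r}")
            continue
        # Assumption: the two values are meant to agree, as they do above.
        if ncpus_value != float(threads.group(1)):
            problems.append(
                f"avg_ncpus={avg_ncpus} but --nthreads={threads.group(1)}"
            )
    return problems
```

An empty list means the sketch found nothing to complain about; the event log check in BOINC Manager remains the authoritative test.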

I've changed the # of CPUs on the website back to 2 which you suggested as the goal.

BTW, I noticed another oddity on the LHC website that the admins there might want to know. When I gave up on this earlier I told it no more ATLAS and forgot I'd done it. Didn't matter, I kept getting tasks. Just allowed it again.

- Dick
ID: 36009
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,994,413
RAC: 136,379
Message 36010 - Posted: 23 Jul 2018, 17:37:13 UTC - in response to Message 36009.  

BTW, I noticed another oddity on the LHC website that the admins there might want to know. When I gave up on this earlier I told it no more ATLAS and forgot I'd done it. Didn't matter, I kept getting tasks. Just allowed it again.

Did you set "If no work for selected applications is available, accept work from other applications?" on your web preferences page?
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
ID: 36010
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,154,428
RAC: 3,311
Message 36011 - Posted: 23 Jul 2018, 17:48:53 UTC - in response to Message 36010.  

Oops. Good point.
ID: 36011
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36013 - Posted: 23 Jul 2018, 18:14:39 UTC - in response to Message 36006.  

Again: Please, don't use "app_info.xml" in this context. It's "app_config.xml".


Sorry, I did it again. Bad habits die hard.
ID: 36013
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36014 - Posted: 23 Jul 2018, 18:29:08 UTC - in response to Message 36009.  

I was unsure where to create the file

Glad you found it. An easier way to find the path to BOINC's data directory is to look in the event log (BOINC Manager -> Tools -> Show event log). It's in the first 10 lines. If it's not, then it's because the log grew long and BOINC trimmed it. In that case just shut down the BOINC client, wait a few minutes for VBoxHeadless to shut down, restart the client and look at the log again.
ID: 36014
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,154,428
RAC: 3,311
Message 36058 - Posted: 26 Jul 2018, 13:02:41 UTC - in response to Message 36014.  

Questions re # CPUs:

I've got the number of CPUs set to 2 at the website and in the app_config.xml.

1) Are these settings redundant? If not, what's the difference in the effect?

2) Do they limit the number of CPU's for one task or the total number of CPU's in use for all LHC tasks?

3) Do they have any effect on what else can run (i.e. tasks from other projects)?

Here's the reason I'm asking. Right at this moment I have 3 running tasks (not including non-intensive and GPU), 1 LHC and 2 from another project, all single CPU. I have 4 ready to start tasks. All are LHC 2 CPU tasks, 2 ATLAS and 2 Theory.

What I expected is for 2 of those 2 CPU tasks to be running to bring the total CPUs to 7. I don't understand why they don't start and why they seem to be blocking other projects from running. If I suspend LHC then BOINC immediately starts polling other projects looking for work and starts those tasks up to the max of 8.

Thanks,

- Dick
ID: 36058
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,994,413
RAC: 136,379
Message 36059 - Posted: 26 Jul 2018, 13:53:12 UTC - in response to Message 36058.  

1) Are these settings redundant?

Not exactly.
The website setting only affects multicore apps, in this case ATLAS and Theory.
It configures the #cores to be used by the BOINC client's reports/calculations, the working set size which your local client needs to estimate if an additional task can be started and the RAM setting for your vbox VM.
In addition it calculates the computer's GFLOPS value.

Some values (avg_ncpus, nthreads, memory_size_mb) sent by the server can be overwritten by an app_config.xml, others not, e.g. the working set size.

It's recommended to keep the website setting in sync with the app_config.xml.


2) Do they limit the number of CPU's for one task

The setting is for 1 task => 3 2-core tasks use a total of 6 cores
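
The arithmetic behind that answer, written out as a small Python sketch. The per-task RAM figure of 4800 MB is borrowed from the `--memory_size_mb` value in the app_config.xml posted earlier in this thread; the 16384 MB of host RAM in the usage lines is a hypothetical number, and the function names are made up. The client's own estimate also depends on the working set size, which this sketch ignores.

```python
def resources_for(task_count, cores_per_task, ram_per_task_mb):
    """Total cores and RAM (in MB) claimed by a number of identical tasks."""
    return task_count * cores_per_task, task_count * ram_per_task_mb


def max_tasks(total_cores, total_ram_mb, cores_per_task, ram_per_task_mb):
    """Largest number of identical tasks that fits in both cores and RAM."""
    return min(total_cores // cores_per_task, total_ram_mb // ram_per_task_mb)


if __name__ == "__main__":
    # Three 2-core tasks at 4800 MB each.
    print(resources_for(3, 2, 4800))
    # Hypothetical host: 8 cores, 16384 MB RAM.
    print(max_tasks(8, 16384, 2, 4800))
```

On the hypothetical host, RAM rather than core count is the limiting factor, which is the RAM/CPU balance mentioned earlier in the thread.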


3) Do they have any effect on what else can run

Yes, but it's your BOINC client that keeps track of your total resources (cores, RAM, network access, ...)


I don't understand why they don't start...

Did you set any limits in your client GUI or via app_config.xml?


If I suspend LHC then BOINC immediately starts polling other projects ...

This is normal client behaviour and independent from LHC.
ID: 36059
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,154,428
RAC: 3,311
Message 36060 - Posted: 26 Jul 2018, 14:58:55 UTC - in response to Message 36059.  

Thanks for the quick response, computezrmle. I've been through several cycles of reading and rereading your response, writing and rewriting mine, and experimenting a bit more. Here's where I am.

1) Are these settings redundant?

Not exactly.
The website setting only affects multicore apps, in this case ATLAS and Theory.
It configures the #cores to be used by the BOINC client's reports/calculations, the working set size which your local client needs to estimate if an additional task can be started and the RAM setting for your vbox VM.
In addition it calculates the computer's GFLOPS value.

Some values (avg_ncpus, nthreads, memory_size_mb) sent by the server can be overwritten by an app_config.xml, others not, e.g. the working set size.

It's recommended to keep the website setting in sync with the app_config.xml.


That makes sense and it's what I'd assumed. I do have them in sync.


2) Do they limit the number of CPU's for one task

The setting is for 1 task => 3 2-core tasks use a total of 6 cores


OK, IIRC that means it ought to be running however many tasks it can as long as my total number of cores (8) is not exceeded, right? In the case I described, 2 more 2 core tasks should have started bringing the total to 7. Another single core task could have started if I'd had one waiting.

3) Do they have any effect on what else can run

Yes, but it's your BOINC client that keeps track of your total resources (cores, RAM, network access, ...)


I don't understand why they don't start...

Did you set any limits in your client GUI or via app_config.xml?


The only app_config.xml file I have is in the LHC folder (c:\users\all users\BOINC\projects\lhcathome.cern.ch_lhcathome). I initially put it into the BOINC folder by mistake but then moved it and have had the manager reread the config files.

As for the GUI, the preferences are set to 100% of CPUs.

If I suspend LHC then BOINC immediately starts polling other projects ...

This is normal client behaviour and independent from LHC.


Yup. I just included that to show it worked normally if I suspended LHC. With LHC unsuspended, BOINC not only didn't start the additional LHC tasks, it wouldn't ask for tasks from any other project even though I had 5 more cores available for work.

At this point, since I fiddled a bit to see what happens, I've got 8 tasks running from other projects, 6 NFS and 2 WCG. I suspended NFS and WCG and my expectation was that 7 cores' worth of LHC tasks (three 2-core and one 1-core were waiting) would start up. What happens is that one single-core LHC task starts up and nothing else, leaving 7 cores sitting on their hands. That's the behavior I don't understand. My little machine is not being fully utilized by BOINC in general or LHC in particular and the latter seems to be the bottleneck for some reason.

I understand that this can happen if, for example, I have a task needing 8 CPUs on top of the priority list and waiting. BOINC will hold off on starting anything else until the high priority task can get what it needs. Same with a 2 core task if 7 of my 8 are already in use. As non-technical as I am, I get that. That's not what's happening here though. I've got idle cores crying for work! Well, actually they're not crying, just idle. I'm doing the cry^^^whining :-)
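
One setting that may be worth checking here is the `<project_max_concurrent>1</project_max_concurrent>` line in the app_config.xml posted earlier in this thread. As the BOINC documentation describes it, that element limits how many tasks of the project run at the same time. The Python sketch below is a toy model under that assumption, not the BOINC client's real scheduler; the function and parameter names are invented. It only illustrates how such a limit could leave cores idle, and it does not explain why work fetch from other projects stalled.

```python
def pick_tasks(ready_tasks, free_cores, running_per_project, max_concurrent):
    """Greedy toy model: start tasks in order while cores and limits allow.

    ready_tasks: list of (project, cores) tuples in priority order.
    running_per_project: dict of project -> number of tasks already running.
    max_concurrent: dict of project -> limit; unlisted projects are unlimited.
    """
    running = dict(running_per_project)
    started = []
    for project, cores in ready_tasks:
        limit = max_concurrent.get(project)
        if limit is not None and running.get(project, 0) >= limit:
            continue  # project is already at its concurrency limit
        if cores > free_cores:
            continue  # not enough idle cores for this task
        started.append((project, cores))
        free_cores -= cores
        running[project] = running.get(project, 0) + 1
    return started


if __name__ == "__main__":
    # The situation described above: 8 idle cores, one 1-core and
    # three 2-core LHC tasks waiting.
    waiting = [("LHC", 1), ("LHC", 2), ("LHC", 2), ("LHC", 2)]
    print(pick_tasks(waiting, 8, {}, {"LHC": 1}))
    print(pick_tasks(waiting, 8, {}, {}))
```

With the limit set to 1 the model starts a single LHC task and leaves 7 cores idle; without the limit it starts all four tasks and uses 7 cores.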
ID: 36060