Message boards : ATLAS application : Non-zero return code from EVNTtoHITS (65) (Error code 65)

1 · 2 · 3 · 4 · Next

dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,152,883
RAC: 3,351
Message 35958 - Posted: 19 Jul 2018, 23:22:29 UTC

I'm likely creating a duplicate thread for this problem, but I just haven't been able to find a current thread that addresses it. Please redirect me if that's the case.

I am consistently getting the above message on ATLAS tasks. I only became aware of this today when I got an email from an admin letting me know.

I gather that the most common cause for this is lack of memory. ATLAS requires 8GB. The computer I'm running LHC on has 16GB. My config is set to use up to 90% of that when the computer is NOT in use. When it is in use I had it set to 50%, but that's still 8GB. I've upped it to 70% now, so we'll see. I'm guessing that won't solve it, though.

My max CPUs is set in my account on the website at 2.

Not sure if this is relevant but I'm also occasionally getting this message on ATLAS tasks: "Postponed: VM job unmanageable, restarting later."

I'm not the most technical of users (as you can probably tell). For years I ran only sixtrack, but this is my favorite BOINC project (my participation dates back to 9/1/04, the first day of LHC@home), so I wanted to do more. It took quite a while to figure out how to get it working on one machine, and my other one simply refuses altogether. Sigh.

Any help greatly appreciated but best to keep it as simple as possible. :-)

Thanks.

- Dick
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35959 - Posted: 20 Jul 2018, 2:31:46 UTC - in response to Message 35958.  
Last modified: 20 Jul 2018, 2:34:23 UTC

The problem seems to be on this host with VirtualBox 5.2.8 and 16GB RAM. The CPU is an Intel(R) Core(TM) i7-8550U, which has 4 cores; BOINC reports 8 cores because hyper-threading is enabled.

Any help greatly appreciated but best to keep it as simple as possible. :-)

Unfortunately there is no simple answer, no simple "flip switch A and your problems will be solved forever" kind of solution. You'll have to try a few different ideas and see what works for you. All we can do is give you some general ideas; then it's up to you to experiment a little until you find the settings/configuration that works for you. The first thing to realize is that ATLAS tasks require a lot of memory and other resources, and when run in VirtualBox they put a huge load on your computer. If you are running other applications and BOINC projects that also load the system heavily, then you will get problems.

Not sure if this is relevant but I'm also occasionally getting this message on ATLAS tasks: "Postponed: VM job unmanageable, restarting later."

It is very relevant. It is an indication that VBox isn't getting the RAM and/or other resources it needs. Sometimes the job restarts when the system load is not as high, VBox gets things sorted out and the job proceeds. Sometimes it does not.

I gather that the most common cause for this is lack of memory. ATLAS requires 8GB. The computer I'm running LHC on has 16GB. My config is set to use up to 90% of that when the computer is NOT in use. When it is in use I had it set to 50%, but that's still 8GB. I've upped it to 70% now, so we'll see. I'm guessing that won't solve it, though.

Actually it might solve it. Remember that ATLAS tasks run in VBox (VirtualBox), and VBox does not like to be suspended and resumed quickly. When you had the setting at 50%, it's possible that other apps were repeatedly using more than 50% for short durations, which would cause BOINC to suspend and resume VBox rapidly and repeatedly, which is precisely the thing VBox does not handle well. That's one theory for why you are getting the errors. If setting it at 70% doesn't help, then try 90% or even higher.

My max CPUs is set in my account on the website at 2.

That should work on most 8-core CPUs with 16GB RAM if you are not overloading the system with other BOINC tasks and/or personal applications. The thing is, you don't actually have 8 cores; you have only 4. You also have hyper-threading turned on, which makes BOINC think you have 8 cores, and that might in turn cause BOINC to run more tasks than your system can actually handle when one of them is ATLAS. You might try reducing "max CPUs" to 1. You might also try reducing the number of other applications and BOINC projects.
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,152,883
RAC: 3,351
Message 35960 - Posted: 20 Jul 2018, 3:28:54 UTC - in response to Message 35959.  

Thank you Bronco. I wasn't hoping for an answer that was simple in implementation, just simple enough technically that I could understand the advice. You succeeded.

The easiest solutions you proposed seem to be:

1) The change I already made to the percentage of memory BOINC may use when the computer is in use, from 50% to 70%. This requires no further action for now. If it doesn't work I'll up the percentage further.
2) If that doesn't help I'll reduce the number of CPUs from 2 to 1.
3) Reread your note and go from there.

Thanks again.

- Dick
gyllic
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 35961 - Posted: 20 Jul 2018, 6:33:54 UTC - in response to Message 35960.  

If the ideas already mentioned don't help, the error may be caused by a "wrong server configuration": if you run 1-core or 2-core ATLAS VBox tasks, the server sends too low a RAM setting for those tasks.
To change that, you have to use an app_config.xml. See this post: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=35921
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,933,564
RAC: 137,738
Message 35968 - Posted: 20 Jul 2018, 12:55:08 UTC - in response to Message 35958.  

... became aware of this today when I got an email from an admin letting me know.

Which admin and (more interesting as it may be helpful for others also) what was his/her suggestion?
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,152,883
RAC: 3,351
Message 35969 - Posted: 20 Jul 2018, 12:58:27 UTC - in response to Message 35961.  

Thanks, gyllic. I had actually run across that in my searching and I'll add it to my list of things to try.
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,152,883
RAC: 3,351
Message 35970 - Posted: 20 Jul 2018, 14:33:49 UTC - in response to Message 35968.  

... became aware of this today when I got an email from an admin letting me know.

Which admin and (more interesting as it may be helpful for others also) what was his/her suggestion?


Actually it was Bronco who answered above. His private advice was to come here and ask. :-)
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35971 - Posted: 20 Jul 2018, 16:46:40 UTC - in response to Message 35970.  

... became aware of this today when I got an email from an admin letting me know.

Which admin and (more interesting as it may be helpful for others also) what was his/her suggestion?


Actually it was Bronco who answered above. His private advice was to come here and ask. :-)


Yes, it was me and I contacted you via PM not email. Also, I am not an admin (maybe). But those are minor details.

As you said in your op, you wanted to do more. I do too, but at the moment I don't have additional computing resources to devote to this great project. What I do have is time, and a suspicion that a great number of ATLAS tasks are returning no useful work in spite of the fact that they validate. So I created a script that grabs the host IDs, result IDs and result pages for a range of user IDs and analyses their ATLAS results (if any). It counts CPU and run time for all ATLAS tasks, and CPU and run time for ATLAS tasks that don't return a HITS file, and saves a list of user IDs returning "no HITters" to disk. I ran the script on user IDs from 67 to 10,000 and discovered that 7% of the run time spent on ATLAS tasks is a total waste.
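For illustration, the tallying step of such a script might look like the sketch below. This is hypothetical: the page-fetching part is omitted, each result is assumed to have been reduced to a (run time, stderr text) pair, and the HITS filename pattern is inferred from log lines quoted elsewhere in this thread.

```python
import re

# Filename pattern inferred from a log line such as
# "HITS.14568781._033697.pool.root.1" (an assumption, not a spec).
HITS_LINE = re.compile(r"HITS\.\d+\._\d+\.pool\.root\.\d+")

def tally(results):
    """Split total run time into HITS-producing and no-HITS portions.

    results: iterable of (run_time_seconds, stderr_text) tuples.
    """
    with_hits = without_hits = 0
    for run_time, stderr in results:
        if HITS_LINE.search(stderr):
            with_hits += run_time
        else:
            without_hits += run_time
    return with_hits, without_hits

# Demo on made-up data: one good result, one "no HITter".
sample = [
    (7200, "... HITS.14568781._033697.pool.root.1 ..."),
    (1980, "... Non-zero return code from EVNTtoHITS (65) ..."),
]
good, wasted = tally(sample)
print(f"wasted fraction: {wasted / (good + wasted):.1%}")
```

As later posts in this thread point out, the stderr log is not a fully reliable indicator, so a real script would need a better test than this regex.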

Then I reasoned that if I could get that 7% down to 1%, it would get more additional work done than me buying another computer to devote to LHC (which isn't going to happen with RAM as expensive as it is and the huge requirements of LHC tasks other than sixtrack). I composed a "form letter" to send to user IDs that are returning ATLAS results sans HITS files. So far I have sent that letter only to dduggan47, for a number of reasons:

1) I wanted to do at least 1 test run of the form letter to help gauge user reaction to it
2) I'm still learning how to automate sending the form letter
3) I don't want to create a flood of angry users seeking advice and wondering why the hell this project validates useless results and leaves them with the impression that all is OK when it is not; a trickle of angry users seems easier to manage and perhaps more effective.
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,152,883
RAC: 3,351
Message 35973 - Posted: 20 Jul 2018, 17:36:32 UTC - in response to Message 35971.  

... became aware of this today when I got an email from an admin letting me know.

Which admin and (more interesting as it may be helpful for others also) what was his/her suggestion?


Actually it was Bronco who answered above. His private advice was to come here and ask. :-)


Yes, it was me and I contacted you via PM not email. Also, I am not an admin (maybe). But those are minor details.

As you said in your op, you wanted to do more. I do too, but at the moment I don't have additional computing resources to devote to this great project. What I do have is time, and a suspicion that a great number of ATLAS tasks are returning no useful work in spite of the fact that they validate. So I created a script that grabs the host IDs, result IDs and result pages for a range of user IDs and analyses their ATLAS results (if any). It counts CPU and run time for all ATLAS tasks, and CPU and run time for ATLAS tasks that don't return a HITS file, and saves a list of user IDs returning "no HITters" to disk. I ran the script on user IDs from 67 to 10,000 and discovered that 7% of the run time spent on ATLAS tasks is a total waste.

Then I reasoned that if I could get that 7% down to 1%, it would get more additional work done than me buying another computer to devote to LHC (which isn't going to happen with RAM as expensive as it is and the huge requirements of LHC tasks other than sixtrack). I composed a "form letter" to send to user IDs that are returning ATLAS results sans HITS files. So far I have sent that letter only to dduggan47, for a number of reasons:

1) I wanted to do at least 1 test run of the form letter to help gauge user reaction to it
2) I'm still learning how to automate sending the form letter
3) I don't want to create a flood of angry users seeking advice and wondering why the hell this project validates useless results and leaves them with the impression that all is OK when it is not; a trickle of angry users seems easier to manage and perhaps more effective.


OK. Just FYI, what I got was an email which contained all the text of your message (not just a link to the board). The email's subject was "[LHC@home] - private message" and the from address was "Admin.Lhcathome@cern.ch via cern.onmicrosoft.com". In the text it said "From: bronco (ID 569117)".

As you do more of these I suspect you'll find others interpreting that as an email from an admin who goes by bronco.

Anyway, your test worked and you did not create an angry user. I'm happy to be your guinea pig. :-) Using the information you and others have provided I'll either fix the problem or skip ATLAS.
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35975 - Posted: 20 Jul 2018, 17:58:38 UTC - in response to Message 35973.  
Last modified: 20 Jul 2018, 17:59:16 UTC

OK. Just FYI, what I got was an email which contained all the text of your message (not just a link to the board). The email's subject was "[LHC@home] - private message" and the from address was "Admin.Lhcathome@cern.ch via cern.onmicrosoft.com". In the text it said "From: bronco (ID 569117)".

I suspected as much. I mentioned it only because I didn't want anybody thinking I have access to your registered email address. Perhaps I should mention in the PM that I am not an admin.

Anyway, your test worked and you did not create an angry user. I'm happy to be your guinea pig. :-) Using the information you and others have provided I'll either fix the problem or skip ATLAS.

You are the earliest volunteer who is still returning results regularly so I thought the honor of being my first guinea pig should go to you. It's not much of an honor but it's the best I can do :)
Glad to see you're still dedicated after all these years.
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,152,883
RAC: 3,351
Message 35976 - Posted: 20 Jul 2018, 18:42:59 UTC - in response to Message 35975.  


You are the earliest volunteer who is still returning results regularly so I thought the honor of being my first guinea pig should go to you. It's not much of an honor but it's the best I can do :)
Glad to see you're still dedicated after all these years.


At my age I'll take honors wherever I can get them. :-)
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 35982 - Posted: 21 Jul 2018, 10:44:35 UTC

Okay, now I got a PM from bronco also.

Bronco, you mentioned that these three machines of mine were producing bad results for ATLAS tasks:

WoolyW10: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10487431
- checked this result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200236826
- To me it has a HITS file, am I right?

ObiWan: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10491929
This machine may really have a problem

PADME: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10331543
- checked this result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200241085
- To me it has a HITS file, am I right?

Bronco, as David Cameron has stated in the past, it can sometimes happen that a WU produces no HITS file and it is NOT a problem of the client!

As you can see with my hosts WoolyW10 and PADME, they produce results with HITS files.

So I suggest you re-program your script to check, per host, the ratio of results with and without HITS files, and to check whether the latest results have a HITS file.

I would be interested in checking my hosts periodically for the HITS file issue; what would I need in order to run your script myself?

Can I see more details about what you have found out about my hosts?


Supporting BOINC, a great concept !
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,933,564
RAC: 137,738
Message 35983 - Posted: 21 Jul 2018, 12:34:49 UTC

A while ago Crystal Pellet pointed out a method for checking the success of an ATLAS job using the PandaID.

Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200236979
PandaID (from stderr.txt): 3994761077
Now check: https://bigpanda.cern.ch/job?pandaid=3994761077

Although some of Yeti's stderr logs were incomplete, that method shows a successful job.
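That check can be scripted. Below is a minimal sketch: the stderr snippet is sample text modelled on the example above, and since the exact label format in real logs may vary, the pattern is kept deliberately permissive ("PandaID" followed by the next run of digits).

```python
import re

# Permissive pattern: "PandaID", then any non-digit filler, then the ID.
PANDA_ID = re.compile(r"PandaID\D*?(\d{6,})")

def panda_url(stderr_text):
    """Return the bigpanda lookup URL for the first PandaID found, else None."""
    m = PANDA_ID.search(stderr_text)
    if m is None:
        return None
    return "https://bigpanda.cern.ch/job?pandaid=" + m.group(1)

sample = "... PandaID (from stderr.txt): 3994761077 ..."
print(panda_url(sample))
```

The resulting page on bigpanda.cern.ch then shows the job's actual status, independent of what made it into the BOINC stderr log.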
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,373,366
RAC: 102,139
Message 35984 - Posted: 21 Jul 2018, 13:01:14 UTC - in response to Message 35983.  

A while ago Crystal Pellet pointed out a method for checking the success of an ATLAS job using the PandaID.
I guess that was prompted by my post about the problem that only 2 or 3 out of 10 finished ATLAS tasks showed a HITS file in the stderr.

Following Crystal Pellet's advice and checking with Panda, I noticed that ALL of my uploaded ATLAS results had a HITS file.

From this, I figured that the information shown in the stderr is rather unreliable, not to say unusable, for whatever reason ...?
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35986 - Posted: 21 Jul 2018, 14:05:08 UTC - in response to Message 35982.  

So I suggest you re-program your script to check, per host, the ratio of results with and without HITS files, and to check whether the latest results have a HITS file.

Thank you for your suggestion. I realize now the script is deficient as you describe and I will make the necessary adjustments before proceeding. Thanks also for pointing out that sometimes it is not the fault of the host. I wasn't aware of that fact.

I would be interested in checking my hosts periodically for the HITS file issue; what would I need in order to run your script myself?

Although you check your hosts periodically, there are perhaps many others who do not. I am also confident that there are numerous participants who don't know about the problem. I apologize for hurting your feelings but I believe that over time my project will help many volunteers and that they will be grateful.

Can I see more details about my hosts you have found out ?

I have already revealed to you everything I know about your hosts. I don't have a "back door" into LHC's database or into your hosts if that is what you mean. The script does only what any volunteer can do manually. It fetches the publicly available web pages for your host(s) and extracts info pertaining to your ATLAS results, info that any volunteer with an LHC@home account can access.

BTW, my plan has always been to run the script once and only once for each participant and to contact participants one time only. I will not be re-examining anybody's results to see if they have made corrections to their hosts. There will be no regular reminders or anything of that nature. It's a once-then-off effort.
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35987 - Posted: 21 Jul 2018, 14:50:56 UTC - in response to Message 35983.  

A while ago Crystal Pellet pointed out a method for checking the success of an ATLAS job using the PandaID...

Thanks. I will definitely use that method.
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,152,883
RAC: 3,351
Message 35988 - Posted: 21 Jul 2018, 18:19:00 UTC - in response to Message 35975.  

The change from 50% to 70% of RAM when the computer is in use had no effect. I still got the HITS error.

My next thought was to modify the app_config.xml. That file doesn't seem to exist on my computer. If I should create it, where should I put it? Even though I worked around it (as you'll see below), I'd like to know for future reference where it is/should be.

Meanwhile I modified the # of CPUs from 2 to 3. Task 201884680 did not have the error! I'm pretty sure that's the first good one I've had. I'd suspended 2 other tasks before they started to see how that one came out and have now released them. If they work then the problem is definitely solved!

Thanks all, but especially bronco, for the help.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,933,564
RAC: 137,738
Message 35989 - Posted: 21 Jul 2018, 20:07:43 UTC - in response to Message 35988.  

The change from 50% to 70% of RAM when the computer is in use had no effect. I still got the HITS error.

This can have an influence on the "postponed ..." error but not on the HITS error.

My next thought was to modify the app_config.xml.

That's the right way.
Create it in "<basic_boinc_folder>\projects\lhcathome.cern.ch_lhcathome".
This is an example:
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
  </app_version>
</app_config> 


To activate it (doesn't change already running VMs) run "reload config files" from your BOINC GUI.

Meanwhile I modified the # of CPUs from 2 to 3. Task 201884680 did not have the error!

The default RAM setting (for your VM) is calculated by the server based on the number of cores.
Recent tasks obviously need more RAM to expand the EVNT files than is allocated for 1-core or 2-core setups.
4800MB should be enough, so you can stay with a 2-core setup, which is more efficient than a setup with more cores.
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35990 - Posted: 21 Jul 2018, 22:18:47 UTC - in response to Message 35988.  

Task 201884680 did not have the error! I'm pretty sure that's the first good one I've had.

True, it does not have the "Non-zero return code from EVNTtoHITS (65) (Error code 65)" error, but unfortunately it also does not have the line that indicates it created the HITS file, which looks similar to...

2018-07-19 17:12:45 (5688): Guest Log: -rw------- 1 atlas01 atlas01 139559066 Jul 19 17:08 HITS.14568781._033697.pool.root.1

The timestamp and file size in the above line will be different for every result, but the filename should (I think) be very similar and should differ only in the numbers between "HITS" and ".pool.root.1".

Also, that task (201884680) ran for only 33 minutes. I am led to believe that tasks that produce HITS files usually require about 2 hours on the fastest CPUs.

Anyway, new info provided in this thread by Yeti, computezrmle and Erich56 convinces me that the above tests are not reliable. It seems that if BOINC and/or VBox and/or other involved processes get too busy, they fail to log properly to stderr output. See this post regarding the PandaID for what looks to be the most reliable test. BTW, the tasks I mentioned in my PM to you fail the Panda test too.

I will be revamping my script to use the PandaID test instead of the less reliable tests.
dduggan47
Joined: 1 Sep 04
Posts: 52
Credit: 11,152,883
RAC: 3,351
Message 35992 - Posted: 22 Jul 2018, 2:11:44 UTC - in response to Message 35990.  

Thanks, bronco. Sigh.

I let the 2 tasks still running complete just in case, but I've changed my preferences to turn off ATLAS.

If you or anybody else has any thoughts on what I might do to solve the problem, please pass them along and I'll give it another shot.