Message boards :
ATLAS application :
Non-zero return code from EVNTtoHITS (65) (Error code 65)
Joined: 1 Sep 04 | Posts: 52 | Credit: 11,152,883 | RAC: 3,351
I'm likely creating a duplicate thread for this problem, but I haven't been able to find a current thread that addresses it. Please redirect me if that's the case.

I am consistently getting the above message on ATLAS tasks. I only became aware of this today when I got an email from an admin letting me know. I gather that the most common cause for this is lack of memory. ATLAS requires 8GB. The computer I'm running LHC on has 16GB. My config is set to use up to 90% of that when the computer is NOT in use. When it is in use I had it set to 50%, but that's still 8GB. I've upped it to 70% now, so we'll see. I'm guessing that won't solve it, though. My max CPUs is set to 2 in my account on the website.

Not sure if this is relevant, but I'm also occasionally getting this message on ATLAS tasks: "Postponed: VM job unmanageable, restarting later."

I'm not the most technical of users (as you can probably tell). For years I ran only sixtrack, but this is my favorite BOINC project (my participation dates back to 9/1/4, the first day of LHC@home), so I wanted to do more. It took quite a while to figure out how on one machine, and my other one simply refuses altogether. Sigh.

Any help greatly appreciated, but best to keep it as simple as possible. :-) Thanks.
- Dick
Joined: 13 Apr 18 | Posts: 443 | Credit: 8,438,885 | RAC: 0
The problem seems to be on this host with VirtualBox 5.2.8 and 16GB RAM. The CPU is an Intel(R) Core(TM) i7-8550U, which has 4 cores. BOINC reports 8 cores, so it's multi-threading.

> Any help greatly appreciated but best to keep it as simple as possible. :-)

Unfortunately there is no simple answer, no simple "flip switch A and your problems will be solved forever" kind of solution. You'll have to try a few different ideas and see what works for you. All we can do is give you some general ideas; then it's up to you to experiment a little until you find the settings/configuration that works for you.

The first thing to realize is that ATLAS tasks require a lot of memory and other resources, and when run in VirtualBox they put a huge load on your computer. If you are running other applications and BOINC projects that also load the system heavily, then you get problems.

> Not sure if this is relevant but I'm also occasionally getting this message on ATLAS tasks: "Postponed: VM job unmanageable, restarting later."

It is very relevant. It is an indication that VBox isn't getting the RAM and/or other resources it needs. Sometimes the job restarts when the system load is lower, VBox gets things sorted out, and the job proceeds. Sometimes it does not.

> I gather that the most common cause for this is lack of memory. ATLAS requires 8GB. The computer I'm running LHC on has 16GB. My config is set to use up to 90% of that when the computer is NOT in use. When it is in use I had it set to 50%, but that's still 8GB. I've upped it to 70% now so we'll see. I'm guessing that won't solve it though.

Actually it might solve it. Remember, ATLAS tasks run in VBox (VirtualBox), and VBox does not like to be suspended/resumed quickly. When you had the setting at 50%, it's possible that other apps were repeatedly using more than 50% for short durations, which would cause BOINC to suspend and resume VBox quickly and repeatedly, which is precisely the thing VBox does not handle well.

That's one theory for why you are getting the errors. If setting it at 70% doesn't help, then try 90% or even higher.

> My max CPUs is set in my account on the website at 2.

That should work on most 8-core CPUs with 16GB RAM if you are not overloading the system with other BOINC tasks and/or personal applications. The thing is, you don't actually have 8 cores. You have only 4 cores. You also have multi-threading turned on, which makes BOINC think you have 8 cores, which might in turn be causing BOINC to run more tasks than your system can actually handle when one of them is ATLAS. You might try reducing "max CPUs" to 1. You might also try reducing the number of other applications and BOINC projects.
Joined: 1 Sep 04 | Posts: 52 | Credit: 11,152,883 | RAC: 3,351
Thank you, Bronco. I wasn't hoping for an answer that was simple in implementation, just simple enough technically that I could understand the advice. You succeeded. The easiest solutions you proposed seem to be:
1) The change I already made, raising the % of memory for BOINC when the computer is in use from 50% to 70%. This requires no further change. If it doesn't work I'll up the % further.
2) If that doesn't help, I'll reduce the number of CPUs from 2 to 1.
3) Reread your note and go from there.
Thanks again.
- Dick
Joined: 9 Dec 14 | Posts: 202 | Credit: 2,533,875 | RAC: 0
If the ideas already mentioned don't help, the error may be caused by a "wrong server configuration": if you want to run 1-core or 2-core ATLAS VBox tasks, the server sends a RAM setting that is too low for those tasks. To change that, you have to use an app_config.xml. See this post: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=35921
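For reference, a sketch of what such an app_config.xml can look like. The values below are taken from advice given later in this thread; adapt --nthreads, avg_ncpus and --memory_size_mb to your own setup and check the linked post for current recommendations:

```xml
<app_config>
   <app>
      <name>ATLAS</name>
      <max_concurrent>1</max_concurrent>
   </app>
   <app_version>
      <app_name>ATLAS</app_name>
      <plan_class>vbox64_mt_mcore_atlas</plan_class>
      <avg_ncpus>2.0</avg_ncpus>
      <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
   </app_version>
</app_config>
```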
Joined: 15 Jun 08 | Posts: 2386 | Credit: 222,933,564 | RAC: 137,738
> ... became aware of this today when I got an email from an admin letting me know.

Which admin, and (more interesting, as it may be helpful for others too) what was his/her suggestion?
Joined: 1 Sep 04 | Posts: 52 | Credit: 11,152,883 | RAC: 3,351
Thanks, gyllic. I had actually run across that in my searching and I'll add it to my list of things to try. |
Joined: 1 Sep 04 | Posts: 52 | Credit: 11,152,883 | RAC: 3,351
> ... became aware of this today when I got an email from an admin letting me know.

Actually it was Bronco who answered above. His private advice was to come here and ask. :-)
Joined: 13 Apr 18 | Posts: 443 | Credit: 8,438,885 | RAC: 0
> ... became aware of this today when I got an email from an admin letting me know.

Yes, it was me, and I contacted you via PM, not email. Also, I am not an admin (maybe). But those are minor details.

As you said in your OP, you wanted to do more. I do too, but at the moment I don't have additional computing resources to devote to this great project. What I do have is time, and a suspicion that a great number of ATLAS tasks are returning no useful work in spite of the fact that they validate. So I created a script that grabs the host IDs, result IDs and result pages from a range of user IDs and analyses their ATLAS results (if any). It counts CPU and run time for all ATLAS tasks, and CPU and run time for ATLAS tasks that don't return a HITS file, and saves a list of user IDs returning "no HITters" to disk. I ran the script on user IDs from 67 to 10,000 and discovered that 7% of the run time spent on ATLAS tasks is a total waste. Then I reasoned that if I could get that 7% down to 1%, it would get more additional work done than me buying another computer to devote to LHC (which isn't going to happen with RAM as expensive as it is and the huge requirements of LHC tasks other than sixtrack).

I composed a "form letter" to send to user IDs that are returning ATLAS results sans HITS files. So far I have sent that letter only to dduggan47, for a number of reasons:
1) I wanted to do at least one test run of the form letter to help gauge user reaction to it.
2) I'm still learning how to automate sending the form letter.
3) I don't want to create a flood of angry users seeking advice and wondering why the hell this project validates useless results and leaves them with the impression that all is OK when it is not; a trickle of angry users seems easier to manage and perhaps more effective.
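The core of such an analysis can be sketched as follows. This is a hypothetical reconstruction, not bronco's actual script: it assumes each result's stderr text has already been fetched from the result pages, and that a produced HITS file shows up in stderr as a filename like "HITS.14568781._033697.pool.root.1" (the pattern quoted later in this thread):

```python
import re

# A HITS output file appears in stderr as e.g. HITS.14568781._033697.pool.root.1
# (assumption based on logs quoted in this thread).
HITS_LINE = re.compile(r"HITS\.\d+\._\d+\.pool\.root\.\d+")

def returned_hits(stderr_text):
    """True if the stderr log mentions a produced HITS file."""
    return HITS_LINE.search(stderr_text) is not None

def wasted_runtime_fraction(results):
    """results: iterable of (run_time_seconds, stderr_text) pairs.

    Returns the fraction of total run time spent on ATLAS results
    that never produced a HITS file ("no HITters")."""
    total = wasted = 0.0
    for run_time, stderr_text in results:
        total += run_time
        if not returned_hits(stderr_text):
            wasted += run_time
    return wasted / total if total else 0.0
```

By this kind of measure, a host whose run time split evenly between HITS and no-HITS results would score 0.5; the 7% figure mentioned above corresponds to 0.07 across all scanned users.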
Joined: 1 Sep 04 | Posts: 52 | Credit: 11,152,883 | RAC: 3,351
> ... became aware of this today when I got an email from an admin letting me know.

OK. Just FYI, what I got was an email which contained all the text of your message (not just a link to the board). The email's subject was "[LHC@home] - private message". The from address was "Admin.Lhcathome@cern.ch via cern.onmicrosoft.com". In the text it said "From: bronco (ID 569117)". As you do more of these, I suspect you'll find others interpreting that as an email from an admin who goes by bronco.

Anyway, your test worked and you did not create an angry user. I'm happy to be your guinea pig. :-) Using the information you and others have provided, I'll either fix the problem or skip ATLAS.
Joined: 13 Apr 18 | Posts: 443 | Credit: 8,438,885 | RAC: 0
> OK. Just FYI, what I got was an email which contained all the text of your message (not just a link to the board). The email's subject was "[LHC@home] - private message".

I suspected as much. I mentioned it only because I didn't want anybody thinking I have access to your registered email address. Perhaps I should mention in the PM that I am not an admin.

> Anyway, your test worked and you did not create an angry user. I'm happy to be your guinea pig. :-) Using the information you and others have provided I'll either fix the problem or skip ATLAS.

You are the earliest volunteer who is still returning results regularly, so I thought the honor of being my first guinea pig should go to you. It's not much of an honor, but it's the best I can do :) Glad to see you're still dedicated after all these years.
Joined: 1 Sep 04 | Posts: 52 | Credit: 11,152,883 | RAC: 3,351
At my age I'll take honors wherever I can get them. :-) |
Joined: 2 Sep 04 | Posts: 453 | Credit: 193,369,412 | RAC: 10,065
Okay, now I got a PM from bronco also. Bronco, you mentioned these three machines of mine were producing bad results for ATLAS tasks:

WoolyW10: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10487431 - I checked this result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200236826 - To me it has a HITS file, am I right?

ObiWan: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10491929 - This machine may really have a problem.

PADME: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10331543 - I checked this result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200241085 - To me it has a HITS file, am I right?

Bronco, as David Cameron has stated in the past, it sometimes happens that some WUs produce no HITS file and it is NOT a problem of the client! As you can see with my hosts WoolyW10 and PADME, they produce results with HITS files. So I assume you should re-program your script to check the relation of results per host with and without HITS files, and check whether the latest results have a HITS file.

I would be interested in checking my hosts periodically for the HITS file theme; what would be necessary to run your script for me? Can I see more details about what you have found out about my hosts?

Supporting BOINC, a great concept!
Joined: 15 Jun 08 | Posts: 2386 | Credit: 222,933,564 | RAC: 137,738
A while ago Crystal Pellet pointed out a method to check the success of an ATLAS job using the PandaID.

Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=200236979
PandaID (from stderr.txt): 3994761077
Now check: https://bigpanda.cern.ch/job?pandaid=3994761077

Although some of Yeti's stderr logs were incomplete, that method shows a successful job.
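That lookup is easy to script. A minimal sketch; the regex is an assumption about how the PandaID appears in stderr.txt (adjust it to your actual logs), and fetching the bigpanda page itself is left out:

```python
import re

# Assumed stderr format: the word "PandaID" followed shortly by the numeric ID,
# e.g. "PandaID: 3994761077".
PANDA_ID = re.compile(r"PandaID[^0-9]{0,5}(\d+)")

def panda_job_url(stderr_text):
    """Return the bigpanda URL for the first PandaID found in stderr_text,
    or None if no PandaID is present."""
    m = PANDA_ID.search(stderr_text)
    return f"https://bigpanda.cern.ch/job?pandaid={m.group(1)}" if m else None
```

For the example above, panda_job_url("PandaID: 3994761077") yields the bigpanda link shown, which can then be opened (or fetched) to check the job status.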
Joined: 18 Dec 15 | Posts: 1686 | Credit: 100,373,366 | RAC: 102,139
> A while ago Crystal Pellet pointed out a method to check the success of an ATLAS job using the PandaID.

I guess that was on the occasion of my posting bringing up the problem that only 2 or 3 out of 10 finished ATLAS tasks showed a HITS file in the stderr. Following Crystal Pellet's advice and checking with Panda, I noticed that ALL of my uploaded ATLAS results had a HITS file. From this, I figured that the information shown in the stderr is rather unreliable, not to say unusable, for whatever reason ... ???
Joined: 13 Apr 18 | Posts: 443 | Credit: 8,438,885 | RAC: 0
> So, I assume that you should re-program your script and check the relation of results per host with and without HITS files. And check if the latest results have a HITS file.

Thank you for your suggestion. I realize now the script is deficient as you describe, and I will make the necessary adjustments before proceeding. Thanks also for pointing out that sometimes it is not the fault of the host. I wasn't aware of that fact.

> I would be interested to check my hosts periodically about the HITS file theme, what would be necessary to run your script for me?

Although you check your hosts periodically, there are perhaps many others who do not. I am also confident that there are numerous participants who don't know about the problem. I apologize for hurting your feelings, but I believe that over time my project will help many volunteers and that they will be grateful.

> Can I see more details about my hosts you have found out?

I have already revealed to you everything I know about your hosts. I don't have a "back door" into LHC's database or into your hosts, if that is what you mean. The script does only what any volunteer can do manually: it fetches the publicly available web pages for your host(s) and extracts info pertaining to your ATLAS results, info that any volunteer with an LHC@home account can access.

BTW, my plan has always been to run the script once and only once for each participant, and to contact participants one time only. I will not be re-examining anybody's results to see if they have made corrections to their hosts. There will be no regular reminders or anything of that nature. It's a once-then-off effort.
Joined: 13 Apr 18 | Posts: 443 | Credit: 8,438,885 | RAC: 0
> A while ago Crystal Pellet pointed out a method to check the success of an ATLAS job using the PandaID...

Thanks. I will definitely use that method.
Joined: 1 Sep 04 | Posts: 52 | Credit: 11,152,883 | RAC: 3,351
The change from 50% to 70% of RAM when the CPU is in use had no effect. Still got the HITS error.

My next thought was to modify the app_config.xml. That file doesn't seem to exist on my computer. If I should create it, where should I put it? Even though I worked around it (as you'll see below), I'd like to know for future reference where it is/should be.

Meanwhile I modified the # of CPUs from 2 to 3. Task 201884680 did not have the error! I'm pretty sure that's the first good one I've had. I'd suspended 2 other tasks before they started to see how that one came out, and have now released them. If they work, then the problem is definitely solved!

Thanks all, but especially bronco, for the help.
Joined: 15 Jun 08 | Posts: 2386 | Credit: 222,933,564 | RAC: 137,738
> The change from 50% to 70% of RAM when the CPU is in use had no effect. Still got the HITS error.

This can have an influence on the "postponed ..." error but not on the HITS error.

> My next thought was to modify the app_config.xml.

That's the right way. Create it in "\<basic_boinc_folder>\projects\lhcathome.cern.ch_lhcathome". This is an example:

<app_config>
   <app>
      <name>ATLAS</name>
      <max_concurrent>1</max_concurrent>
   </app>
   <app_version>
      <app_name>ATLAS</app_name>
      <plan_class>vbox64_mt_mcore_atlas</plan_class>
      <avg_ncpus>2.0</avg_ncpus>
      <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
   </app_version>
</app_config>

To activate it (this doesn't change already running VMs), run "reload config files" from your BOINC GUI.

> Meanwhile I modified the # of CPUs from 2 to 3. Task 201884680 did not have the error!

The default RAM setting for your VM is calculated by the server based on the number of cores. Recent tasks obviously need more RAM to expand the EVNT files than is set for 1-core or 2-core setups. 4800MB should be enough, so you may stay at a 2-core setup, which is more efficient than a setup with more cores.
Joined: 13 Apr 18 | Posts: 443 | Credit: 8,438,885 | RAC: 0
> Task 201884680 did not have the error! I'm pretty sure that's the first good one I've had.

True, it does not have the "Non-zero return code from EVNTtoHITS (65) (Error code 65)", but unfortunately it also does not have the line that indicates it created the HITS file, which looks similar to:

2018-07-19 17:12:45 (5688): Guest Log: -rw------- 1 atlas01 atlas01 139559066 Jul 19 17:08 HITS.14568781._033697.pool.root.1

The dates and sizes in that line will be different for every result, but the HITS filename at the end should (I think) be very similar, differing only in the numbers between "HITS" and ".pool.root.1". Also, that task (201884680) ran for only 33 minutes, and I am led to believe that tasks that produce HITS files usually require about 2 hours on the fastest CPUs.

Anyway, new info provided in this thread by Yeti, computezrmle and Erich56 convinces me that the above tests are not reliable. It seems that if BOINC and/or VBox and/or other involved processes get too busy, they fail to log properly to stderr output. See this post regarding the PandaID for what looks to be the most reliable test. BTW, the tasks I mentioned in my PM to you fail the Panda test too. I will be revamping my script to use the PandaID test instead of the less reliable tests.
Joined: 1 Sep 04 | Posts: 52 | Credit: 11,152,883 | RAC: 3,351
Thanks, bronco. Sigh. I let the 2 still-running tasks complete just in case, but I've changed my preferences to turn off ATLAS. If you or anybody else has any thoughts on what I might do to solve the problem, please pass them along and I'll give it another shot.
©2024 CERN