Message boards : ATLAS application : Non-zero return code from EVNTtoHITS (65) (Error code 65)

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1461
Credit: 77,713,187
RAC: 88,933
Message 36061 - Posted: 26 Jul 2018, 15:18:58 UTC - in response to Message 36060.  

CPU usage is not the only variable that can be configured via the client GUI.
Others are RAM usage, SWAP usage ...

In addition, you may consider using some of the logging flags mentioned in the BOINC documentation:
http://boinc.berkeley.edu/wiki/client_configuration

Last but not least the BOINC client occasionally does odd things that sometimes disappear after a reboot.
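
A minimal sketch of how the logging flags mentioned above could be switched on, assuming the flag names listed on the linked client configuration page (cpu_sched, cpu_sched_debug, rr_simulation, mem_usage_debug). Which of them actually help with idle cores is an assumption, and the helper name build_cc_config is made up for illustration. The generated text would go into cc_config.xml in the BOINC data directory.

```python
import xml.etree.ElementTree as ET

# Flag names as documented for cc_config.xml; the selection is an assumption.
SCHEDULER_FLAGS = ("cpu_sched", "cpu_sched_debug", "rr_simulation", "mem_usage_debug")


def build_cc_config(flags=SCHEDULER_FLAGS):
    """Return the text of a cc_config.xml that enables the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # A value of 1 turns the corresponding event log messages on.
        ET.SubElement(log_flags, name).text = "1"
    ET.indent(root)  # pretty-printing, needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config())
```

After saving the file, Options -> Read config files in BOINC Manager should make the client pick it up without a restart.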
ID: 36061
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 406
Credit: 96,116,916
RAC: 1
Message 36062 - Posted: 26 Jul 2018, 15:39:09 UTC

dduggan47 wrote:
What happens is that 1 single core LHC project starts up and nothing else leaving 7 cores sitting on their hands. That's the behavior I don't understand. My little machine is not being fully utilized by BOINC in general or LHC in particular and the latter seems to be the bottleneck for some reason.

The BOINC client isn't good at handling multicore apps. You have to do a lot of micromanagement to get things running together the way you want. It is not a bad idea to run only one kind of multicore project, e.g. ATLAS.


Supporting BOINC, a great concept !
ID: 36062
dduggan47

Joined: 1 Sep 04
Posts: 47
Credit: 4,751,479
RAC: 0
Message 36063 - Posted: 26 Jul 2018, 16:08:03 UTC - in response to Message 36062.  

dduggan47 wrote:
What happens is that 1 single core LHC project starts up and nothing else leaving 7 cores sitting on their hands. That's the behavior I don't understand. My little machine is not being fully utilized by BOINC in general or LHC in particular and the latter seems to be the bottleneck for some reason.

The BOINC client isn't good at handling multicore apps. You have to do a lot of micromanagement to get things running together the way you want. It is not a bad idea to run only one kind of multicore project, e.g. ATLAS.


Yup, I think you hit it. I continued to experiment and found that the behavior I described is not consistent. It consistently prevents more than 2 cores' worth of LHC projects from running, but after suspending and unsuspending projects or tasks often enough, BOINC suddenly started letting other tasks run on the idle cores. Just for grins I renamed the app_config.xml file, and suddenly it runs all the LHC tasks it can.

I'm pretty near the end of the time I'm going to spend on this (unless I or somebody else has a new thought on it). I'm probably going to give an ATLAS task one more shot and if it doesn't work I'll just disable ATLAS. All the others seem to run fine.

Thanks again to all who have taken their own time to try to educate me on this and help me figure it out. It's a great community!

- Dick
ID: 36063
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 406
Credit: 96,116,916
RAC: 1
Message 36064 - Posted: 26 Jul 2018, 16:12:04 UTC - in response to Message 36063.  

I'm pretty near the end of the time I'm going to spend on this (unless I or somebody else has a new thought on it). I'm probably going to give an ATLAS task one more shot and if it doesn't work I'll just disable ATLAS. All the others seem to run fine.

Perhaps you will find some hints in my checklist to get it working: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=29359#29359


Supporting BOINC, a great concept !
ID: 36064
dduggan47

Joined: 1 Sep 04
Posts: 47
Credit: 4,751,479
RAC: 0
Message 36066 - Posted: 26 Jul 2018, 16:15:22 UTC - in response to Message 36064.  

Thanks, Yeti. You're right, I'll do that. I've looked at it before but haven't gone back to it in a while.
ID: 36066
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,566,282
RAC: 853
Message 36067 - Posted: 26 Jul 2018, 16:44:22 UTC - in response to Message 36063.  

I've found that changing the number of cores allocated to any task or project introduces this behaviour. I recently had this very issue, because I was messing with an app_config.xml. "Never touch a running system" really seems to hold some truth.

Here are my amateur remarks:
The problem is particularly rampant when BOINC still holds some tasks that were requested before the changes were introduced.
Tasks that are running or suspended at the time of transition are especially affected, but not exclusively.
Rebooting or deleting single tasks does not help. Tasks will run in parallel for a while, but sooner or later it's down to 1 task again.

Basically, set "No new tasks" and run through your buffer of previously downloaded tasks (so as not to give up on your WUs). It might be fixed at that point. Still, to err on the safe side, reset all projects. Reboot for good measure. Allow new tasks.

I did some version of the above, and it's working now ... or for now, at least.
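
The sequence described above could also be driven from the command line. The following is only a sketch: it assumes boinccmd is on the PATH, that the client is running locally, and that the project is attached under the URL shown (check the Projects tab for the exact one). The helper names project_command and run are made up for illustration.

```python
import subprocess

# Assumption: the URL under which the project is attached in the client.
LHC_URL = "https://lhcathome.cern.ch/lhcathome/"


def project_command(operation, url=LHC_URL):
    """Build a boinccmd call for one project operation (nothing is run here)."""
    allowed = {"nomorework", "allowmorework", "reset", "update"}
    if operation not in allowed:
        raise ValueError(f"unsupported operation: {operation}")
    return ["boinccmd", "--project", url, operation]


def run(operation, url=LHC_URL):
    """Execute the command; needs boinccmd on PATH and a running client."""
    return subprocess.run(project_command(operation, url), check=True)


# Step 1: stop fetching work, then let the downloaded buffer drain.
step_1 = project_command("nomorework")
# Step 2: only once the task list is empty, reset the project.
step_2 = project_command("reset")
# Step 3: after a reboot, allow new tasks again.
step_3 = project_command("allowmorework")
```

Building the argument lists separately from running them makes it easy to print and review each step before anything touches the client.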
ID: 36067
dduggan47

Joined: 1 Sep 04
Posts: 47
Credit: 4,751,479
RAC: 0
Message 36068 - Posted: 26 Jul 2018, 17:02:20 UTC - in response to Message 36067.  

Thanks, AuxRx. Glad to hear it's not just me!
ID: 36068
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36073 - Posted: 26 Jul 2018, 18:50:40 UTC - in response to Message 36068.  

I can confirm what AuxRx said. That's exactly what happens here too. I've been through it numerous times as well, and that's why I run ATLAS and only ATLAS on 2 of my hosts and Theory and only Theory on the third. The only solution seems to be to separate projects by putting them on different physical machines, or to install multiple BOINC clients on one machine, which is doable but tricky on Linux; not sure about Windows.

No shame in walking away from ATLAS. It's a tough nut to crack.
ID: 36073
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 598
Credit: 373,734,563
RAC: 41,187
Message 36075 - Posted: 26 Jul 2018, 22:11:01 UTC - in response to Message 36073.  

You can run multiple clients on Windows too; it's pretty much the same procedure as on Linux.

None of my tasks created HITS files, so I gave up on ATLAS again for a bit. It was working fine, but it is a pain to set up.
ID: 36075
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36077 - Posted: 27 Jul 2018, 1:04:43 UTC - in response to Message 36075.  

You have a Linux box plus several Windows rigs. ATLAS was a pain here too until I ditched VBox and went native. A proper app_config.xml was necessary too. Have you tried that on your Linux rig? Gyllic's sticky guide at the top of this board makes it easy. The only thing I found that didn't work exactly as he said were the su commands, and I assume that's because his instructions are for Debian and I use Ubuntu. I believe Debian and Ubuntu are a bit different in that regard. I worked around it with the sudo command; it was easy.
ID: 36077
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 598
Credit: 373,734,563
RAC: 41,187
Message 36079 - Posted: 27 Jul 2018, 5:31:27 UTC

I didn't try to set it up on my Linux box, as it is one of the weaker computers I have, so I just use it for SixTrack.

I used the app_config with my Windows ones, but since you can configure the settings completely, it again makes it a challenge to configure as required, e.g. the working set is wrong so it blocks other work, etc.
ID: 36079
Profile tazzduke

Joined: 24 Jun 10
Posts: 21
Credit: 1,726,398
RAC: 0
Message 36082 - Posted: 27 Jul 2018, 9:34:01 UTC

Hi All

It seems I am not the only one who is completing workunits that have been marked as valid (at the BOINC level) but with no HITS file present.

This is an extract from my last workunit - https://lhcathome.cern.ch/lhcathome/result.php?resultid=200161943

2018-07-16 05:50:28 (7532): Guest Log: Starting ATLAS job. (PandaID=3983550564 taskID=14530897)
2018-07-16 06:08:08 (7532): Guest Log: log_extracts:
2018-07-16 06:08:08 (7532): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_3444_1531691438/PandaJob/athena_stdout.txt -
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.trfExe.preExecute 2018-07-15 23:57:31,806 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.trfExe.preExecute 2018-07-15 23:57:31,808 INFO Now writing wrapper for substep executor EVNTtoHITS
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2018-07-15 23:57:31,808 INFO Valgrind not engaged
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.trfExe.preExecute 2018-07-15 23:57:31,808 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.trfExe.execute 2018-07-15 23:57:31,808 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.trfExe.execute 2018-07-16 00:05:00,192 INFO EVNTtoHITS executor returns 139
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.trfExe.validate 2018-07-16 00:05:01,628 ERROR Validation of return code failed: EVNTtoHITS got a SIGSEGV signal (exit code 139) (Error code 65)
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.trfExe.validate 2018-07-16 00:05:01,679 INFO Scanning logfile log.EVNTtoHITS for errors
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.transform.execute 2018-07-16 00:05:01,724 CRITICAL Transform executor raised TransformValidationException: EVNTtoHITS got a SIGSEGV signal (exit code 139); Long ERROR message at line 1783 (see jobReport for further details)
2018-07-16 06:08:08 (7532): Guest Log: PyJobTransforms.transform.execute 2018-07-16 00:05:05,645 WARNING Transform now exiting early with exit code 65 (EVNTtoHITS got a SIGSEGV signal (exit code 139); Long ERROR message at line 1783 (see jobReport for further details))
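
The two numbers in the log above mean different things: 65 is the transform's own exit code, while 139 follows the usual shell convention of 128 plus the signal number, so 139 - 128 = 11, which is SIGSEGV. A small sketch of that arithmetic (the function name describe_exit_code is made up for illustration):

```python
import signal


def describe_exit_code(code):
    """Map a shell-style exit code to a signal name, if it encodes one."""
    if code > 128:
        try:
            # Shells report "killed by signal N" as exit code 128 + N.
            return signal.Signals(code - 128).name
        except ValueError:
            return None  # no signal with that number on this platform
    return None


print(describe_exit_code(139))  # SIGSEGV
```

This only decodes the number. It says nothing about why the process crashed inside the VM.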

I have reset my preferences to MAX# Jobs=1 and MAX# Cores=2, and I also have the app_config.xml file setting 4800 MB for my 2-core workunit.
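
For reference, a sketch of what an app_config.xml matching that description (2 cores, 4800 MB) might look like, generated here with Python so it can be checked. The app name ATLAS and the plan class vbox64_mt_mcore_atlas are assumptions, as is the use of the vboxwrapper options --nthreads and --memory_size_mb in cmdline; compare them with what the client shows for your own tasks before using it. The helper name build_app_config is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(ncpus=2, memory_mb=4800,
                     app_name="ATLAS",                    # assumed app name
                     plan_class="vbox64_mt_mcore_atlas"):  # assumed plan class
    """Return app_config.xml text for one multicore VirtualBox app version."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Number of cores the BOINC scheduler budgets for each task.
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Options handed to the wrapper: threads and RAM given to the VM.
    ET.SubElement(version, "cmdline").text = (
        f"--nthreads {ncpus} --memory_size_mb {memory_mb}"
    )
    ET.indent(root)  # pretty-printing, needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config())
```

Keeping avg_ncpus and --nthreads tied to the same variable avoids the case where BOINC budgets one core count and the VM uses another.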

I might try and start again, by first finding another user who is validating with a hits file who is using Win 7 x64 and seeing which version of VB and BOINC they are using as well.

I might also need to do some rereading as well.

Regards
ID: 36082
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36083 - Posted: 27 Jul 2018, 13:40:45 UTC - in response to Message 36082.  
Last modified: 27 Jul 2018, 13:45:24 UTC

I have reset my preferences to MAX# Jobs=1 and MAX# Cores=2, and I also have the app_config.xml file setting 4800 MB for my 2-core workunit.
The app_config.xml is a tricky thing. If it contains any spelling or syntax errors then it gets ignored. Luckily BOINC checks the spelling and syntax and spits out a warning for us in the event log. So... did you verify that BOINC is able to find the app_config.xml and that it is free of errors?

Verify by opening BOINC Manager and clicking Options -> Read config files, then open the Event log (Tools -> Event log), scroll to the bottom and see if it says "Found app_config.xml". If it doesn't say that, then you created it in the wrong folder. If it does say "Found", then see if any red text follows. If it found errors, it will show those errors in red. If there are no errors, then it is syntactically correct.

I might try and start again, by first finding another user who is validating with a hits file who is using Win 7 x64 and seeing which version of VB and BOINC they are using as well.
I would be suspicious of any anecdotal evidence that version xyz of this or that is buggy. Note that for almost every report that xyz is proven beyond a shadow of a doubt to be buggy you will find an equally credible report from another user saying it works fine for them. If it's recent then assume for now that it likely works well enough and focus instead on double-checking and verifying your settings/preferences and app_config.xml.

edit:
It wouldn't hurt to post your entire app_config.xml here as well as verifying it as described above. BOINC checks for syntax and spelling errors but it doesn't check for logic errors.
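
As an illustration of the difference between a syntax error and a logic error, here is a rough sketch of an offline check. It is not what BOINC itself does. It assumes the file uses avg_ncpus together with a --nthreads option in cmdline, and it only flags the case where those two numbers disagree. The function name check_app_config is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def check_app_config(xml_text):
    """Return a list of problems found; an empty list means none were spotted."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Syntax error: the file is not well-formed XML.
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != "app_config":
        problems.append(f"root element is <{root.tag}>, expected <app_config>")

    for version in root.iter("app_version"):
        ncpus = (version.findtext("avg_ncpus") or "").strip()
        cmdline = version.findtext("cmdline") or ""
        match = re.search(r"--nthreads\s+(\d+)", cmdline)
        if not ncpus or not match:
            continue  # nothing to compare in this block
        try:
            mismatch = float(ncpus) != float(match.group(1))
        except ValueError:
            problems.append(f"avg_ncpus is not a number: {ncpus!r}")
            continue
        if mismatch:
            # Logic error: well-formed, but the two core counts disagree.
            problems.append(f"avg_ncpus={ncpus} but --nthreads={match.group(1)}")
    return problems
```

Reading the file with open("app_config.xml").read() and passing the text to check_app_config would give a quick first pass before asking the client to re-read its config files.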
ID: 36083
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 406
Credit: 96,116,916
RAC: 1
Message 36084 - Posted: 27 Jul 2018, 13:57:46 UTC - in response to Message 36083.  
Last modified: 27 Jul 2018, 13:58:17 UTC

I might try and start again, by first finding another user who is validating with a hits file who is using Win 7 x64 and seeing which version of VB and BOINC they are using as well.

I'm running Win7 x64 with BOINC 7.8.3 and VirtualBox 5.1.30

You can check my results here: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10359162


Supporting BOINC, a great concept !
ID: 36084
Profile tazzduke

Joined: 24 Jun 10
Posts: 21
Credit: 1,726,398
RAC: 0
Message 36087 - Posted: 27 Jul 2018, 14:11:37 UTC - in response to Message 36084.  

Hi Yeti

Thank you. I'm currently emptying my work cache and will then start all over again.

Regards
ID: 36087
Profile tazzduke

Joined: 24 Jun 10
Posts: 21
Credit: 1,726,398
RAC: 0
Message 36088 - Posted: 27 Jul 2018, 14:22:49 UTC

Hi bronco

Yes, BOINC finds it and reads it correctly. I was helped out by user computezrmle in the Number crunching thread.

I have used an app_config file on other projects, e.g. SETI and PrimeGrid. :-)

Waiting for my machine to empty its workcache and going to start with a fresh install of VB and BOINC.

Then we will see how that goes.

Regards
ID: 36088
Profile tazzduke

Joined: 24 Jun 10
Posts: 21
Credit: 1,726,398
RAC: 0
Message 36102 - Posted: 28 Jul 2018, 10:37:01 UTC

Greetings

Just to update, I did a complete reinstall of VB and BOINC.

Did a reset of LHC and then downloaded one workunit to try.

Then as I was watching it, two things I noticed.

1st thing - Nil CPU usage was indicated whilst workunit was running.

2nd thing - I was looking at the events log and nil entries were being recorded, i.e. when you show the VM and press Alt+F2 on a workunit.

So then I checked the stderr of the workunit and it was not a normal valid workunit with a hits file ref.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=203310529

Cheers
ID: 36102
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1461
Credit: 77,713,187
RAC: 88,933
Message 36103 - Posted: 28 Jul 2018, 12:22:52 UTC - in response to Message 36102.  

Strange.
Your stderr.txt doesn't show an obvious error.

Did you notice any firewall issues?
If not, this may be a case for real experts who are able to interpret the logs at bigpanda.cern.ch.



BTW:
It's possible to mark URLs as URLs when you edit your posts.
Just highlight them and press the corresponding button.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=203310529
ID: 36103
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 406
Credit: 96,116,916
RAC: 1
Message 36104 - Posted: 28 Jul 2018, 13:03:28 UTC - in response to Message 36102.  
Last modified: 28 Jul 2018, 13:03:58 UTC

1st thing - Nil CPU usage was indicated whilst workunit was running.

2nd thing - I was looking at the events log and nil entries were being recorded, i.e. when you show the VM and press Alt+F2 on a workunit.

So then I checked the stderr of the workunit and it was not a normal valid workunit with a hits file ref.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=203310529

tazzduke, did you ever take a walk through my checklist? Please, go through it step by step, but especially keep an eye on Number 10

To me it looks as if the VM can't connect to the needed servers during startup ...


Supporting BOINC, a great concept !
ID: 36104
Profile tazzduke

Joined: 24 Jun 10
Posts: 21
Credit: 1,726,398
RAC: 0
Message 36105 - Posted: 28 Jul 2018, 13:49:08 UTC

Greetings Yeti

Thank you, you may have given me a lightbulb moment. I think I need to change to a different PC to handle this.

The PC currently used to run these tasks also has programs that run in the background to keep it child safe.

So I will switch over to a spare QUAD Core Win 7 x64 system, that just runs BOINC and does nothing else.

Then I will reread step 10 and the how-tos on opening up said ports.

Cheers.
ID: 36105

©2020 CERN