Message boards : ATLAS application : what's the average share of finished tasks with hits created?

Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35167 - Posted: 4 May 2018, 14:27:08 UTC

I have looked up my finished ATLAS tasks of the past 2 weeks and found out that less than a third of them had hits.
I may be wrong, but I seem to remember that when I checked this some time ago, the share was somewhat higher.
Has anyone ever worked out the average share of finished tasks with hits?
ID: 35167
Harri Liljeroos

Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 35168 - Posted: 4 May 2018, 21:17:01 UTC

I checked the two hosts that I have running ATLAS. One had 18 finished tasks, all with the HITS file. The other has about 120 finished tasks; I checked about 60-70 of them and found two without the HITS file.
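
As a rough illustration of the shares being compared in this thread, here is a minimal Python sketch. The function name `hits_share` is hypothetical, and 60 and 70 are simply the two ends of the "about 60-70" range mentioned above, not exact counts.

```python
def hits_share(checked: int, without_hits: int) -> float:
    """Return the percentage of checked tasks that produced a HITS file."""
    if checked <= 0:
        raise ValueError("checked must be positive")
    if not 0 <= without_hits <= checked:
        raise ValueError("without_hits must be between 0 and checked")
    return 100.0 * (checked - without_hits) / checked


# First host: 18 finished tasks, all with a HITS file.
print(f"host 1: {hits_share(18, 0):.1f}%")

# Second host: about 60-70 tasks checked, two without a HITS file.
# Both ends of the range are tried because the exact count is not given.
for checked in (60, 70):
    print(f"host 2 ({checked} checked): {hits_share(checked, 2):.1f}%")
```

Whichever end of the range is closer to the truth, the share on the second host stays far above the "less than a third" reported in the first post.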
ID: 35168
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,144,248
RAC: 105,364
Message 35169 - Posted: 5 May 2018, 5:41:45 UTC

David started a thread, "Information on ATLAS tasks", in March 2017 and explained there when a task does or does not produce a HITS file.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4178
ID: 35169
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35170 - Posted: 5 May 2018, 6:08:51 UTC

Thanks for the information.

This really makes me wonder what could be going wrong with the processing of my ATLAS tasks.
Particularly since David wrote "Therefore a truly successful WU must have a valid HITS file produced, however you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure".

So if not even a third of my finished ATLAS tasks contain a HITS file, there may be a problem somewhere, right?
ID: 35170
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,955,082
RAC: 136,930
Message 35172 - Posted: 5 May 2018, 7:13:11 UTC

Erich56,

IIRC you explained somewhere in an old post why you can't upgrade that Win XP to a recent OS.
Is this still valid?
If not, you may consider upgrading the OS as well as your VirtualBox (-> to 5.1.36, not yet 5.2.x as that may also cause problems).

Your logs show that you allocate 20 GB RAM per VM.
Is this a typo?
If not, what is it good for?
2 of those VMs running concurrently would request more RAM than is installed and thus force the rest of your system to make heavy use of your swap/pagefile.
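
A minimal sketch of that memory check in Python. It assumes the 32 GB of installed RAM that is mentioned in a later post of this thread; the function name `vm_memory_overcommit_gb` is hypothetical, and the sketch ignores whatever the host OS and other applications need for themselves.

```python
def vm_memory_overcommit_gb(installed_gb: float, per_vm_gb: float, vm_count: int) -> float:
    """Return how many GB the VMs request beyond installed RAM (0.0 if they fit)."""
    requested = per_vm_gb * vm_count
    return max(0.0, requested - installed_gb)


INSTALLED_GB = 32  # assumption: figure taken from a later post in this thread
PER_VM_GB = 20     # per-VM allocation seen in the logs

for vms in (1, 2):
    over = vm_memory_overcommit_gb(INSTALLED_GB, PER_VM_GB, vms)
    status = f"over by {over:.0f} GB" if over else "fits"
    print(f"{vms} VM(s): {status}")
```

In practice the margin is tighter than this shows, because the host itself also needs memory.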
ID: 35172
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35173 - Posted: 5 May 2018, 7:58:49 UTC

@computezrmle: good points :-)

1) yes, the OS still is WinXP, because I am running GPUGRID with two GTX980ti, and any OS beyond XP increases the GPU processing time by about 20%, due to the WDDM overhead in the newer OSs.
In fact, GPUGRID had announced some time ago that XP support will end by April 2018, so I was expecting to need to upgrade the machine to Win10 anyway. However, so far they still support XP.

2) the 20 GB RAM is not a typo; I recently increased it in the app_config.xml because I had a few 3-core ATLAS tasks which, for whatever reason, ran short of the 9 GB RAM set before. Obviously, these tasks were somehow faulty, because console 3 showed 6 athena.py processes (instead of 3). Hence, to prevent any other such strange task from running out of memory after many hours of processing time, I increased the RAM allocation to 20 GB, which I thought should not be too much of a problem with 32 GB RAM available.
However, I have not received such a strange task for a while now, so I guess I could reduce the allocated RAM to "normal" values.

What I plan to try is to download and process only 1-core ATLAS tasks from now on. Maybe the problem lies with the core allocation, who knows. I will see what happens.
If this does not help, I might upgrade the VB to 5.1.36 as you suggested (any proof that this one runs well with XP?)
ID: 35173
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,955,082
RAC: 136,930
Message 35176 - Posted: 5 May 2018, 8:39:17 UTC - in response to Message 35173.  

... Obviously, these tasks were somehow faulty, because console 3 showed 6 athena.py (instead of 3). ... Maybe the problem lies with the core allocation ...

This is a configuration error that occurred occasionally in the past.
Nothing you can solve locally but you may post a comment in the MB to make the CERN team aware.

... I might upgrade the VB to 5.1.36 as suggested by you (any proof that this one runs well with XP?)

No guarantee as it is "brand new".
I'm trying it on one of my hosts to see if some nasty things that occur with 5.2.x disappear.
ID: 35176
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35178 - Posted: 5 May 2018, 12:07:44 UTC

Since the most recently finished 3-core task
https://lhcathome.cern.ch/lhcathome/result.php?resultid=188892823
also did not produce a HITS file, I have now changed the settings to 1-core tasks. We'll see what happens.

Meanwhile, I also checked quite a number of ATLAS tasks of several other crunchers and found out that about 95% of them had HITS-files (the way it should be). Which makes clear to me that something is going wrong here.

Theoretically, I could sit back and say that as long as I also get credits for all these tasks that do not contain HITS files, it doesn't matter to me.
However, I participate in this project in order to contribute valuable results. Getting credits is nice, of course, but it's not the main reason for participating.
ID: 35178
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 35179 - Posted: 5 May 2018, 14:16:38 UTC - in response to Message 35173.  
Last modified: 5 May 2018, 14:17:55 UTC

1) yes, the OS still is WinXP, because I am running GPUGRID with two GTX980ti, and any OS beyond XP increases the GPU processing time by about 20%, due to the WDDM overhead in the newer OSs.
In fact, GPUGRID had announced some time ago that XP support will end by April 2018, so I was expecting to need to upgrade the machine to Win10 anyway. However, so far they still support XP.
Is MS still providing security updates for XP? Is this WDDM thing also true for Linux? If not, you might consider running Linux with the native ATLAS app, which works like a charm and needs much less RAM, as long as you don't stop and restart the running tasks (you will still produce valid results, but if you restart them they start all over again, I think). For GPU performance you can install the proprietary NVIDIA driver for Linux.
ID: 35179
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35180 - Posted: 5 May 2018, 16:39:55 UTC - in response to Message 35179.  

... Is this WDDM thing also true for Linux? Otherwise you might consider running Linux with the native ATLAS app
The WDDM "brake" does NOT apply to Linux; however, the drawback of Linux, as far as GPUGRID crunching is concerned, is that SWANsync cannot be set. Thus, the full CPU power cannot be applied :-(
ID: 35180
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35181 - Posted: 6 May 2018, 11:36:58 UTC - in response to Message 35178.  
Last modified: 6 May 2018, 12:06:32 UTC

Before, I wrote:
... I now changed the settings for 1-core tasks. We'll see what happens.
Three 1-core tasks just finished after about 23 hours each (200 events each). Unfortunately, again no HITS file. 70 hours of crunching time for nothing :-(

The tasks can be seen here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=188979805
https://lhcathome.cern.ch/lhcathome/result.php?resultid=188970907
https://lhcathome.cern.ch/lhcathome/result.php?resultid=189002333

Is any of the experts able to tell from the stderr what the problem could be?
ID: 35181
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 35206 - Posted: 9 May 2018, 12:19:29 UTC

All one-core tasks on my main Linux box end with a HITS file. All two-core tasks on the Windows 10 PC end and validate with no HITS file. The Linux box has 8 GB RAM, the Windows 10 PC 22 GB.
Tullio
ID: 35206
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,144,248
RAC: 105,364
Message 35207 - Posted: 9 May 2018, 13:58:58 UTC - in response to Message 35206.  

Tullio,
if you change back to VirtualBox 5.1.26 (the version from boinc.berkeley.edu), do you still get no HITS file on Windows?
ID: 35207
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 35220 - Posted: 10 May 2018, 13:41:05 UTC - in response to Message 35207.  

To help people understand what is happening with the problems encountered (missing HITS files, ...), I chose several graphs from the dashboard.
All plots cover a sliding period of one week.

Time evolution of success and failure jobs: [graph not shown]
Pie graph in percentage: [graph not shown]
Pie graph of the causes of failure jobs sorted by exitcodes: [graph not shown]
More detailed exitcodes by number of cores: [graph not shown]
More detailed transformation exitcodes by number of cores: [graph not shown]
Failed jobs by number of cores: [graph not shown]

Observations:
If someone knows the meaning of the exit codes, please share it with the other people on the forum.
They could be sorted into server-side failures and BOINC client failures, so that volunteers can repair their host(s) if the trouble comes from there.
What is the difference between execution and transformation (for exit codes)?

For tullio's host, the logs reveal exit code 65:
2018-05-07 01:54:41 (7388): Guest Log: ATHENA_PROC_NUMBER=2
2018-05-07 01:54:41 (7388): Guest Log: Starting ATLAS job. (PandaID=3920314561 taskID=13910415)
2018-05-07 02:05:02 (7388): Guest Log: log_extracts:
2018-05-07 02:05:02 (7388): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_3446_1525650883/PandaJob/athena_stdout.txt -
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.preExecute 2018-05-07 01:57:19,134 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.preExecute 2018-05-07 01:57:19,136 INFO Now writing wrapper for substep executor EVNTtoHITS
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2018-05-07 01:57:19,136 INFO Valgrind not engaged
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.preExecute 2018-05-07 01:57:19,137 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.execute 2018-05-07 01:57:19,137 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.execute 2018-05-07 02:02:29,204 INFO EVNTtoHITS executor returns 65
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.validate 2018-05-07 02:02:30,113 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.validate 2018-05-07 02:02:30,149 INFO Scanning logfile log.EVNTtoHITS for errors
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.transform.execute 2018-05-07 02:02:30,438 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr     FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.transform.execute 2018-05-07 02:02:33,655 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr     FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
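
The return code and the fatal message can be pulled out of such a log excerpt with a few lines of Python. This is only a sketch: the function name `summarize_stderr` is hypothetical, and both regular expressions are tailored to the lines shown above, so other tasks may well log differently.

```python
import re

# Patterns modelled on the log lines above; they are assumptions, not a spec.
RETURN_CODE_RE = re.compile(r"EVNTtoHITS executor returns (\d+)")
FATAL_RE = re.compile(r'\bFATAL\s+([^"]+)"')


def summarize_stderr(text):
    """Return (return_code, fatal_message); each is None if not found."""
    code = None
    fatal = None
    for line in text.splitlines():
        if code is None:
            match = RETURN_CODE_RE.search(line)
            if match:
                code = int(match.group(1))
        if fatal is None:
            match = FATAL_RE.search(line)
            if match:
                fatal = match.group(1).strip()
    return code, fatal


# Two lines copied from the excerpt above.
SAMPLE = """\
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.execute 2018-05-07 02:02:29,204 INFO EVNTtoHITS executor returns 65
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.transform.execute 2018-05-07 02:02:30,438 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr     FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
"""

code, fatal = summarize_stderr(SAMPLE)
print(f"return code: {code}")
print(f"fatal message: {fatal}")
```

Note that simply searching a log for the word "HITS" is not a reliable success test, because the step name EVNTtoHITS appears in failing logs as well.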

For Erich's host there is no exit code, so the problem may be different.
And sorry, but a graph showing the results divided by VirtualBox version doesn't exist.
(It would show whether the choice of VirtualBox version influences the results, in case someone has doubts about it.)
ID: 35220
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 35232 - Posted: 11 May 2018, 18:28:18 UTC - in response to Message 35207.  

Tullio,
if you change back to VirtualBox 5.1.26 (the version from boinc.berkeley.edu), do you still get no HITS file on Windows?

I am using VirtualBox 5.2.10 on all PCs, both Linux and Windows. The only difference is that on the 2 Linux boxes I am using only one core out of two. My AMD A10-6700 in the Windows 10 PC is sold as a 4-core CPU, but the Windows Task Manager says it has 2 cores and 4 logical processors. I am running the two-core ATLAS tasks on it.
Tullio
ID: 35232
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 35268 - Posted: 15 May 2018, 14:35:41 UTC - in response to Message 35232.  

A good answer to the original question is given by the pie chart posted by Philippe, which shows that roughly 8% of tasks fail. The majority of those tasks fail at the start so in terms of wasted CPU time it's a lot less than 8%.
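
To illustrate why an 8% failure rate by task count can mean much less than 8% of CPU time, here is a small Python sketch. Apart from the 8% failure rate, every number in it is made up for illustration (100 tasks, 10 CPU-hours per full run, most failures after 10 minutes), and the function name `wasted_cpu_share` is hypothetical.

```python
def wasted_cpu_share(runs):
    """runs: iterable of (cpu_hours, succeeded). Return % of CPU time spent on failed tasks."""
    runs = list(runs)
    total = sum(hours for hours, _ in runs)
    wasted = sum(hours for hours, ok in runs if not ok)
    return 100.0 * wasted / total if total else 0.0


# Hypothetical workload of 100 tasks, 8 of which fail.
runs = (
    [(10.0, True)] * 92          # successful tasks, 10 CPU-hours each
    + [(10 / 60, False)] * 6     # failures right at the start (10 minutes each)
    + [(10.0, False)] * 2        # failures only at the very end
)

failed = sum(1 for _, ok in runs if not ok)
print(f"failed tasks:    {100 * failed / len(runs):.0f}%")
print(f"wasted CPU time: {wasted_cpu_share(runs):.1f}%")
```

The more the failures cluster at the start of a task, the further the CPU-time share drops below the task-count share.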

The "makePool failed" error is normally because there is not enough memory in the VM. We saw this a lot before with 2-core tasks and so I wonder if we need to increase the memory limits for the current tasks. Unfortunately task logs are deleted from the server after one week so none of the previous links in this thread are working any more.
ID: 35268
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35309 - Posted: 19 May 2018, 16:59:20 UTC - in response to Message 35167.  

I wrote before:
I have looked up my finished ATLAS tasks of the past 2 weeks and found out that less than a third of them had hits.
One of the recommendations was to change to a newer VB version.
This I did yesterday: I replaced 5.1.6 with 5.1.38.
However, the first four 2-core tasks I processed with the new VB version again did NOT yield a HITS-file :-(
The next four are still running.
But I guess I can say already now that updating the VB version did not help to solve the problem :-(
ID: 35309
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 35312 - Posted: 19 May 2018, 20:22:53 UTC - in response to Message 35309.  

I wrote before:
I have looked up my finished ATLAS tasks of the past 2 weeks and found out that less than a third of them had hits.
One of the recommendations was to change to a newer VB version.
This I did yesterday: I replaced 5.1.6 with 5.1.38.
However, the first four 2-core tasks I processed with the new VB version again did NOT yield a HITS-file :-(
The next four are still running.
But I guess I can say already now that updating the VB version did not help to solve the problem :-(


Not sure why the 'HITS' and other files are not mentioned in your stderr outputs, but it seems you are returning valid results, judging by the CPU times and the results from Erich56@gpugrid:

https://bigpanda.cern.ch/jobs/?computingsite=BOINC_MCORE&modificationhost=Erich56@gpugrid&hours=12&jobstatus=finished&mode=nodrop&display_limit=100

You could still try a real upgrade of your VBox version to 5.2.12
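
For anyone who wants to adapt the BigPanda link above to their own host, this Python sketch splits its query string into the individual filters using only the standard library. The helper name `query_filters` is hypothetical, and reading the parameter names at face value (site, host, time window in hours, job status) is an assumption rather than documented behaviour.

```python
from urllib.parse import parse_qs, urlsplit

URL = (
    "https://bigpanda.cern.ch/jobs/?computingsite=BOINC_MCORE"
    "&modificationhost=Erich56@gpugrid&hours=12&jobstatus=finished"
    "&mode=nodrop&display_limit=100"
)


def query_filters(url):
    """Return the query parameters of a URL as a dict (first value per key)."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


for key, value in query_filters(URL).items():
    print(f"{key} = {value}")
```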
ID: 35312
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35314 - Posted: 19 May 2018, 20:35:50 UTC - in response to Message 35312.  

You could still try a real upgrade of your VBox version to 5.2.12
Hm, I seem to remember some mention here in the forum that 5.2 causes some kind of problems (whatever exactly).
ID: 35314
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,396,059
RAC: 102,184
Message 35317 - Posted: 20 May 2018, 11:34:18 UTC - in response to Message 35312.  

I wrote before:
I have looked up my finished ATLAS tasks of the past 2 weeks and found out that less than a third of them had hits.
One of the recommendations was to change to a newer VB version.
This I did yesterday: I replaced 5.1.6 with 5.1.38.
However, the first four 2-core tasks I processed with the new VB version again did NOT yield a HITS-file :-(
The next four are still running.
But I guess I can say already now that updating the VB version did not help to solve the problem :-(

Crystal Pellet wrote:

Not sure why the 'HITS' and other files are not mentioned in your stderr outputs, but it seems you are returning valid results, judging by the CPU times and the results from Erich56@gpugrid:

https://bigpanda.cern.ch/jobs/?computingsite=BOINC_MCORE&modificationhost=Erich56@gpugrid&hours=12&jobstatus=finished&mode=nodrop&display_limit=100

You could still try a real upgrade of your VBox version to 5.2.12
Now the remaining 2 tasks have finished, plus another 4 since yesterday.
So of the total of 8 tasks after the change from VB version 5.1.6 to 5.1.38, not even one produced a HITS file :-(

And yes, all these tasks are seen as "valid" and I receive credit points for them. However, I am not sure at all whether they are of any value for the project.
Can anyone tell me for sure whether such tasks are good for the project or not? I would hate to spend my CPU capacity on simply nothing.
ID: 35317