what's the average share of finished tasks with hits created?
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
I have looked up my finished ATLAS tasks of the past 2 weeks and found that less than a third of them had hits. I may be wrong, but I seem to remember that when I checked this some time ago, the share was somewhat higher. Has anyone ever worked out the average share of finished tasks with hits? |
Send message Joined: 28 Sep 04 Posts: 722 Credit: 48,467,813 RAC: 27,570 |
I checked the two hosts that I have running ATLAS. One had 18 finished tasks, all with the hits file. The other one has about 120 finished tasks; I checked about 60-70 of them and found two without the hits file. |
Send message Joined: 2 May 07 Posts: 2230 Credit: 173,874,515 RAC: 16,668 |
David wrote a thread "Information on ATLAS tasks" in March 2017 and explained when a HITS file is produced and when it is not. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4178 |
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
Thanks for the information. This really makes me wonder what could be going wrong with the processing of my ATLAS tasks, particularly since David wrote "Therefore a truly successful WU must have a valid HITS file produced, however you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure". So if not even a third of my finished ATLAS tasks contain a HITS file, there may be a problem somewhere, right? |
Send message Joined: 15 Jun 08 Posts: 2521 Credit: 252,724,114 RAC: 137,544 |
Erich56, IIRC you explained somewhere in an old post why you can't upgrade that Win XP to a recent OS. Is this still valid? If not, you might consider upgrading the OS as well as your VirtualBox (-> to 5.1.36, not yet 5.2.x as that may also cause problems). Your logs show that you allocate 20 GB RAM per VM. Is this a typo? If not, what is it good for? Two of those VMs running concurrently would request more RAM than is installed and thus force the rest of your system to make heavy use of your swap/pagefile. |
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
@computezrmle: good points :-) 1) Yes, the OS is still WinXP, because I am running GPUGRID with two GTX980ti, and any OS beyond XP increases the GPU processing time by about 20%, due to the WDDM overhead in the newer OSs. In fact, GPUGRID announced some time ago that XP support would end by April 2018, so I was expecting to have to upgrade the machine to Win10 anyway. However, so far they still support XP. 2) The 20 GB RAM is not a typo; I recently increased it in the app_config.xml because I had a few 3-core ATLAS tasks which, for whatever reason, ran out of the 9 GB RAM set before. Obviously, these tasks were somehow faulty, because console 3 showed 6 athena.py (instead of 3). Hence, to prevent any other such strange task from running out of memory after many hours of processing time, I increased the RAM allocation to 20 GB, which I thought should not be too much of a problem with 32 GB RAM available. However, I have not received such a strange task for a while now, so I guess I could reduce the allocated RAM to "normal" values. What I plan to try from now on is to download and process 1-core ATLAS tasks. Maybe the problem lies with the core allocation, who knows. I will see what happens. If this does not help, I might upgrade the VB to 5.1.36 as suggested by you (any proof that this one runs well with XP?) |
Send message Joined: 15 Jun 08 Posts: 2521 Credit: 252,724,114 RAC: 137,544 |
"... Obviously, these tasks were somehow faulty, because console 3 showed 6 athena.py (instead of 3). ... Maybe the problem lies with the core allocation ..." This is a configuration error that occurred occasionally in the past. It is nothing you can solve locally, but you may post a comment in the MB to make the CERN team aware. "... I might upgrade the VB to 5.1.36 as suggested by you (any proof that this one runs well with XP?)" No guarantee, as it is "brand new". I'm trying it on one of my hosts to see if some nasty things that occur with 5.2.x disappear. |
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
Since the most recently finished 3-core task https://lhcathome.cern.ch/lhcathome/result.php?resultid=188892823 also did not produce a HITS file, I have now changed the settings to 1-core tasks. We'll see what happens. Meanwhile, I also checked quite a number of ATLAS tasks of several other crunchers and found that about 95% of them had HITS files (the way it should be). This makes it clear to me that something is going wrong here. Theoretically, I could sit back and say that as long as I also get credits for all these tasks that do not contain HITS files, it doesn't matter to me. However, I participate in this project in order to contribute valuable results. Getting credits is nice, of course, but it's not the main reason for participating. |
Send message Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
"1) yes, the OS still is WinXP, because I am running GPUGRID with two GTX980ti, and any OS beyond XP increases the GPU processing time by about 20%, due to the WDDM overhead in the newer OSs." Is MS still providing security updates for XP? Is this WDDM thing also true for Linux? Otherwise you might consider running Linux with the native ATLAS app, which is working like a charm and needs much less RAM (as long as you don't stop and restart the running tasks: you will still produce valid results, but if you restart them they start all over again, I think). For GPU performance you can install the proprietary NVIDIA driver for Linux. |
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
"... Is this WDDM thing also true for Linux? Otherwise you might consider running Linux with the native ATLAS app" The WDDM "brake" is NOT true for Linux; however, the drawback with Linux, as far as GPUGRID crunching is concerned, is that no SWANsync can be set. Thus, the full CPU power cannot be applied :-( |
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
Before, I wrote: "... I now changed the settings for 1-core tasks. We'll see what happens." Three 1-core tasks just finished after about 23 hours (200 events each). Unfortunately, again no HITS file. 70 hours of crunching time for nothing :-( The tasks can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=188979805 https://lhcathome.cern.ch/lhcathome/result.php?resultid=188970907 https://lhcathome.cern.ch/lhcathome/result.php?resultid=189002333 Can any of the experts tell from the stderr what the problem could be? |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
All one core tasks on my main Linux box end with HITS file. All two core tasks on the Windows 10 PC end and validate with no HITS file. The Linux box has 8 GB RAM, the Windows 10 22 GB. Tullio |
Send message Joined: 2 May 07 Posts: 2230 Credit: 173,874,515 RAC: 16,668 |
Tullio, if you change back to VirtualBox 5.1.26 (-> boinc.berkeley.edu), do you get the same result without HITS on Windows? |
Send message Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0 |
In order to help people understand what is happening with the problems encountered (missing HITS files, ...), I chose several graphs from the dashboard. All the plots cover a sliding period of one week. [Graphs not reproduced: time evolution of successful and failed jobs; pie chart in percentage; pie chart of the causes of failed jobs sorted by exit code; more detailed exit codes by number of cores; more detailed transformation exit codes by number of cores; failed jobs by number of cores.]
Observations: If someone knows the meaning of the exit codes, please share it with the other people on the forum. The codes could be sorted into server-side failures and BOINC client failures, so that volunteers can repair their host(s) if the trouble comes from their host(s). What is the difference between execution and transformation (for exit codes)?
For tullio's host, the logs reveal exit code 65:
2018-05-07 01:54:41 (7388): Guest Log: ATHENA_PROC_NUMBER=2
2018-05-07 01:54:41 (7388): Guest Log: Starting ATLAS job. (PandaID=3920314561 taskID=13910415)
2018-05-07 02:05:02 (7388): Guest Log: log_extracts:
2018-05-07 02:05:02 (7388): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_3446_1525650883/PandaJob/athena_stdout.txt -
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.preExecute 2018-05-07 01:57:19,134 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.preExecute 2018-05-07 01:57:19,136 INFO Now writing wrapper for substep executor EVNTtoHITS
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2018-05-07 01:57:19,136 INFO Valgrind not engaged
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.preExecute 2018-05-07 01:57:19,137 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.execute 2018-05-07 01:57:19,137 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.execute 2018-05-07 02:02:29,204 INFO EVNTtoHITS executor returns 65
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.validate 2018-05-07 02:02:30,113 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.validate 2018-05-07 02:02:30,149 INFO Scanning logfile log.EVNTtoHITS for errors
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.transform.execute 2018-05-07 02:02:30,438 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.transform.execute 2018-05-07 02:02:33,655 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
For Erich's host, there is no exit code, so the problem may be different. And sorry, but a graph showing the results divided by VirtualBox version doesn't exist. (It would show whether the choice of VirtualBox version has an influence on the results, if someone has doubts about it.) |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
Tullio, I am using VirtualBox 5.2.10 on all PCs, both Linux and Windows. The only difference is that on the 2 Linux boxes I am using only one core out of two. My AMD A10-6700 in the Windows 10 PC is sold as having 4 cores, but the Windows Task Manager says it has 2 cores and 4 logical processors. I am running the two-core ATLAS on it. Tullio |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
A good answer to the original question is given by the pie chart posted by Philippe, which shows that roughly 8% of tasks fail. The majority of those tasks fail at the start so in terms of wasted CPU time it's a lot less than 8%. The "makePool failed" error is normally because there is not enough memory in the VM. We saw this a lot before with 2-core tasks and so I wonder if we need to increase the memory limits for the current tasks. Unfortunately task logs are deleted from the server after one week so none of the previous links in this thread are working any more. |
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
I wrote before: "I have looked up my finished ATLAS tasks of the past 2 weeks and found out that less than a third of them had hits." One of the recommendations was to change to a newer VB version. This I did yesterday: I replaced 5.1.6 with 5.1.38. However, the first four 2-core tasks I processed with the new VB version again did NOT yield a HITS file :-( The next four are still running. But I guess I can already say that updating the VB version did not help to solve the problem :-( |
Send message Joined: 14 Jan 10 Posts: 1413 Credit: 9,435,981 RAC: 6,967 |
"I wrote before: I have looked up my finished ATLAS tasks of the past 2 weeks and found out that less than a third of them had hits. One of the recommendations was to change to a newer VB version." Not sure why the 'HITS' and other files are not mentioned in your stderr outputs, but it seems you are returning valid results, judging by the CPU times and the results from Erich56@gpugrid: https://bigpanda.cern.ch/jobs/?computingsite=BOINC_MCORE&modificationhost=Erich56@gpugrid&hours=12&jobstatus=finished&mode=nodrop&display_limit=100 You could still try a real upgrade of your VBox version to 5.2.12 |
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
"You could still try a real upgrade of your VBox version to 5.2.12" Hm, I seem to remember some mention here in the forum that 5.2 causes some kind of problems (whatever exactly). |
Send message Joined: 18 Dec 15 Posts: 1788 Credit: 117,851,004 RAC: 86,317 |
Now the remaining 2 tasks have finished, plus another 4 since yesterday. I wrote before: "I have looked up my finished ATLAS tasks of the past 2 weeks and found out that less than a third of them had hits. One of the recommendations was to change to a newer VB version." So of the total of 8 tasks after the change from VB version 5.1.6 to 5.1.38, not even one produced a HITS file :-( And yes, all these tasks are seen as "valid" and I receive credit points for them. However, I am not sure at all whether they are of any value for the project. Can anyone tell me for sure whether such tasks are good for the project or not? I would hate to spend my CPU capacity on simply nothing. |
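
For volunteers who want to check their own stderr output for the failure signature discussed in this thread, here is a minimal Python sketch. It only looks for two patterns taken from the log lines quoted above (`executor returns <code>` and `Logfile error in ...`). The function name `summarize_stderr` and the sample text are hypothetical, and real logs may well use other wording, so treat it as a starting point rather than a diagnostic tool.

```python
import re

# Both patterns are assumptions derived only from the log lines quoted in this thread.
RETURN_CODE = re.compile(r"(\w+) executor returns (\d+)")
LOGFILE_ERROR = re.compile(r'Logfile error in [\w.]+: "([^"]+)"')


def summarize_stderr(text):
    """Return (substep, exit_code, error_messages) found in a task's stderr text.

    substep and exit_code are None when no "executor returns" line is present.
    """
    substep, code = None, None
    match = RETURN_CODE.search(text)
    if match:
        substep, code = match.group(1), int(match.group(2))
    # The same error message is repeated on several log lines, so deduplicate.
    errors = sorted(set(LOGFILE_ERROR.findall(text)))
    return substep, code, errors


# Hypothetical sample assembled from two of the quoted log lines.
sample = (
    "2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.trfExe.execute "
    "2018-05-07 02:02:29,204 INFO EVNTtoHITS executor returns 65\n"
    "2018-05-07 02:05:02 (7388): Guest Log: PyJobTransforms.transform.execute "
    "2018-05-07 02:02:30,438 CRITICAL Transform executor raised "
    "TransformValidationException: Non-zero return code from EVNTtoHITS (65); "
    'Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed '
    'for AthMpEvtLoopMgr.SharedEvtQueueProvider"\n'
)

substep, code, errors = summarize_stderr(sample)
print(substep, code)
for message in errors:
    # Per the thread, "makePool failed" normally points to too little VM memory.
    hint = " (possibly not enough VM memory)" if "makePool failed" in message else ""
    print(message + hint)
```

When neither pattern matches, as was reportedly the case for Erich's host, the function returns `(None, None, [])`. That only means this particular signature is absent; it does not show that a HITS file was produced.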