Message boards : ATLAS application : Very long tasks in the queue
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
Who validates the validators? Tullio
Joined: 9 Dec 14 · Posts: 202 · Credit: 2,533,875 · RAC: 0
OK, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM. That is an interesting point. I have checked a couple of my valid results which were running on Linux hosts, and all I have checked had that particular file. I don't know what that file stands for, but it is interesting that with the same host OS (Linux) different numbers of files are being produced and both are valid tasks. Maybe they are different types of tasks (if there are any at LHC at the moment).
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
The validator gives credit if the CPU time or walltime is above a certain amount, even if the task failed. This is so that someone who spends a long time running a task that fails is not penalised. The task is still marked failed and retried higher up the chain. By the way, I finished my first longrunner: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632 11 hours runtime on 4 cores. The credit was around 10 times what I normally get.
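The credit rule described above can be sketched as follows. This is purely illustrative: the threshold value and function names are assumptions, not the actual LHC@home validator code.

```python
# Illustrative sketch of the credit rule described above: grant credit
# when CPU time or walltime exceeds a threshold, even for a failed task.
# The threshold (3600 s) is a hypothetical cutoff, not the real
# LHC@home validator configuration.

MIN_RUNTIME_FOR_CREDIT = 3600.0  # seconds; assumed value

def grant_credit(cpu_time: float, walltime: float, task_ok: bool) -> bool:
    """Credit is granted for a successful task, or for a failed task that
    consumed significant CPU or wall time, so the volunteer is not
    penalised. The failed task itself is still retried up the chain."""
    if task_ok:
        return True
    return max(cpu_time, walltime) >= MIN_RUNTIME_FOR_CREDIT
```

Under this rule a task that fails after only a few seconds earns nothing, while one that fails after hours of computation still pays out.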
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,528,821 · RAC: 2,748
By the way I finished my first longrunner: That's rather fast, but you have hyper-threading switched off. The upload seems to be more than 500 MB (569 MB). The credit-granting mechanism is generous to you: I got 70 cobblestones for a 'normal' 4-core ATLAS task.
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,975,941 · RAC: 125,826
By the way I finished my first longrunner: Really successful (for science)? There is no HITS*.root result file in the log. Edit: A quick crosscheck with my last results shows that only about 70% of the WUs have a HITS* file. Is the presence of such a file still a criterion for a successful scientific result, as mentioned here?
Joined: 2 Sep 04 · Posts: 453 · Credit: 193,576,736 · RAC: 3,063
And now I could finish my first longrunner: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170630 It is exactly 10x the normal runtime, but not exactly 10 times the credit (3,359 versus 420). Supporting BOINC, a great concept!
Joined: 2 Sep 04 · Posts: 453 · Credit: 193,576,736 · RAC: 3,063
This one didn't survive: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170737 Supporting BOINC, a great concept!
Joined: 9 Dec 14 · Posts: 202 · Credit: 2,533,875 · RAC: 0
Success here too: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170634 A little bit less than 10 times the credit of the "normal" tasks.
Joined: 24 Oct 04 · Posts: 1130 · Credit: 49,813,154 · RAC: 7,113
Overnight, 8 longrunners have finished and been successfully validated. Those 4,000+ credit tasks do look nice. Yeti https://lhcathome.cern.ch/lhcathome/results.php?hostid=10359162 Volunteer Mad Scientist For Life
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
Let me explain a little what actually happens in ATLAS tasks. The large input file, called EVNT.*, is a collection of "events". Each event represents a simulated collision of protons inside the ATLAS detector, and this file contains descriptions of the particles produced by these collisions. The production of a certain particle (e.g. a Higgs boson) in a collision has a certain probability, so these events are randomly generated according to those probabilities.

What the ATLAS WUs do is simulate how the particles in each event interact with the detector, which consists of many extremely complex components. The description of the detector is partly in the ATLAS simulation software and partly in database services (the services which were not working last weekend and caused WUs to fail).

The output of the simulation is the HITS file, a description of where each particle "hits" (i.e. interacts with) the detector. Therefore a truly successful WU must produce a valid HITS file, but as mentioned above you can still get credit even if no HITS file is present, because we don't want people to suffer from problems in ATLAS software or infrastructure.
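The events-to-HITS workflow described above can be sketched as a toy pipeline. Everything here (the particle representation, the momentum cut, the function names) is an illustrative assumption; the real ATLAS detector simulation is vastly more complex.

```python
import random

# Toy sketch of the workflow described above: an EVNT file holds randomly
# generated collision events; the simulation propagates each event's
# particles through a detector model and records "hits". All names and
# thresholds are illustrative assumptions, not ATLAS software.

def generate_events(n, seed=42):
    """Stand-in for the EVNT.* input: each event is a list of particles,
    here reduced to a single momentum value per particle."""
    rng = random.Random(seed)
    return [[rng.uniform(1.0, 100.0) for _ in range(rng.randint(1, 5))]
            for _ in range(n)]

def simulate_event(event):
    """Record a 'hit' for every particle energetic enough to reach the
    (fictional) detector layer; real simulation tracks full geometry."""
    return [("hit", p) for p in event if p > 10.0]

def run_task(n_events):
    """Simulate every event; the combined hit lists stand in for the
    HITS result file."""
    events = generate_events(n_events)
    return [simulate_event(ev) for ev in events]

hits = run_task(100)  # one (possibly empty) hit list per event
```

The point of the sketch is the data flow: randomly generated events go in, per-event hit records come out, and the collection of hit records is what the HITS file stores.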
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
Thanks, David. Since ATLAS and SixTrack are the only programs running on my PCs, Linux or Windows 10, since the LHC consolidation, I am glad to spend time and electricity on them. Tullio
Joined: 9 Dec 14 · Posts: 202 · Credit: 2,533,875 · RAC: 0
Let me explain a little what actually happens in ATLAS tasks. The large input file which is called EVNT.* is a collection of "events". Each event represents a ... ...is in the HITS file, which is a description of where each particle "hits" (i.e. interacts with) the detector. Very interesting information, thank you! Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure. Am I understanding this right: you get credit and the WU seems successful although in reality it is not (no valid HITS file)? Isn't there a way to tell whether the WU is actually good or bad while it is running? At least I am not here to get credit in the first place, but to support what is in my opinion an important and fascinating project (the whole of CERN) with my CPU power and to produce really usable results. And if there is no valid HITS file, all the CPU time was wasted, right? Do you know what percentage of "successful" WUs are actually not usable (no correct HITS file) relative to the really successful ones (with a good HITS file)?
Joined: 14 Jan 10 · Posts: 1289 · Credit: 8,528,821 · RAC: 2,748
Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure. Hi David, Sometimes one sees the HITS file in the list of files in the stderr.txt, and sometimes it is not in the list, although it is surely targeted to srm://srm.ndgf.org. For BOINC they are both valid, but are they both also valid for the project? From your tasks: Not in list: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632 File in list: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170942
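One way to do the check described above by hand is to scan a task's stderr.txt for HITS entries. A minimal sketch, with the caveat that the regex and the sample log line are assumptions modelled on the filename quoted later in this thread; treat it as a heuristic, not an official validity check.

```python
import re

# Hedged helper: report any HITS result files mentioned in a BOINC
# stderr.txt. The pattern is an assumption built to match names like
# "HITS.10165253._194503.pool.root.1" seen in a guest log in this
# thread; absence of a match is only a hint, not proof of failure.

HITS_RE = re.compile(r"\bHITS\.[\w.]+\.root(\.\d+)?")

def find_hits_files(stderr_text: str) -> list:
    """Return the sorted, de-duplicated HITS filenames found in the log."""
    return sorted({m.group(0) for m in HITS_RE.finditer(stderr_text)})

# Example log fragment (shape taken from a post further down the thread):
log = ("Guest Log: -rw------- 1 root root 52108715 "
       "HITS.10165253._194503.pool.root.1")
print(find_hits_files(log))
```

An empty result on a "valid" task would correspond to the "not in list" case discussed here.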
Joined: 2 Sep 04 · Posts: 453 · Credit: 193,576,736 · RAC: 3,063
Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure. Do I understand this right, that this can happen regardless of whether the WU is crunched in your own data centre or here on volunteer machines? If this can happen in all places, then it is not wasted CPU time. Do you know a percentage of "successful" WUs that actually are not usable (no correct HITS file) relative to really successful ones (with a good HITS file)? That would really be interesting. Supporting BOINC, a great concept!
Joined: 15 Jun 08 · Posts: 2433 · Credit: 227,975,941 · RAC: 125,826
While testing the native Linux app a few days ago I observed the following. Running it on 1 core, the resulting HITS* file was 20 kB. Running it on 2 cores, the resulting HITS* file was 50 MB. David mentioned that the 50 MB file was the correct one. As far as I understood from one of David's postings on the DEV board, the same input data was always used during the test, and the input data was taken from the live system. How reliable are the results from a science perspective if the output differs depending on the number of cores used? Can this also happen inside the VM app?
Joined: 6 Sep 08 · Posts: 116 · Credit: 11,131,774 · RAC: 2,877
While testing the native linux app a few days ago I made the following experience. This is from this single-core VM job: 2017-03-16 05:18:26 (4821): Guest Log: -rw------- 1 root root 52108715 Mar 16 05:15 HITS.10165253._194503.pool.root.1
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
While testing the native linux app a few days ago I made the following experience. The 20 kB file would not be a valid file and would be rejected further up the chain. How reliable are the results from a science perspective if the output differs depending on the number of used cores? The results are independent of the number of cores. On the grid we can run the same kinds of tasks with different numbers of cores depending on the computing centre, and the results in terms of physics are the same. This is because the events are processed independently on each core and then, at the end of the task, merged together into the HITS result file. I can get some statistics next week on the success rate, but I do not think it is much worse than for tasks running on other ATLAS resources. Most of the failed or invalid WUs run for a very short period of time and so do not waste resources. However, there are always bugs in software or infrastructure problems which can affect a task, so we cannot guarantee that every WU will produce something useful for science. I'm sure this is true for every volunteer computing project, or indeed scientific computation in general.
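The core-independence argument above can be demonstrated with a small sketch: if each event is simulated independently and the partial results are merged back in event order, the merged output cannot depend on how many "cores" the events were dealt out to. The splitting scheme and helper names are illustrative assumptions, not the ATLAS implementation.

```python
# Sketch of the multi-core claim above: events are processed
# independently (here, dealt round-robin across "cores") and merged at
# the end, so the physics output does not depend on the core count.
# Purely illustrative; not ATLAS code.

def simulate(event):
    # A deterministic per-event result stands in for the detector
    # simulation of one event.
    return event * 2

def run(events, n_cores):
    # Deal events round-robin to cores, simulate each chunk
    # independently, then merge back into one "HITS" list in the
    # original event order.
    per_core = [events[i::n_cores] for i in range(n_cores)]
    partial = [[simulate(e) for e in chunk] for chunk in per_core]
    merged = [None] * len(events)
    for core, chunk in enumerate(partial):
        for j, hit in enumerate(chunk):
            merged[core + j * n_cores] = hit
    return merged

events = list(range(20))
assert run(events, 1) == run(events, 4) == run(events, 8)
```

The assertion at the end is the whole point: 1, 4, and 8 "cores" produce an identical merged result, which is why core count does not affect the physics.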
Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
All my ATLAS tasks validate on the Linux box even without the HITS file. All my ATLAS tasks are invalidated on the Windows 10 PC, despite it having a more modern AMD CPU and three times the RAM. Tullio
©2024 CERN