Message boards : ATLAS application : Very long tasks in the queue

Previous · 1 · 2 · 3 · 4 · 5 . . . 6 · Next

AuthorMessage
tullio

Send message
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 29334 - Posted: 16 Mar 2017, 18:10:17 UTC - in response to Message 29333.  

Quis validate validators?
Tullio
ID: 29334 · Report as offensive     Reply Quote
gyllic

Send message
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 29336 - Posted: 16 Mar 2017, 19:49:12 UTC - in response to Message 29333.  
Last modified: 16 Mar 2017, 19:50:28 UTC

Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.
Tullio

I think there is something wrong with the validator for the Linux tasks.
None of the valid tasks on your Linux box shows the HITS*.root result file of about 60 MB for upload.
IMO those tasks can't be valid.


That is an interesting point. I have checked a couple of my valid results which ran on Linux hosts, and all of those I checked had that particular file. I don't know what that file stands for, but it is interesting that with the same host OS (Linux) different numbers of files are being produced and both are valid tasks. Maybe they are different types of tasks (if there are any at LHC at the moment).
ID: 29336 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Send message
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29337 - Posted: 16 Mar 2017, 19:51:56 UTC - in response to Message 29334.  

The validator gives credit if the CPU or walltime is above a certain amount, even if the task failed. This is so that if someone spends a long time running a task and it fails they are not penalised. The task is failed and retried higher up the chain.
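In pseudocode terms, the rule described above might look like this (a minimal sketch; the threshold value and names are hypothetical, not the project's actual configuration):

```python
# Sketch of the credit rule described above: credit is granted when the
# task succeeded, or when its CPU time or wall time shows substantial
# work was done before it failed. The threshold is hypothetical.

MIN_WORK_SECONDS = 3600  # illustrative value, not the real setting

def grants_credit(cpu_time_s, wall_time_s, exited_ok):
    """Return True if the task earns credit under this (toy) rule."""
    if exited_ok:
        return True
    return cpu_time_s >= MIN_WORK_SECONDS or wall_time_s >= MIN_WORK_SECONDS

# A task that crunched for hours and then failed still earns credit;
# the failed task itself is retried elsewhere, as described above.
```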

By the way I finished my first longrunner:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632

11 hours runtime on 4-cores. The credit was around 10 times what I normally get.
ID: 29337 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1289
Credit: 8,528,821
RAC: 2,748
Message 29339 - Posted: 16 Mar 2017, 21:01:41 UTC - in response to Message 29337.  

By the way I finished my first longrunner:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632

11 hours runtime on 4-cores. The credit was around 10 times what I normally get.

That's rather fast, but you have hyper-threading switched off.

The upload seems to be more than 500MB => 569MB

The credit granting mechanism is generous to you. For a 'normal' 4-core ATLAS task I got 70 cobblestones.
ID: 29339 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2433
Credit: 227,975,941
RAC: 125,826
Message 29340 - Posted: 16 Mar 2017, 21:02:19 UTC - in response to Message 29337.  
Last modified: 16 Mar 2017, 21:21:22 UTC

By the way I finished my first longrunner:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632

11 hours runtime on 4-cores. The credit was around 10 times what I normally get.

Really successful (for science)?
There is no HITS*.root result file in the log.

Edit:
A quick crosscheck with my last results shows that only about 70% of the WUs have a HITS* file.
Is the presence of such a file still a criterion for a successful scientific result, as mentioned here?
ID: 29340 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 3,063
Message 29341 - Posted: 16 Mar 2017, 21:43:37 UTC
Last modified: 16 Mar 2017, 21:43:50 UTC

And now I could finish my first longrunner:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170630

It is exactly 10x the normal runtime, but it is not exactly 10 times the credit (3,359 versus 420)


Supporting BOINC, a great concept !
ID: 29341 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 3,063
Message 29342 - Posted: 16 Mar 2017, 21:52:11 UTC

This one didn't survive: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170737


Supporting BOINC, a great concept !
ID: 29342 · Report as offensive     Reply Quote
Profile rbpeake

Send message
Joined: 17 Sep 04
Posts: 99
Credit: 30,836,799
RAC: 6,637
Message 29343 - Posted: 16 Mar 2017, 22:25:22 UTC

ID: 29343 · Report as offensive     Reply Quote
gyllic

Send message
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 29344 - Posted: 17 Mar 2017, 1:47:54 UTC

Success here too:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170634

A little less than 10 times the credit of the "normal" tasks.
ID: 29344 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 3,063
Message 29346 - Posted: 17 Mar 2017, 6:08:44 UTC

Overnight, 8 longrunners finished and were successfully validated.


Supporting BOINC, a great concept !
ID: 29346 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 1130
Credit: 49,813,154
RAC: 7,113
Message 29348 - Posted: 17 Mar 2017, 7:55:23 UTC - in response to Message 29346.  

Overnight, 8 longrunners finished and were successfully validated.


Those 4,000+ credit tasks do look nice Yeti

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10359162
Volunteer Mad Scientist For Life
ID: 29348 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Send message
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29349 - Posted: 17 Mar 2017, 8:46:07 UTC

Let me explain a little what actually happens in ATLAS tasks. The large input file, called EVNT.*, is a collection of "events". Each event represents a simulated collision of protons inside the ATLAS detector, and the file contains descriptions of the particles produced by these collisions. A given particle (e.g. a Higgs boson) is produced in a collision with a certain probability, so these events are randomly generated according to those probabilities.
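The random generation described here can be pictured with a toy sketch (the process names and probabilities are invented for illustration; the real values come from the physics generators, not from anything below):

```python
import random

# Toy illustration of event generation: each simulated collision
# ("event") yields a physics process with some probability. All the
# names and probabilities below are invented for illustration.

PROCESS_PROBS = {
    "higgs_like": 0.01,   # rare signal process (made-up probability)
    "background": 0.99,
}

def generate_events(n, seed=42):
    """Randomly draw n events according to the (toy) probabilities."""
    rng = random.Random(seed)
    names = list(PROCESS_PROBS)
    weights = list(PROCESS_PROBS.values())
    return rng.choices(names, weights=weights, k=n)
```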

What the ATLAS WUs do is simulate how the particles in each event interact with the detector, which consists of many extremely complex components. The description of the detector is partly in the ATLAS simulation software and partly in database services (the services which were not working last weekend and caused WUs to fail). The output of the simulation is the HITS file, which is a description of where each particle "hits" (i.e. interacts with) the detector.

Therefore a truly successful WU must have produced a valid HITS file, but as mentioned above you can still get credit even if no HITS file is present, because we don't want people to suffer from problems in ATLAS software or infrastructure.
ID: 29349 · Report as offensive     Reply Quote
tullio

Send message
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 29350 - Posted: 17 Mar 2017, 9:58:10 UTC

Thanks David. Since Atlas and SixTrack are the only programs running on my PCs, Linux or Windows 10, since the LHC consolidation, I am glad to spend time and electricity on them.
Tullio
ID: 29350 · Report as offensive     Reply Quote
gyllic

Send message
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 29351 - Posted: 17 Mar 2017, 10:53:15 UTC - in response to Message 29349.  
Last modified: 17 Mar 2017, 10:55:00 UTC

Let me explain a little what actually happens in ATLAS tasks. The large input file which is called EVNT.* is a collection of "events". Each event represents a ... ...is in the HITS file, which is a description of where each particle "hits" (i.e interacts with) the detector.

Very interesting information, thank you!

Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure.

Am I understanding this right? You get credit and the WU seems successful although in reality it is not (no valid HITS file). Isn't there a way to tell whether the WU is actually good or bad while it is running?
Because I, at least, am not here to get credit in the first place, but to support what is in my opinion an important and fascinating project (the whole of CERN) with my CPU power and produce real usable results. And if there is no valid HITS file, all the CPU time was wasted, right?
Do you know what percentage of "successful" WUs are actually not usable (no correct HITS file) relative to really successful ones (with a good HITS file)?
ID: 29351 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1289
Credit: 8,528,821
RAC: 2,748
Message 29352 - Posted: 17 Mar 2017, 11:18:55 UTC - in response to Message 29349.  

Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure.


Hi David,

Sometimes one sees the HITS file in the list of files in stderr.txt and sometimes it is not in the list, though it is surely targeted at srm://srm.ndgf.org.

For BOINC they are both valid, but are they both also valid for the project?

From your tasks:
Not in list: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632
File in list: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170942
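One way to check such a stderr.txt is to look for an 'ls -l'-style line naming a plausibly sized HITS file. A small sketch, assuming the log format seen in this thread (the 1 MB size cutoff is an arbitrary illustration, not a project rule):

```python
import re

# Heuristic from this thread: a scientifically useful task uploads a
# HITS*.root file of roughly 50-60 MB, while a broken one is tiny or
# missing. The 1 MB cutoff below is an assumption for illustration only.

HITS_RE = re.compile(r"(HITS\S*\.root\S*)")

def find_hits_file(log_lines, min_bytes=1_000_000):
    """Return (name, size) for the first HITS file mentioned in an
    'ls -l'-style log listing with a size above min_bytes, else None."""
    for line in log_lines:
        m = HITS_RE.search(line)
        if not m:
            continue
        # in 'ls -l' output the size is the only large numeric field
        for field in line.split():
            if field.isdigit() and int(field) >= min_bytes:
                return m.group(1), int(field)
    return None
```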
ID: 29352 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 3,063
Message 29353 - Posted: 17 Mar 2017, 11:39:11 UTC - in response to Message 29351.  

Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure.

Am I understanding this right? You get credit and the WU seems successful although in reality it is not (no valid HITS file). Isn't there a way to tell whether the WU is actually good or bad while it is running?
Because I, at least, am not here to get credit in the first place, but to support what is in my opinion an important and fascinating project (the whole of CERN) with my CPU power and produce real usable results. And if there is no valid HITS file, all the CPU time was wasted, right?

Do I understand this right, that this can happen regardless of whether the WU is crunched in your own data centre or here on volunteer machines?

If this can happen at all places, then it is not wasted CPU time.

Do you know what percentage of "successful" WUs are actually not usable (no correct HITS file) relative to really successful ones (with a good HITS file)?

That would really be interesting


Supporting BOINC, a great concept !
ID: 29353 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2433
Credit: 227,975,941
RAC: 125,826
Message 29354 - Posted: 17 Mar 2017, 13:01:43 UTC

While testing the native Linux app a few days ago I observed the following:
Running it on 1 core, the resulting HITS* file was 20 kB.
Running it on 2 cores, the resulting HITS* file was 50 MB.

David mentioned that the 50 MB file was the correct one.

As far as I understood from one of David's postings on the DEV board:
- the same input data was always used during the test
- the input data was taken from the live system

How reliable are the results from a science perspective if the output differs depending on the number of used cores?
Can this happen inside the VM app also?
ID: 29354 · Report as offensive     Reply Quote
m

Send message
Joined: 6 Sep 08
Posts: 116
Credit: 11,131,774
RAC: 2,877
Message 29356 - Posted: 17 Mar 2017, 13:24:10 UTC - in response to Message 29354.  
Last modified: 17 Mar 2017, 13:28:29 UTC

While testing the native Linux app a few days ago I observed the following:
Running it on 1 core, the resulting HITS* file was 20 kB.
Running it on 2 cores, the resulting HITS* file was 50 MB.

David mentioned that the 50 MB file was the correct one.

As far as I understood from one of David's postings on the DEV board:
- the same input data was always used during the test
- the input data was taken from the live system

How reliable are the results from a science perspective if the output differs depending on the number of used cores?
Can this happen inside the VM app also?

This is from this single core VM job:-

2017-03-16 05:18:26 (4821): Guest Log: -rw------- 1 root root 52108715 Mar 16 05:15 HITS.10165253._194503.pool.root.1
ID: 29356 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Send message
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29357 - Posted: 17 Mar 2017, 13:39:48 UTC - in response to Message 29354.  

While testing the native Linux app a few days ago I observed the following:
Running it on 1 core, the resulting HITS* file was 20 kB.
Running it on 2 cores, the resulting HITS* file was 50 MB.


The 20kB file would not be a valid file and would be rejected further up the chain.

How reliable are the results from a science perspective if the output differs depending on the number of used cores?
Can this happen inside the VM app also?


The results are independent of the number of cores. On the grid we can run the same kinds of tasks with different numbers of cores depending on the computing centre, and the physics results are the same. This is because the events are processed independently on each core and then, at the end of the task, merged together into the HITS result file.
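That scheme can be pictured with a toy sketch (not ATLAS code): each worker simulates its events independently, and the merged output is the same for any core count.

```python
from multiprocessing import Pool

# Toy model of the scheme described above: events are simulated
# independently, then merged. Because each event's result depends only
# on the event itself, the merged output is identical for any core count.

def simulate_event(event_id):
    """Stand-in for the detector simulation of a single event."""
    return event_id, (event_id * 31) % 97  # deterministic toy "hit"

def run(events, ncores):
    """Process events on ncores workers, then merge the partial results."""
    with Pool(ncores) as pool:
        hits = pool.map(simulate_event, events)
    return sorted(hits)  # merge step: an order-independent combination

if __name__ == "__main__":
    events = list(range(100))
    # same merged result whether run on 1 core or 4
    assert run(events, 1) == run(events, 4)
```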

I can get some statistics next week on the success rate, but I do not think it is much worse than for the tasks running on other ATLAS resources. Most of the failed or invalid WUs run for a very short period of time and so do not waste resources. However, there are always bugs in software or infrastructure problems which can affect the tasks, so we cannot guarantee that every WU will produce something useful for science. I'm sure this is true for every volunteer computing project, or indeed scientific computation in general.
ID: 29357 · Report as offensive     Reply Quote
tullio

Send message
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 29358 - Posted: 17 Mar 2017, 15:00:12 UTC

All my ATLAS tasks validate on the Linux box even when they do not have the HITS file. All my ATLAS tasks are invalidated on the Windows 10 PC, despite it having a more modern AMD CPU and three times the RAM.
Tullio
ID: 29358 · Report as offensive     Reply Quote


©2024 CERN