Message boards : ATLAS application : Very long tasks in the queue

Previous · 1 · 2 · 3 · 4 · 5 . . . 6 · Next

AuthorMessage
tullio

Send message
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 29334 - Posted: 16 Mar 2017, 18:10:17 UTC - in response to Message 29333.  

Quis validate validators?
Tullio
ID: 29334 · Report as offensive     Reply Quote
gyllic

Send message
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 29336 - Posted: 16 Mar 2017, 19:49:12 UTC - in response to Message 29333.  
Last modified: 16 Mar 2017, 19:50:28 UTC

Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.
Tullio

I think there is something wrong with the validator for the Linux tasks.
None of the valid tasks on your Linux box shows the HITS*.root result file of about 60 MB for upload.
IMO those tasks can't be valid.


That is an interesting point. I have checked a couple of my valid results which ran on Linux hosts, and all of those I checked had that particular file. I don't know what that file stands for, but it is interesting that with the same host OS (Linux) different numbers of files are being produced and both are valid tasks. Maybe they are different types of tasks (if there are any at LHC at the moment).
ID: 29336 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Send message
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29337 - Posted: 16 Mar 2017, 19:51:56 UTC - in response to Message 29334.  

The validator gives credit if the CPU or walltime is above a certain amount, even if the task failed. This is so that if someone spends a long time running a task and it fails they are not penalised. The task is failed and retried higher up the chain.
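In pseudocode terms, the rule described above might look like this (a minimal sketch; the threshold value and names are hypothetical, not the project's actual configuration):

```python
# Sketch of the credit rule described above: credit is granted when the
# task succeeded, or when its CPU time or wall time shows substantial
# work was done before it failed. The threshold is hypothetical.

MIN_WORK_SECONDS = 3600  # illustrative value, not the real setting

def grants_credit(cpu_time_s, wall_time_s, exited_ok):
    """Return True if the task earns credit under this (toy) rule."""
    if exited_ok:
        return True
    return cpu_time_s >= MIN_WORK_SECONDS or wall_time_s >= MIN_WORK_SECONDS

# A task that crunched for hours and then failed still earns credit;
# the failed task itself is retried elsewhere, as described above.
```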

By the way I finished my first longrunner:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632

11 hours runtime on 4-cores. The credit was around 10 times what I normally get.
ID: 29337 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1289
Credit: 8,528,821
RAC: 2,748
Message 29339 - Posted: 16 Mar 2017, 21:01:41 UTC - in response to Message 29337.  

By the way I finished my first longrunner:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632

11 hours runtime on 4-cores. The credit was around 10 times what I normally get.

That's rather fast, but you have hyper-threading switched off.

The upload seems to be more than 500MB => 569MB

The credit granting mechanism is generous to you. For a 'normal' 4-core ATLAS task I got 70 cobblestones.
ID: 29339 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2433
Credit: 227,975,941
RAC: 125,826
Message 29340 - Posted: 16 Mar 2017, 21:02:19 UTC - in response to Message 29337.  
Last modified: 16 Mar 2017, 21:21:22 UTC

By the way I finished my first longrunner:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632

11 hours runtime on 4-cores. The credit was around 10 times what I normally get.

Really successful (for science)?
There is no HITS*.root result file in the log.

Edit:
A quick crosscheck with my last results shows that only about 70% of the WUs have a HITS* file.
Is the presence of such a file still a criterion for a successful scientific result, as mentioned here?
ID: 29340 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 3,063
Message 29341 - Posted: 16 Mar 2017, 21:43:37 UTC
Last modified: 16 Mar 2017, 21:43:50 UTC

And now I could finish my first longrunner:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170630

It is exactly 10x the normal runtime, but it is not exactly 10 times the credit (3,359 versus 420)


Supporting BOINC, a great concept !
ID: 29341 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 3,063
Message 29342 - Posted: 16 Mar 2017, 21:52:11 UTC

This one didn't survive: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170737


Supporting BOINC, a great concept !
ID: 29342 · Report as offensive     Reply Quote
Profile rbpeake

Send message
Joined: 17 Sep 04
Posts: 99
Credit: 30,836,799
RAC: 6,637
Message 29343 - Posted: 16 Mar 2017, 22:25:22 UTC

ID: 29343 · Report as offensive     Reply Quote
gyllic

Send message
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 29344 - Posted: 17 Mar 2017, 1:47:54 UTC

Success here too:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170634

A little less than 10 times the credit of the "normal" tasks.
ID: 29344 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 3,063
Message 29346 - Posted: 17 Mar 2017, 6:08:44 UTC

Overnight, 8 longrunners finished and were successfully validated.


Supporting BOINC, a great concept !
ID: 29346 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 1130
Credit: 49,813,154
RAC: 7,113
Message 29348 - Posted: 17 Mar 2017, 7:55:23 UTC - in response to Message 29346.  

Overnight, 8 longrunners finished and were successfully validated.


Those 4,000+ credit tasks do look nice Yeti

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10359162
Volunteer Mad Scientist For Life
ID: 29348 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Send message
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29349 - Posted: 17 Mar 2017, 8:46:07 UTC

Let me explain a little what actually happens in ATLAS tasks. The large input file, called EVNT.*, is a collection of "events". Each event represents a simulated collision of protons inside the ATLAS detector, and the file contains descriptions of the particles produced by these collisions. A given particle (e.g. a Higgs boson) is produced in a collision with a certain probability, so these events are randomly generated according to those probabilities.
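The random generation described here can be pictured with a toy sketch (the process names and probabilities are invented for illustration; the real values come from the physics generators, not from anything below):

```python
import random

# Toy illustration of event generation: each simulated collision
# ("event") yields a physics process with some probability. All the
# names and probabilities below are invented for illustration.

PROCESS_PROBS = {
    "higgs_like": 0.01,   # rare signal process (made-up probability)
    "background": 0.99,
}

def generate_events(n, seed=42):
    """Randomly draw n events according to the (toy) probabilities."""
    rng = random.Random(seed)
    names = list(PROCESS_PROBS)
    weights = list(PROCESS_PROBS.values())
    return rng.choices(names, weights=weights, k=n)
```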

What the ATLAS WUs do is simulate how the particles in each event interact with the detector, which consists of many extremely complex components. The description of the detector is partly in the ATLAS simulation software and partly in database services (the services which were not working last weekend and caused WUs to fail). The output of the simulation is the HITS file, which is a description of where each particle "hits" (i.e. interacts with) the detector.

Therefore a truly successful WU must have produced a valid HITS file, but as mentioned above you can still get credit even if no HITS file is present, because we don't want people to suffer from problems in ATLAS software or infrastructure.
ID: 29349 · Report as offensive     Reply Quote
tullio

Send message
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 29350 - Posted: 17 Mar 2017, 9:58:10 UTC

Thanks David. Since Atlas and SixTrack are the only programs running on my PCs, Linux or Windows 10, since the LHC consolidation, I am glad to spend time and electricity on them.
Tullio
ID: 29350 · Report as offensive     Reply Quote
gyllic

Send message
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 29351 - Posted: 17 Mar 2017, 10:53:15 UTC - in response to Message 29349.  
Last modified: 17 Mar 2017, 10:55:00 UTC

Let me explain a little what actually happens in ATLAS tasks. The large input file which is called EVNT.* is a collection of "events". Each event represents a ... ...is in the HITS file, which is a description of where each particle "hits" (i.e interacts with) the detector.

Very interesting information, thank you!

Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure.

Am I understanding this right? You get credit and the WU seems successful although in reality it is not (no valid HITS file). Isn't there a way to tell whether the WU is actually good or bad while it is running?
Because I, at least, am not here to get credit in the first place, but to support what is in my opinion an important and fascinating project (the whole of CERN) with my CPU power and produce real usable results. And if there is no valid HITS file, all the CPU time was wasted, right?
Do you know what percentage of "successful" WUs are actually not usable (no correct HITS file) relative to really successful ones (with a good HITS file)?
ID: 29351 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1289
Credit: 8,528,821
RAC: 2,748
Message 29352 - Posted: 17 Mar 2017, 11:18:55 UTC - in response to Message 29349.  

Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure.


Hi David,

Sometimes one sees the HITS file in the list of files in stderr.txt and sometimes it is not in the list, though it is surely targeted at srm://srm.ndgf.org.

For BOINC they are both valid, but are they both also valid for the project?

From your tasks:
Not in list: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170632
File in list: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170942
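One way to check such a stderr.txt is to look for an 'ls -l'-style line naming a plausibly sized HITS file. A small sketch, assuming the log format seen in this thread (the 1 MB size cutoff is an arbitrary illustration, not a project rule):

```python
import re

# Heuristic from this thread: a scientifically useful task uploads a
# HITS*.root file of roughly 50-60 MB, while a broken one is tiny or
# missing. The 1 MB cutoff below is an assumption for illustration only.

HITS_RE = re.compile(r"(HITS\S*\.root\S*)")

def find_hits_file(log_lines, min_bytes=1_000_000):
    """Return (name, size) for the first HITS file mentioned in an
    'ls -l'-style log listing with a size above min_bytes, else None."""
    for line in log_lines:
        m = HITS_RE.search(line)
        if not m:
            continue
        # in 'ls -l' output the size is the only large numeric field
        for field in line.split():
            if field.isdigit() and int(field) >= min_bytes:
                return m.group(1), int(field)
    return None
```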
ID: 29352 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 3,063
Message 29353 - Posted: 17 Mar 2017, 11:39:11 UTC - in response to Message 29351.  

Therefore a truly successful WU must have a valid HITS file produced, but as mentioned above you can still get credit even if no HITS file is present because we don't want people to suffer from problems in ATLAS software or infrastructure.

Am I understanding this right? You get credit and the WU seems successful although in reality it is not (no valid HITS file). Isn't there a way to tell whether the WU is actually good or bad while it is running?
Because I, at least, am not here to get credit in the first place, but to support what is in my opinion an important and fascinating project (the whole of CERN) with my CPU power and produce real usable results. And if there is no valid HITS file, all the CPU time was wasted, right?

Do I understand this right, that this can happen regardless of whether the WU is crunched in your own data centre or here on volunteer machines?

If this can happen at all places, then it is not wasted CPU time.

Do you know what percentage of "successful" WUs are actually not usable (no correct HITS file) relative to really successful ones (with a good HITS file)?

That would really be interesting


Supporting BOINC, a great concept !
ID: 29353 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2433
Credit: 227,975,941
RAC: 125,826
Message 29354 - Posted: 17 Mar 2017, 13:01:43 UTC

While testing the native Linux app a few days ago I observed the following:
Running it on 1 core, the resulting HITS* file was 20 kB.
Running it on 2 cores, the resulting HITS* file was 50 MB.

David mentioned that the 50 MB file was the correct one.

As far as I understood from one of David's postings on the DEV board:
- the same input data was always used during the test
- the input data was taken from the live system

How reliable are the results from a science perspective if the output differs depending on the number of used cores?
Can this happen inside the VM app also?
ID: 29354 · Report as offensive     Reply Quote
m

Send message
Joined: 6 Sep 08
Posts: 116
Credit: 11,131,774
RAC: 2,877
Message 29356 - Posted: 17 Mar 2017, 13:24:10 UTC - in response to Message 29354.  
Last modified: 17 Mar 2017, 13:28:29 UTC

While testing the native Linux app a few days ago I observed the following:
Running it on 1 core, the resulting HITS* file was 20 kB.
Running it on 2 cores, the resulting HITS* file was 50 MB.

David mentioned that the 50 MB file was the correct one.

As far as I understood from one of David's postings on the DEV board:
- the same input data was always used during the test
- the input data was taken from the live system

How reliable are the results from a science perspective if the output differs depending on the number of used cores?
Can this happen inside the VM app also?

This is from this single core VM job:-

2017-03-16 05:18:26 (4821): Guest Log: -rw------- 1 root root 52108715 Mar 16 05:15 HITS.10165253._194503.pool.root.1
ID: 29356 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Send message
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29357 - Posted: 17 Mar 2017, 13:39:48 UTC - in response to Message 29354.  

While testing the native Linux app a few days ago I observed the following:
Running it on 1 core, the resulting HITS* file was 20 kB.
Running it on 2 cores, the resulting HITS* file was 50 MB.


The 20kB file would not be a valid file and would be rejected further up the chain.

How reliable are the results from a science perspective if the output differs depending on the number of used cores?
Can this happen inside the VM app also?


The results are independent of the number of cores. On the grid we can run the same kinds of tasks with different numbers of cores depending on the computing centre, and the physics results are the same. This is because the events are processed independently on each core and then, at the end of the task, merged together into the HITS result file.
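That scheme can be pictured with a toy sketch (not ATLAS code): each worker simulates its events independently, and the merged output is the same for any core count.

```python
from multiprocessing import Pool

# Toy model of the scheme described above: events are simulated
# independently, then merged. Because each event's result depends only
# on the event itself, the merged output is identical for any core count.

def simulate_event(event_id):
    """Stand-in for the detector simulation of a single event."""
    return event_id, (event_id * 31) % 97  # deterministic toy "hit"

def run(events, ncores):
    """Process events on ncores workers, then merge the partial results."""
    with Pool(ncores) as pool:
        hits = pool.map(simulate_event, events)
    return sorted(hits)  # merge step: an order-independent combination

if __name__ == "__main__":
    events = list(range(100))
    # same merged result whether run on 1 core or 4
    assert run(events, 1) == run(events, 4)
```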

I can get some statistics next week on the success rate, but I do not think it is much worse than for the tasks running on other ATLAS resources. Most of the failed or invalid WUs run for a very short period of time and so do not waste resources. However, there are always bugs in software or infrastructure problems which can affect the tasks, so we cannot guarantee that every WU will produce something useful for science. I'm sure this is true for every volunteer computing project, or indeed scientific computation in general.
ID: 29357 · Report as offensive     Reply Quote
tullio

Send message
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 29358 - Posted: 17 Mar 2017, 15:00:12 UTC

All my ATLAS tasks validate on the Linux box even when they do not have the HITS file. All my ATLAS tasks are invalidated on the Windows 10 PC, despite it having a more modern AMD CPU and three times the RAM.
Tullio
ID: 29358 · Report as offensive     Reply Quote


©2024 CERN