Message boards : ATLAS application : Very long tasks in the queue

rbpeake
Joined: 17 Sep 04
Posts: 99
Credit: 30,642,553
RAC: 1,979
Message 29360 - Posted: 17 Mar 2017, 15:08:49 UTC - in response to Message 29358.  

All my Atlas tasks validate on the Linux box even without the HITS file. All my Atlas tasks are invalidated on the Windows 10 PC, despite it having a more modern AMD CPU and three times the RAM.
Tullio

I find I need to run Atlas by itself. It did not do well when I was also running Einstein@home.
Regards,
Bob P.

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,464,258
RAC: 5,837
Message 29361 - Posted: 17 Mar 2017, 15:16:51 UTC - in response to Message 29348.  

Overnight, 8 longrunners finished and validated successfully.


Those 4,000+ credit tasks do look nice Yeti

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10359162

So far my best: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170665

6,665.59 credits *smile*


Supporting BOINC, a great concept !

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,464,258
RAC: 5,837
Message 29362 - Posted: 17 Mar 2017, 15:18:42 UTC - in response to Message 29360.  

I find I need to run Atlas by itself. It did not do well when I was also running Einstein@home.

Yeah, BOINC has a lot of problems running multi-core WUs side by side with single-core WUs, so it is a really good idea to run only one project when you run these multi-core WUs.


Supporting BOINC, a great concept !

maeax
Joined: 2 May 07
Posts: 2090
Credit: 158,816,631
RAC: 127,244
Message 29369 - Posted: 18 Mar 2017, 5:42:00 UTC

Runtime 1 day and 15 hours:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60711144

The upload of 585.46 MB is waiting...

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1273
Credit: 8,480,147
RAC: 2,155
Message 29371 - Posted: 18 Mar 2017, 8:04:46 UTC
Last modified: 18 Mar 2017, 8:08:26 UTC

Longrunner returned: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170731

Run time: 1 day 9 hours 8 min 10 sec
CPU time: 5 days 7 hours 30 min 21 sec
Validate state: Valid
Credit: 632.96
Not very much for over 5 days of CPU.

In the stderr output, I don't find any ATLAS-job information, except "Starting ATLAS job. (PandaID=3283615871 taskID=10959636)"
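
For a rough sense of scale, the figures above can be turned into credit per CPU hour. This is only a back-of-the-envelope sketch in Python: the helper name to_seconds is made up for illustration, and the only inputs are the CPU time and credit quoted in this post.

```python
def to_seconds(days, hours, minutes, seconds):
    """Convert a days/hours/minutes/seconds duration into seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

# Values copied from the result page quoted in this post.
cpu_seconds = to_seconds(5, 7, 30, 21)  # CPU time: 5 days 7 hours 30 min 21 sec
credit = 632.96

cpu_hours = cpu_seconds / 3600
credit_per_cpu_hour = credit / cpu_hours

print(f"{cpu_hours:.1f} CPU hours, {credit_per_cpu_hour:.2f} credit per CPU hour")
```

That works out to roughly 5 credits per CPU hour for this task.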

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 803
Credit: 649,991,649
RAC: 239,589
Message 29383 - Posted: 18 Mar 2017, 12:24:05 UTC

I had a few:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126171019
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170863
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170832

They didn't seem to run for very long, and the one ending in 19 had some errors.

It seems some of the normal ones got ~1000 credits today?

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1273
Credit: 8,480,147
RAC: 2,155
Message 29386 - Posted: 18 Mar 2017, 12:50:28 UTC - in response to Message 29383.  

I had a few:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126171019
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170863
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170832

They didn't seem to run for very long, and the one ending in 19 had some errors.


You have set up dual-core tasks without an app_config.xml. They get 3600 MB of RAM, but 4400 MB is required for multi-core tasks.
So either use an app_config.xml that sets the minimum RAM, or set up at least three cores in your preferences so that a VM with 4600 MB is created.
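
As a rough illustration of the first option, here is a small Python sketch that generates such an app_config.xml. The element names follow BOINC's app_config.xml layout, but the app name "ATLAS", the plan class string, and passing the RAM through a --memory_size_mb option in <cmdline> are assumptions that are not confirmed in this thread. Check client_state.xml on your host for the real app name and plan class before using anything like this.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_name, plan_class, ncpus, memory_mb):
    """Return the text of an app_config.xml with one <app_version> entry."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Assumption: the VM memory size is passed to the wrapper on its command line.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    return ET.tostring(root, encoding="unicode")

# Hypothetical app name and plan class; 2 cores with the 4400 MB minimum from this post.
xml_text = build_app_config("ATLAS", "vbox64_mt_mcore_atlas", 2, 4400)
print(xml_text)
```

The printed text would be saved as app_config.xml in the project's directory, and the client then has to re-read its config files before the change takes effect.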

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,464,258
RAC: 5,837
Message 29387 - Posted: 18 Mar 2017, 12:59:45 UTC - in response to Message 29371.  

Longrunner returned: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170731

Run time: 1 day 9 hours 8 min 10 sec
CPU time: 5 days 7 hours 30 min 21 sec
Validate state: Valid
Credit: 632.96
Not very much for over 5 days of CPU.

Is it this machine? https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10360630

Then, yes, you are right, this is definitely low, but your normal Atlas WUs also get very low credit. I don't know what the reason is ...

Could it be that the last benchmark went wrong?


Supporting BOINC, a great concept !

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1273
Credit: 8,480,147
RAC: 2,155
Message 29388 - Posted: 18 Mar 2017, 13:42:44 UTC - in response to Message 29387.  

Then, yes, you are right, this is definitely low, but your normal Atlas WUs also get very low credit. I don't know what the reason is ...

Could it be that the last benchmark went wrong?

The benchmark is at least 1 year and maybe even 2 years old.
I do not have that low-credit 'problem' with the Theory tasks, so it's BOINC's new credit algorithm combined with new applications.
Maybe the benchmark was a bit low, because I had fixed it at that level: World Community Grid sometimes has a major problem with exceeded time limits when it switches tasks of the same application from long runners to very short runners or vice versa. In the meantime I have raised the floating-point and integer speeds.

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 803
Credit: 649,991,649
RAC: 239,589
Message 29389 - Posted: 18 Mar 2017, 13:58:11 UTC - in response to Message 29386.  

I thought I had picked out only the single-core ones, since I changed back to single-core tasks: the combination of multi-core and multiple tasks isn't respected correctly, and app_config cannot be set correctly either.

I'll see if I get some more long running ones in single core.

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,464,258
RAC: 5,837
Message 29390 - Posted: 18 Mar 2017, 15:00:38 UTC

Meanwhile, I have 28 longrunners successfully finished and uploaded, plus 1 that failed.


Supporting BOINC, a great concept !

HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29397 - Posted: 18 Mar 2017, 17:23:17 UTC

So far, I have had 5 long-runners, both 2-core and 4-core. All went through OK with app_config.xml setting memory at 4400 MB.
Here is a 2-core: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170739
Here is a 4-core: https://lhcathome.cern.ch/lhcathome/result.php?resultid=126170657
Credit allocation, as often, is difficult to understand. The 2-core has more than double the credit of the 4-core, although the CPU time for the 4-core is more than double the CPU time of the 2-core...
We are the product of random evolution.

Erich56
Joined: 18 Dec 15
Posts: 1687
Credit: 103,037,844
RAC: 126,619
Message 29404 - Posted: 19 Mar 2017, 7:03:08 UTC

It would definitely be a good idea to create a separate task category for the longrunners.

On one of my PCs, I am running 3 tasks with 3 cores each on a "high-end" processor, so it's no problem if longrunners are downloaded and processed.

On another two PCs, I run tasks with one core only, as the processors are older, slower ones. So here it does not make any sense at all to process longrunners; crunching time would be 6-8 days.

Hence, it would be nice if I could determine in advance, for each of my PCs, whether or not longrunners are downloaded. I strongly suspect the same is true for other crunchers as well.

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 803
Credit: 649,991,649
RAC: 239,589
Message 29406 - Posted: 19 Mar 2017, 8:24:19 UTC
Last modified: 19 Mar 2017, 8:26:48 UTC

At GPUGRID they have long and short tasks.

On Rosetta there is an option for target run time.

I agree with Erich

BOINC gives an ETA of 10 days on my E5-2675v3.

Erich56
Joined: 18 Dec 15
Posts: 1687
Credit: 103,037,844
RAC: 126,619
Message 29413 - Posted: 19 Mar 2017, 11:57:40 UTC

On my PC with 3 ATLAS tasks of 3 cores each, all 3 tasks are long-runners now. It seems each of them will take about 56 hours to finish. So I'll see what happens.

maeax
Joined: 2 May 07
Posts: 2090
Credit: 158,816,631
RAC: 127,244
Message 29435 - Posted: 20 Mar 2017, 11:51:12 UTC - in response to Message 29369.  

Runtime 1 day and 15 hours:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60711144

The upload of 585.46 MB is waiting...


Thank you, ATLAS team: 2,600 Cobblestones. Upload finished successfully.

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 1273
Credit: 8,480,147
RAC: 2,155
Message 29436 - Posted: 20 Mar 2017, 12:08:52 UTC
Last modified: 20 Mar 2017, 12:12:48 UTC

I got a resend of a long runner that failed due to EXIT_DISK_LIMIT_EXCEEDED. Peak disk usage reported 5,960.42 MB.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126683384
My resend task started as a single core VM and I'll try to restart it with 4 cores.

HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29457 - Posted: 20 Mar 2017, 20:52:47 UTC

My resend task started as a single core VM and I'll try to restart it with 4 cores.

How do you do that? I mean restarting a single core task with a different number of cores?
We are the product of random evolution.

Toby Broom
Volunteer moderator
Send message
Joined: 27 Sep 08
Posts: 803
Credit: 649,991,649
RAC: 239,589
Message 29464 - Posted: 20 Mar 2017, 21:35:19 UTC

Looks like my Xeon E5-2675v3 won't make the deadline for the long runners; the ETA is 8 days.

e.g.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127152468

Toby Broom
Volunteer moderator
Send message
Joined: 27 Sep 08
Posts: 803
Credit: 649,991,649
RAC: 239,589
Message 29466 - Posted: 20 Mar 2017, 22:17:01 UTC - in response to Message 29436.  

I have 4 that failed with EXIT_DISK_LIMIT_EXCEEDED (5-7 GB).

It's a bit irritating, though, that the log shows the HITS file and:

2017-03-20 15:56:46 (2872): Guest Log: Successfully finished the ATLAS job!
2017-03-20 15:56:46 (2872): Guest Log: Copying the results back to the shared directory!
2017-03-20 15:56:46 (2872): Guest Log: Copied the result file back to the shared directory and created atlas_done file!
2017-03-20 15:56:46 (2872): Guest Log: Success! Shutting down the machine.
2017-03-20 15:56:46 (2872): VM Completion File Detected.

I lost 900,000 sec of compute. All three computers have over 100 GB of free disk space, and BOINC is set to use 150 GB and is using about 25 GB, so there should be plenty of space.