Atlas out of work (and all other LHC-Projects)

Message boards : ATLAS application : Atlas out of work (and all other LHC-Projects)
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 281
Credit: 41,058,368
RAC: 50,570
Message 29857 - Posted: 7 Apr 2017, 6:37:35 UTC
Last modified: 7 Apr 2017, 6:37:59 UTC

ATLAS seems to be completely out of work (and all the other LHC projects seem to be as well).
____________


Supporting BOINC, a great concept !

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 124
Credit: 2,875,749
RAC: 10,318
Message 29860 - Posted: 7 Apr 2017, 7:25:39 UTC - in response to Message 29857.

There should be tasks in the queue now. I'm not sure what was wrong, maybe something on the BOINC server was stuck overnight. There is a huge backlog of tasks to assimilate so you may have some tasks which are finished but pending credit allocation.

gyllic
Joined: 9 Dec 14
Posts: 71
Credit: 818,188
RAC: 5,260
Message 30138 - Posted: 30 Apr 2017, 17:04:44 UTC

Probably out of work again.

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 124
Credit: 2,875,749
RAC: 10,318
Message 30139 - Posted: 30 Apr 2017, 18:35:51 UTC - in response to Message 30138.

The current set of WU were processed faster than I expected so there are now none left in the queue. Monday is a CERN holiday so we'll probably have to wait until Tuesday to get more work.

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 124
Credit: 2,875,749
RAC: 10,318
Message 30157 - Posted: 2 May 2017, 13:06:34 UTC - in response to Message 30139.

We have new WU now, sorry for the break.

The new WU have 50 events per task so are twice as fast to complete as the previous batch.

Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 752,075
RAC: 14
Message 30160 - Posted: 2 May 2017, 14:31:25 UTC - in response to Message 30157.
Last modified: 2 May 2017, 14:33:44 UTC

The new WU have 50 events per task so are twice as fast to complete as the previous batch.

David,

Thanks, I've had a few of these (taskID=11271634) and they are, indeed, running at approx. twice the speed of the previous ones - around 43 mins compared to about 1 hr 25 mins for the previous batches (based on a four-core utilisation regime).

However, I've also noticed that the main download file is over 100MB in size (compared to around 25MB for the previous WU types) and the upload file is about 25MB (which would equate well to the 50 vs. 100 events in the new WU).

Unfortunately, this makes them uneconomic for me to run in their current form given I have a 250GB per month data download limit, and running these continuously (even at a rate of one WU at a time) would mean roughly 3GB per day just to download the ATLAS task data. I don't have an upload cap, so that aspect is not a problem (and wasn't previously).
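For what it's worth, the 3GB/day figure is easy to sanity-check; here's a quick back-of-the-envelope sketch using the approximate numbers quoted above (these are this thread's rough figures, not official ones):

```python
# Back-of-the-envelope check of the ~3 GB/day download estimate.
# Assumed figures (approximations from this thread, not official numbers):
download_mb_per_wu = 100   # ~100 MB input per 50-event WU
runtime_min = 43           # ~43 min per WU on four cores, one WU at a time

wus_per_day = 24 * 60 / runtime_min                   # ~33.5 WUs per day
gb_per_day = wus_per_day * download_mb_per_wu / 1000  # download volume per day
gb_per_month = gb_per_day * 30                        # vs. a 250 GB monthly cap

print(f"{gb_per_day:.1f} GB/day, {gb_per_month:.0f} GB/month")
```

So roughly 3.3 GB/day, or about 100 GB of a 250 GB monthly allowance on ATLAS downloads alone.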

Given I also have other projects which require similar amounts of data to be downloaded per day, I will quickly run out of available download capacity. So, whilst I may toy with these from time to time, I will have to curtail my continuous involvement for now because I just can't afford the bandwidth.

Unless, of course, something can be done to reduce the download file size ... ?

Dave

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,399,908
RAC: 3,711
Message 30161 - Posted: 2 May 2017, 14:52:40 UTC - in response to Message 30160.

So we have the same situation as for several weeks last December and January, when the download size was more than 200 MB per WU, plus several MB during calculation, plus a 50 MB upload. Combined with the fact that the runtime was halved, in the end the network load did not change.
I agree with Dave Peachey that this is the best way to lock out users whose internet connection has a transfer limit.
The worst case would be that, in case of a faulty batch, the user's host may hit such a transfer limit within a very short time, e.g. overnight.

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 124
Credit: 2,875,749
RAC: 10,318
Message 30178 - Posted: 3 May 2017, 9:10:11 UTC - in response to Message 30161.

Unfortunately, due to the many different types of simulation tasks that ATLAS runs, we cannot guarantee that the input files will always be small. But this time it is made worse by having fewer events in each WU and this is something that we can control.

So we have changed task 11271634 so that new WU will have 200 events. Each WU will still require downloading the same large input and the output will be 4 times larger but they will run for longer so the overall bandwidth usage will be lower. Since there are many 50 event WU in the queue it may take some time before the longer WU start being sent out.
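To make the tradeoff concrete, here is a rough per-event bandwidth comparison using the approximate sizes quoted in this thread (~100 MB input regardless of event count, ~0.5 MB of upload per event); these are assumptions, not official figures:

```python
# Rough bandwidth cost per simulated event for the two WU sizes.
# Assumed sizes (approximations from this thread): ~100 MB input
# per WU regardless of event count, ~0.5 MB of upload per event.
def mb_per_event(events, input_mb=100.0, output_mb_per_event=0.5):
    total_mb = input_mb + events * output_mb_per_event
    return total_mb / events

small_wu = mb_per_event(50)    # 125 MB / 50 events  = 2.5 MB/event
large_wu = mb_per_event(200)   # 200 MB / 200 events = 1.0 MB/event
```

Under those assumptions, grouping 200 events per WU cuts the bandwidth per event by roughly 60% compared to the 50-event WUs, since the fixed input download is amortised over four times as many events.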

Hope this helps with those who have limited bandwidth.

Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 752,075
RAC: 14
Message 30179 - Posted: 3 May 2017, 10:33:56 UTC - in response to Message 30178.

We have changed task 11271634 so that new WU will have 200 events. Each WU will still require downloading the same large input and the output will be 4 times larger but they will run for longer so the overall bandwidth usage will be lower.

David,

Thanks for that - I understand entirely that these things can't be guaranteed but it is always appreciated when steps are implemented to make life easier/cheaper for the majority of volunteers ... hopefully this change won't disenfranchise too many people.

Uploading large files isn't a problem for me as some of my GPUGrid results files regularly hit the 200MB mark (which is only a concern if the connection is flaky and might error out on the upload!); as I said, I don't have a bandwidth cap for uploads, although I suspect that may be an issue for some people.

I'll take the hit with the faster, 50-event WUs for now and help out as much as I can to clear the backlog of small, 'expensive' ones in anticipation of getting the longer 'cheaper' ones in due course ;-)

Just one other point, though ... I did notice that the original versions of 11271634 had a surprisingly low credit rating compared to previous WUs (half the time to complete but barely a tenth of the credit); I was getting 250-300 credits per WU previously (for 100 event WUs) then it went down to around 35 credits (for 50 event WUs).

Will upping the number of events (to 200) and doubling the time (in comparison to earlier WU ID types) put the credits back into a more reasonable range (say around the 450 mark) or is that something which is out of your control (e.g. because it's based on the ridiculous BOINC 'Credit New' calculation)?

Cheers
Dave

rbpeake
Joined: 17 Sep 04
Posts: 55
Credit: 15,620,725
RAC: 1,342
Message 30184 - Posted: 3 May 2017, 18:03:38 UTC
Last modified: 3 May 2017, 18:04:14 UTC

This was recently posted in the LHCb forum:

LHCb do not pre-select jobs to be sent to the community, you pick-up jobs from the same 'queue' as all other sites.


Will this ever be true for ATLAS? That would seem to minimize the risk of running out of tasks for BOINC.

Thanks!!
____________
Regards,
Bob P.

Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 752,075
RAC: 14
Message 30185 - Posted: 3 May 2017, 18:30:37 UTC - in response to Message 30184.

This was recently posted in the LHCb forum:

LHCb do not pre-select jobs to be sent to the community, you pick-up jobs from the same 'queue' as all other sites.

Will this ever be true for ATLAS? That would seem to minimize the risk of running out of tasks for BOINC.

Something I've noticed since I started analysing my data down/uploads (initially due to a concern about where all my data headroom was going) ... today I've received LHC WU data downloaded from a number of different sources, for example:
- cmsextproxy.cern.ch
- boincai04.cern.ch
- cvmfs-egi.gridpp.rl.ac.uk
- stratum-one-lbp.cern.ch
- db-atlas-squid.ndgf.org (to a lesser extent)

I'm guessing that the first one is downloading CMS WU data rather than ATLAS data (as I've been running a couple of those today as well) but, given I restarted running ATLAS WUs at around 3pm this afternoon and all four of the above appear within the post-3pm timeframe, I'd be interested to know which of the others is ATLAS-related.

I don't know how the initial WU download and any subsequent data downloads actually work (that would be an interesting addition for David C's Information on ATLAS tasks page!) but I suspect that ATLAS tasks/data isn't all coming just from one place even now (although I may be wrong).

Cheers

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 124
Credit: 2,875,749
RAC: 10,318
Message 30189 - Posted: 4 May 2017, 7:28:49 UTC - in response to Message 30179.

Will upping the number of events (to 200) and doubling the time (in comparison to earlier WU ID types) put the credits back into a more reasonable range (say around the 450 mark) or is that something which is out of your control (e.g. because it's based on the ridiculous BOINC 'Credit New' calculation)?


Yes to both. The algorithm seems to rely on running time compared to estimated time, so the shorter WU give you less credit than expected and longer give you much more.

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 124
Credit: 2,875,749
RAC: 10,318
Message 30190 - Posted: 4 May 2017, 7:34:39 UTC - in response to Message 30185.

- boincai04.cern.ch

The large input files for ATLAS are downloaded from here rather than directly from the LHC@Home server. With the consolidated project we are splitting the download/upload servers between apps to scale up better.

- db-atlas-squid.ndgf.org (to a lesser extent)

This is a squid cache for the database information that is downloaded when the job starts running.

I think the others are CMS or other apps. Can you try the same test with only ATLAS WU running?

Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 752,075
RAC: 14
Message 30191 - Posted: 4 May 2017, 7:42:35 UTC - in response to Message 30189.

Yes to both. The algorithm seems to rely on running time compared to estimated time, so the shorter WU give you less credit than expected and longer give you much more.

David,

Having now processed two significantly longer WUs in this batch (so I assume they are the 200-event ones), I can say that:
- yes, the amount of credit has increased over the 50-event WUs ... but not by a factor of four (more like three)
- the amount of credit (at around 75-80 per WU) is nothing like the 250-odd credits I was getting for the 10995533 WUs which were running with 100 events and taking half the time of the 200-event WUs

Evidence:
- WU 10995533 with a 1hr 11min runtime gained 250 credits
- WU 11271634 with a 2hr 7min runtime gained a mere 76 credits

So, the algorithm is (sort of) doing what is expected (for those specific WUs) but not in an equitable way compared to previous WU types.
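Putting a number on "not in an equitable way": a quick credits-per-hour comparison of the two example WUs above (runtimes converted to minutes):

```python
# Credits per hour for the two example WUs reported above.
def credits_per_hour(credits, runtime_min):
    return credits * 60 / runtime_min

old_wu = credits_per_hour(250, 71)    # WU type 10995533: 1 hr 11 min
new_wu = credits_per_hour(76, 127)    # WU type 11271634: 2 hr  7 min
ratio = old_wu / new_wu               # roughly a factor of six
```

That's around 211 credits/hour for the old 100-event WUs versus around 36 credits/hour for the new 200-event ones - close to a six-fold gap for the same kind of work.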

Dave

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,399,908
RAC: 3,711
Message 30192 - Posted: 4 May 2017, 7:49:00 UTC - in response to Message 30190.

cmsextproxy.cern.ch is an alias for lhchomeproxy.cern.ch
stratum-one-lbp.cern.ch is an alias for cvmfs-stratum-one.cern.ch
cvmfs-egi.gridpp.rl.ac.uk is an alias for cernvmfs.gridpp.rl.ac.uk

All of them are main systems for data distribution, not just for one single project.

Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 752,075
RAC: 14
Message 30193 - Posted: 4 May 2017, 7:56:33 UTC - in response to Message 30190.
Last modified: 4 May 2017, 8:08:50 UTC

I think the others are CMS or other apps. Can you try the same test with only ATLAS WU running?

My last CMS WU finished at just after midnight BST so, factoring that out, I have contacts for ATLAS-only major (i.e. 100+ MB) activity with the following servers:
- boincai04.cern.ch
- stratum-one-lbp.cern.ch
- cvmfs-egi.gridpp.rl.ac.uk
- cvmfs.racf.bnl.gov
- boincai02.cern.ch
and lesser activity (only tens of MB) with:
- db-atlas-squid.ndgf.org
- cvmfs.racf.bnl.gov
- lxcvmfs78.cern.ch
- lhcathome-upload.cern.ch

Obviously the cern.ch ones are all in Switzerland at CERN; the rl.ac.uk one resolves to the UK-based Science and Technology Facilities Council; the bnl.gov one resolves to the US-based Brookhaven National Laboratory; the ndgf.org one would suggest the Nordic Data Grid Facility.

So that's an interesting mix of contacts in under nine hours ;-) But where does that leave us in terms of rbpeake's earlier comment in this thread, I wonder?

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,399,908
RAC: 3,711
Message 30194 - Posted: 4 May 2017, 8:13:29 UTC - in response to Message 30193.

So that's an interesting mix of contacts in under nine hours ;-) ...

And that's far from all.
See:
http://wlcg.web.cern.ch/
http://frontier.cern.ch/

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 124
Credit: 2,875,749
RAC: 10,318
Message 30200 - Posted: 4 May 2017, 10:37:29 UTC - in response to Message 30193.

So that's an interesting mix of contacts in under nine hours ;-)


Almost all of those are CVMFS-related services which are located at various labs and institutes around the world.

But where does that leave us in terms of rbpeake's earlier comment in this thread, I wonder?


The "normal" tasks which run on the ATLAS grid process 1000 events each and this is too much for the average volunteer - it leads to more chance of failure and very large results to upload. See the discussions on earlier threads about having a separate "longrunners" app for this. So for ATLAS@Home we assign specific batches of tasks which are grouped into 100 events per WU. The tasks themselves are the same, just the grouping is different. The disadvantage of this is that it takes manual work to assign these tasks and we have to keep an eye on things or the queue goes empty. However there are occasions when the whole ATLAS grid runs out of simulation work to do so running normal tasks would not necessarily guarantee ATLAS@home always has work.

One of the ideas behind the LHC consolidation was that if one app runs out of work then people can run others. There is a setting on the project preferences to have a preferred app but run others if that app has no work. In terms of science all LHC apps should be considered equal :)

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 281
Credit: 41,058,368
RAC: 50,570
Message 30201 - Posted: 4 May 2017, 12:01:01 UTC - in response to Message 30200.

One of the ideas behind the LHC consolidation was that if one app runs out of work then people can run others. There is a setting on the project preferences to have a preferred app but run others if that app has no work. In terms of science all LHC apps should be considered equal :)

Nope, David.

Yes, we can set up one favorite app and a fallback, but this fallback allows ALL other apps, and those other apps do not necessarily fit a cruncher's needs or wishes.

For example, I cannot run an app that can't stand a suspend/wait of several hours (CMS couldn't live with this; is that still so?).

Having no influence on the fallback apps, I can only block them. This was better when CERN had several separate projects, as the BOINC client can handle getting no work from a main project, and I could set up the fallback projects myself.

By putting all the LHC projects together, we have lost the ability to configure fallback projects to our needs.
____________


Supporting BOINC, a great concept !

maeax
Joined: 2 May 07
Posts: 182
Credit: 11,301,914
RAC: 11,411
Message 30206 - Posted: 4 May 2017, 13:25:59 UTC

The locations home, school and work can be used for different computers with different projects in LHC@Home.
Is it possible to give the locations names like the project names (ATLAS, CMS, ...)?
