1) Message boards : Number crunching : Imbalance between Subprojects (Message 30677)
Posted 6 Jun 2017 by Dave Peachey
Post:
I am really kind of surprised to read over and over about problems with bandwidth and/or download/upload speed and transfer limits.

In times of cable modem and flat rates, all this should not be a problem at all.

Apologies but I'm afraid I'm going to take issue with you on this one ... whilst it may not be a problem where you are, this is part of the imbalance between (and sometimes within) many countries, and it also affects what the average (and not-so-average) DC cruncher has to, or is prepared to, pay to indulge in this hobby.

Yes, in the UK, cable modem is available as part of many ISPs' flat rate packages and the rates can be very good. However, I've seen comments elsewhere, on other projects and from other DC crunchers, e.g. in some parts of the US (where you wouldn't necessarily expect this to be an issue), from people who subsist on very low bandwidth and/or ridiculously low data rates, neither of which is conducive to running the LHC sub-projects (other than SixTrack) with their high bandwidth requirements.

So whilst it's possible, in the UK, to get good quality cable and ADSL/VDSL (usually as part of a package including other services and provided you live in the heart of a major conurbation), the majority of UK ISPs are not known for their generosity when it comes to these things ... either they cost a lot of money or else the quality of service is so poor (especially outside the major population centres) that they are as good as useless for high bandwidth, always-on DC work.

However, and in spite of living in central London where these things are readily available, I can't use cable and I don't want the majority of spurious features or services which come as part of such packages. Nor do I want to be tied into a rigid contract with an ISP which has little flexibility and penalties for over-use.

Hence, I prefer to pay for a high quality, basic service (with only the features I need/want), but it comes with a monthly bandwidth limit (350GB per month download - double what I was using three months ago, before I started crunching ATLAS WUs 24x7) and that is pretty much all eaten up by DC projects (LHC amongst them). The cost of this is three times what I used to pay for unlimited ADSL bandwidth just a few years ago - a service which the ISP I used at the time was not prepared to continue to supply.

On which basis, and as I said in my previous post, the choice to run DC full-time (especially projects with the high bandwidth requirements which LHC has on most of its sub-projects) is one which few average crunchers would be prepared to indulge in. Thus it is a limiting factor and contributes, I'm sure, to the less than spectacular take-up of the VM-based sub-projects on LHC.

Dave
2) Message boards : Number crunching : Imbalance between Subprojects (Message 30634)
Posted 5 Jun 2017 by Dave Peachey
Post:
- a large monthly data download allowance (ATLAS chews through 150-200GB per WU; CMS isn't far behind with its multiple jobs per session)

Correction ... that should be "150-200MB per WU" (but that's still a lot given that I get through one ATLAS WU per hour, around the clock).
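
(As a quick sanity check: at 150-200MB per WU and roughly 24 WUs per day, that's somewhere in the region of 3.5-5GB per day, or around 110-145GB per month, for ATLAS alone.)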
3) Message boards : Number crunching : Imbalance between Subprojects (Message 30632)
Posted 4 Jun 2017 by Dave Peachey
Post:
So in summary, project-level scheduling seems to be OK for us and what would be great is if more volunteers could run VM applications. What is the limitation?

Laurence,

The big issue for me (which also chimes with Philippe's comment in the immediately preceding post regarding the original DC philosophy) is that running any of the LHC sub-projects bar SixTrack seems to require an investment of additional time and resources beyond what your average cruncher is prepared to commit ... albeit "average" doesn't come near to describing some of the serious crunchers on LHC@Home ;-)

Specifically, and in order to run multiple instances of one or more sub-project WUs, I've found that this means:
- significant amounts of RAM for each machine (8GB seems to be a practical minimum; 32GB is better if the machine can take it) in order to run more than one single-core VM at a time and also use each machine for anything practical at the same time
- a large monthly data download allowance (ATLAS chews through 150-200GB per WU; CMS isn't far behind with its multiple jobs per session)
- fairly significant CPU power (to get the WU results back within a reasonable timescale)
- a robust computer set-up which can be optimised and left to its own devices (as noted above by Crystal Pellet) without encountering a periodic dearth of work (due to connection glitches with CERN servers) or periodic failures with the WUs it receives (bad batches of work)

On this basis, none of the sub-projects could be said to supply "rock solid" and "always available" work at a low TCO! All of this costs your average cruncher in terms of money to acquire the hardware, electricity and bandwidth allowances (hence more money) to operate it and also requires a degree of monitoring (time) which is more than most people are prepared to invest.

Now I know that some of the above criticisms could be levelled at any number of other BOINC projects so it's not unique to LHC@Home. However, allowing crunchers to optimise the balance for their individual machines might go some way to encouraging more people to run the LHC VM-based sub-projects rather than just sitting around waiting for SixTrack WUs.

Dave
4) Message boards : ATLAS application : Atlas out of work (and all other LHC-Projects) (Message 30193)
Posted 4 May 2017 by Dave Peachey
Post:
I think the others are CMS or other apps. Can you try the same test with only ATLAS WU running?

My last CMS WU finished just after midnight BST so, factoring that out, I have contacts for ATLAS-only major (i.e. 100+ MB) activity with the following servers:
- boincai04.cern.ch
- stratum-one-lbp.cern.ch
- cvmfs-egi.gridpp.rl.ac.uk
- cvmfs.racf.bnl.gov
- boincai02.cern.ch
and lesser activity (only tens of MB) with:
- db-atlas-squid.ndgf.org
- cvmfs.racf.bnl.gov
- lxcvmfs78.cern.ch
- lhcathome-upload.cern.ch

Obviously the cern.ch ones are all in Switzerland at CERN; the rl.ac.uk one resolves to the UK-based Science and Technology Facilities Council; the bnl.gov one resolves to the US-based Brookhaven National Laboratory; and the ndgf.org one would suggest the Nordic Data Grid Facility.

So that's an interesting mix of contacts in under nine hours ;-) But where does that leave us in terms of rbpeake's earlier comment in this thread, I wonder?
5) Message boards : ATLAS application : Atlas out of work (and all other LHC-Projects) (Message 30191)
Posted 4 May 2017 by Dave Peachey
Post:
Yes to both. The algorithm seems to rely on running time compared to estimated time, so the shorter WU give you less credit than expected and longer give you much more.

David,

Having now processed two significantly longer WUs in this batch (so I assume they are the 200-event ones), I can say that:
- yes, the amount of credit has increased over the 50-event WUs ... but not by a factor of four (more like three)
- the amount of credit (at around 75-80 per WU) is nothing like the 250-odd credits I was getting for the 10995533 WUs which were running with 100 events and taking half the time of the 200-event WUs

Evidence:
- WU 10995533 with a 1hr 11min runtime gained 250 credits
- WU 11271634 with a 2hr 7min runtime gained a mere 76 credits

So, the algorithm is (sort of) doing what is expected (for those specific WUs) but not in an equitable way compared to previous WU types.

Dave
6) Message boards : ATLAS application : Atlas out of work (and all other LHC-Projects) (Message 30185)
Posted 3 May 2017 by Dave Peachey
Post:
This was recently posted in the LHCb forum:

LHCb do not pre-select jobs to be sent to the community, you pick-up jobs from the same 'queue' as all other sites.

Will this ever be true for ATLAS? That would seem to minimize the risk of running out of tasks for BOINC.

Something I've noticed since I started analysing my data down/uploads (initially due to a concern about where all my data headroom was going) ... today I've received LHC WU data downloaded from a number of different sources, for example:
- cmsextproxy.cern.ch
- boincai04.cern.ch
- cvmfs-egi.gridpp.rl.ac.uk
- stratum-one-lbp.cern.ch
- db-atlas-squid.ndgf.org (to a lesser extent)

I'm guessing that the first one is downloading CMS WU data rather than ATLAS data (as I've been running a couple of those today as well) but, given I restarted running ATLAS WUs at around 3pm this afternoon and all four of the above appear within the post-3pm timeframe, I'd be interested to know which of the others is ATLAS-related.

I don't know how the initial WU download and any subsequent data downloads actually work (that would be an interesting addition for David C's Information on ATLAS tasks page!) but I suspect that ATLAS tasks/data isn't all coming from just the one place even now (although I may be wrong).

Cheers
7) Message boards : ATLAS application : Atlas out of work (and all other LHC-Projects) (Message 30179)
Posted 3 May 2017 by Dave Peachey
Post:
We have changed task 11271634 so that new WU will have 200 events. Each WU will still require downloading the same large input and the output will be 4 times larger but they will run for longer so the overall bandwidth usage will be lower.

David,

Thanks for that - I understand entirely that these things can't be guaranteed but it is always appreciated when steps are implemented to make life easier/cheaper for the majority of volunteers ... hopefully this change won't disenfranchise too many people.

Uploading large files isn't a problem for me as some of my GPUGrid results files regularly hit the 200MB mark (which is only a concern if the connection is flaky and might error out on the upload!); as I said, I don't have a bandwidth cap for uploads although I suspect that may be an issue for some people.

I'll take the hit with the faster, 50-event WUs for now and help out as much as I can to clear the backlog of small, 'expensive' ones in the anticipation of getting the longer 'cheaper' ones in due course ;-)

Just one other point, though ... I did notice that the original versions of 11271634 had a surprisingly low credit rating compared to previous WUs (half the time to complete but barely a tenth of the credit); I was getting 250-300 credits per WU previously (for 100 event WUs) then it went down to around 35 credits (for 50 event WUs).

Will upping the number of events (to 200) and doubling the time (in comparison to earlier WU ID types) put the credits back into a more reasonable range (say around the 450 mark) or is that something which is out of your control (e.g. because it's based on the ridiculous BOINC 'Credit New' calculation)?

Cheers
Dave
8) Message boards : ATLAS application : Atlas out of work (and all other LHC-Projects) (Message 30160)
Posted 2 May 2017 by Dave Peachey
Post:
The new WU have 50 events per task so are twice as fast to complete as the previous batch.

David,

Thanks, I've had a few of these (taskID=11271634) and they are, indeed, running at approx. twice the speed of the previous ones - around 43 mins compared to about 1hr 25 mins for the previous batches (based on a four-core utilisation regime).

However, I've also noticed that the main download file is over 100MB in size (compared to around 25MB for the previous WU types) and the upload file is about 25MB (which would equate well to the 50 vs. 100 events in the new WU).

Unfortunately, this makes them uneconomic for me to run in their current form given I have a 250GB per month data download limit, and running these continuously (even at a rate of one WU at a time) would mean roughly 3GB per day just to download the ATLAS task data. I don't have an upload cap so that aspect is not (and was not previously) a problem.
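
(To put numbers on that: at around 43 minutes per WU that's roughly 33 WUs per day, and at 100MB+ of download each that comes to a little over 3.3GB per day - around 100GB per month out of my 250GB allowance before any other project is taken into account.)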

Given I also have other projects which require similar amounts of data to be downloaded per day, I will quickly run out of available download capacity. So, whilst I may toy with these from time to time, I will have to curtail my continuous involvement for now because I just can't afford the bandwidth.

Unless, of course, something can be done to reduce the download file size ... ?

Dave
9) Message boards : ATLAS application : Another batch of faulty WUs? (Message 30069)
Posted 26 Apr 2017 by Dave Peachey
Post:
I'm starting to see a number of WUs terminating early and giving a "Validate error" in the results.

Examples include:
- Workunit 65970381
- Workunit 65971624
- Workunit 65971430
each of which has the common parameters:
- name includes text string ..Su7Ccp2YYBZmABFKDmABFKDm3INKDm..
- taskID = 10995533
and all of which are terminating early (anything from 10 to 30 minutes elapsed run-time).

As these are a relatively new batch of WUs (created around 10:00 UTC today) and I haven't had any/many wingmen report results, I don't know whether this is "just me" or a symptom of another batch of faulty WUs.

Having said all of the above, I have also had some successes with WUs bearing these parameters so that would suggest it isn't necessarily a completely faulty batch and that maybe some other factors are involved (although my machine is generally stable so I don't believe the fault is in the hardware/software set-up).

Is anyone else seeing the same or similar behaviour with WUs having these parameters?

Dave
10) Message boards : ATLAS application : Error -161 (Message 29948)
Posted 17 Apr 2017 by Dave Peachey
Post:
Whilst erroring out another dozen or so of these WUs this morning (allowing them to error out rather than aborting them gets them out of the system that bit quicker due to the "max # errors" setting), I noticed that all of these WUs have a common factor ... namely they all appear to contain the text string ..qnDDn7oo6G73TpABFKDmABFKDmPaIKDm.. within the WU name (at least, all of the ones I've encountered have done so).

I don't know whether there are any other WUs with different name strings which are exhibiting the same problem, nor do I know whether anyone has been able to crunch any of these successfully (and, if so, under what circumstances), but it does seem to suggest a common fault with a specific WU batch.

Moreover, I notice, to my dismay, that these WUs are still being produced (two of my most recent failures were created only this morning), so the problem isn't going to go away until someone terminates this batch production and/or identifies the problem and rectifies it.

Hopefully one of the project team will be able to get on the case some time this week.
11) Message boards : ATLAS application : 100% errors (Message 29941)
Posted 16 Apr 2017 by Dave Peachey
Post:
All of your tasks showing Error while computing seem to be victims of the problem being discussed, at length, in the thread on the "Error -161" message (per https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4224) which is a recent and recurrent problem.

Whilst this seems to affect people to a greater or lesser extent, there are a number of people who are experiencing this problem with a significant proportion of WUs they download (as would seem to be the case for you). In the absence of (apparent) investigations or any reports by the project team (I presume they are short-staffed due to the holidays hence the lack of news), the jury is out on where the problem lies.

I can't speak for the ones showing Validate error but, certainly with your attempts to process them, the result includes a failure to generate the 50MB+ HITS output file ... as would seem to be the case for the wingmen who are also failing with those same WUs. So something else is wrong there although, as those WUs mostly seem to validate eventually, that would seem to imply a separate fault.

However, this seems to be the first time (that I can recall) since the recent consolidation of ATLAS@home into the wider LHC@home that problems of this order of magnitude have been encountered, so I would hope this is a glitch and not a foreshadowing of future user experiences for ATLAS@home.

And, no, a 5% success rate on successful WU processing is very much not OK for this or any other project (a 5% failure rate is generally deemed to be just about acceptable) and is indicative of problems at either the project end, at your end ... or both!

BTW, your app_config file looks OK; mine is similar (wrapped, as usual, in the <app_config> root element):
<app_config>
<app>
<name>ATLAS</name>
<fraction_done_exact/>
<max_concurrent>1</max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<avg_ncpus>4.000000</avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<cmdline>--memory_size_mb 5700</cmdline>
</app_version>
</app_config>

but this is for a sixteen core machine using only four cores for ATLAS and with 32GB RAM.

Given your machine only seems to have two cores and 6GB of RAM (if I've read the specs page for that computer correctly), maybe it's underpowered for running ATLAS on two cores and you should consider cutting back to using only one core with a commensurate drop in the RAM setting.
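
If it helps, a cut-down single-core version might look something along these lines - though, to be clear, the memory figure below is only my guess at a sensible value for one core on a 6GB machine, not a tested setting:

<app_config>
<app>
<name>ATLAS</name>
<max_concurrent>1</max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<!-- one core for the VM -->
<avg_ncpus>1.000000</avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<!-- 3500MB is an untested guess for a single-core VM; adjust to suit -->
<cmdline>--memory_size_mb 3500</cmdline>
</app_version>
</app_config>

and remember to have BOINC Manager re-read the config files (or restart the client) after editing so the change takes effect.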
12) Message boards : ATLAS application : Error -161 (Message 29932)
Posted 14 Apr 2017 by Dave Peachey
Post:
... assuming they won't/can't be removed manually...

why not?

No reason that I know of other than the fact that, in spite of this issue having been extant for a couple of days, nothing has been done about it.

That's not to say it wouldn't be possible if there were the resources available and/or a willingness to do this ... albeit, in this respect, I actually know very little about the potential complexities of the matter so I may be completely wrong!
13) Message boards : ATLAS application : Error -161 (Message 29930)
Posted 14 Apr 2017 by Dave Peachey
Post:
I've had more than a dozen of them in the last 24 hours. I've worked on the assumption that anything with a very long predicted duration (i.e. significantly longer than what I experience for 'normal' WUs) is likely to be faulty - notwithstanding any legitimate 'ultra-long' WUs which are in circulation.

On which basis, I bump them to the top of the queue by temporarily suspending any other reasonable-looking WUs just to clear them out as quickly as possible. It's a laborious, manual intervention which I do once every eight hours or so but, thus far, I've been proven correct; anything showing a very long predicted duration has errored out within a few minutes.

Whilst that's not been too much of a waste of processing time overall (just over an hour all told), I have found that, with several of these predicted-long WUs in the queue at any one time, my BOINC queue gets clogged up to the extent that I can't download WUs for other projects because BOINC Manager thinks the queue is full (I keep a max. 0.75-day queue for the sake of avoiding too many problems if something drastic happens).

On which basis, yes, it's annoying and the sooner they are all errored out (assuming they won't/can't be removed manually), the better.
14) Message boards : ATLAS application : Very long tasks in the queue (Message 29688)
Posted 28 Mar 2017 by Dave Peachey
Post:
In contrast to what David told us some time ago (Task ID for the "longrunners" = 10959636), they now seem to have Task ID 10995522.

Erich,

I would call the 10995522 tasks and their run times "normal" for me given they usually only run for 4hrs or so of total CPU time - which equates to around 1hr20m of elapsed time when running as a 4-core task and which is what I'm used to seeing on my machine. That said, I too am now seeing BOINC Manager getting confused as to their anticipated length (as are you), which raises another question about what might be going on here.

As of this morning, however, all of my anticipated "long runners" with TaskID=11016767 have errored out overnight (with a "Validate error" message) in times ranging from 10mins to 17mins. Indeed, I've also had several 10995522, 25 or 28 tasks do the same thing and/or not produce a HITS file.

It hasn't been a complete failure - I have had some of these run to completion and validation - but my error rate has increased alarmingly to eight of the last twenty WUs I've tried to process, so I wonder whether there is an external factor involved here (given my machine has previously been rock solid on these tasks).

I wonder if David C can shed any light on this?

Dave
15) Message boards : ATLAS application : Very long tasks in the queue (Message 29684)
Posted 27 Mar 2017 by Dave Peachey
Post:
Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767 ?
<snip>
I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m.

So I bumped one of them to the front of the queue ... and it bombed after only 10mins run time with a crazy looking output log https://lhcathome.cern.ch/lhcathome/result.php?resultid=129069334.

Either someone was having fun when they coded it or else my machine has a nasty stutter and has severely mangled the content of that file! Anyway, back to the "regular" ones for now and I'll get around to the remaining long ones some time tomorrow.
16) Message boards : ATLAS application : Very long tasks in the queue (Message 29683)
Posted 27 Mar 2017 by Dave Peachey
Post:
Morning/afternoon/evening all,

Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767 ? Looking at the reworked version of the old Atlas@Home home page (http://lhcathome.web.cern.ch/projects/atlas - which is a very useful page!), these are designated as
Task mc16_13TeV DP2500_3000.simul (11016767) with 0/643 in progress - although I expect that's a tad out of date.

I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m. My normal run time (running in 4-core multi-thread mode) is on the order of 1hr20m to 1hr30m for the tasks with ID 10995522/25/28, which BOINC has now come to recognise along with the regular long-running 10995515/17 tasks (clocking in at between 4hr and 4hr30m). However, as these are new (at least to my BOINC installation), it could just be that BOINC doesn't know what to make of them (yet).

Does anyone know if these are a variation on the fabled 1000-event WUs (TaskID=10959636) or just seriously long runners of a different sort?

I suspect that if I knew how to correctly interpret the above task string I'd be able to answer my own question. I'd venture that "13TeV" is the energy in tera-electron volts but I don't understand the "DP2500_3000" part. Perhaps this is something on which David C could enlighten us in his "Information on ATLAS tasks" sticky thread?

Cheers
Dave
17) Message boards : ATLAS application : ATLAS out of beta (Message 29511)
Posted 21 Mar 2017 by Dave Peachey
Post:
Evening All,

I stayed away from the beta testing (I've done enough of that elsewhere) but my first two "production" ATLAS Simulation WUs seem to have gone off without a hitch with two more on the way ... it's looking OK so far.

I've used the same formula in the "app_config.xml" file as under the ATLAS@Home project, albeit I found that the <name>, <app_name> and <plan_class> parameter names were slightly different: <name> and <app_name> are both "ATLAS" (instead of "ATLAS_MCORE") and <plan_class> is "vbox64_mt_mcore_atlas" (instead of "vbox_64_mt_mcore"). I also notice that the .vdi file is slightly bigger at 1.62GB (for the current "2017_03_01" image) instead of 1.35GB (for the old ATLAS@Home "1.04" multicore image).

I've let them run with six of my sixteen cores as before (so assigning 7300MB memory for a six core session) and the timing is within a few minutes either side of what I was doing previously (around 60 minutes) so I'm guessing I've had the "standard" 100 event WUs thus far.
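
Putting those parameter names together, the relevant part of my app_config.xml ends up looking roughly like this (six cores, 7300MB; the surrounding structure is the standard BOINC app/app_version layout, so adjust the numbers to suit your own machine):

<app_config>
<app>
<name>ATLAS</name>
</app>
<app_version>
<app_name>ATLAS</app_name>
<avg_ncpus>6.000000</avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<cmdline>--memory_size_mb 7300</cmdline>
</app_version>
</app_config>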

Cheers
Dave


