1) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40480)
Posted 16 Nov 2019 by BITLab Argo
Post:
The app_config.xml solution didn't work for me. If anyone knows how to get the client to request the native version instead of the VBox version, please let me know.
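(For reference, the kind of app_config.xml entry usually suggested for this is sketched below - the plan_class name is my guess and may not match the project's. As far as I know an <app_version> entry in app_config.xml can only override the resource usage of a version the scheduler has already decided to send, not make the client request that version, which would explain why it didn't work.)

<app_config>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>native_theory</plan_class>
    <avg_ncpus>1</avg_ncpus>
  </app_version>
</app_config>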

The Web GUI has a per-location "Run test applications" tickbox (which Atlas already requires for its native application); this allows opting in or out of the native app, but it has side-effects on other projects/sub-projects. Could you instead prefer the native app at the server but let users opt out, in which case they'd get the VBox app (if they have VBox installed)?

The thing is that the volunteers hit by my third bullet point are those for whom native apps won't run, e.g. because they can't get CVMFS to work. That might suggest that they are less technically able, so expecting them to hand-edit some obscure XML file seems a bit unfriendly!
2) Message boards : Theory Application : Move TheoryN Back Into Theory. (Message 40452)
Posted 14 Nov 2019 by BITLab Argo
Post:
I think you are asking about converting Theory to use the Atlas model.

...When VBox is installed it oscillates between the VBox and native app.
My experience from Atlas:
* If both VBox and "native" (i.e. its dependencies) are installed and working, the client will oscillate for a while, eventually preferring whichever option gets it the most credits
* If VBox is missing but "native" is OK, the server won't issue VBox tasks (which is fine) and the client will run native tasks (which is fine)
* If VBox is OK but the (Linux) host is missing the "native" dependencies, the client will fetch a mixture of tasks but the native ones will fail, which wastes resources and is not so fine. Worse (though this is based on experience with an older client version), the client seemed to then prioritise the failing native version over the successful VBox one. So I think there needs to be a way for Linux users to opt out of the native app.
3) Message boards : Cafe LHC : From LHC (27km) to FCC (100km) (Message 37805)
Posted 22 Jan 2019 by BITLab Argo
Post:
Depressing that there's no mention of trying to make it a muon collider, but instead going back to electrons/positrons and later protons.
Given the fuss being made about the LHC failing to find anything beyond the LEP-validated Standard Model, I'd have thought smashing something else together - even in the present tunnel - would be a more useful next step.

"Just keep banging the rocks together, guys!"
4) Message boards : Number crunching : Memory requirements for LHC applications (Message 37361)
Posted 16 Nov 2018 by BITLab Argo
Post:
Using singularity probably won't have big effects on the used/required memory (just a guess).
I agree - the 4 GB boxes I mentioned below should have been using it anyway because of their OS - but I was rather pointing out that the list of tasks the user sees could look very different, in case anyone else is having a look.
5) Message boards : Number crunching : Memory requirements for LHC applications (Message 37350)
Posted 15 Nov 2018 by BITLab Argo
Post:
(I've spotted a suitable machine to play with, but that's weeks rather than hours away)

Looking at the output from ps aux, there was some obvious overhead (BOINC client, CVMFS, MemoryMonitor etc.), and then a set of nCPU athena.py processes doing the work, with resources (CPU, memory) split evenly between them.
So my inclination would be to express the correlation as overhead + (per-process memory * nCPU).
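For anyone wanting to repeat that inspection, something like this sums the workers' resident memory (standard tools; athena.py is the process name as I remember it):

# total resident memory (RSS) of the athena.py workers, in MB
# the [.] also stops awk matching its own command line in the ps output
ps aux | awk '/athena[.]py/ { rss += $6 } END { printf "%.0f MB\n", rss/1024 }'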

NB this is from (human) memory and for native Atlas jobs (no Singularity). I can't recall what the Singularity ones looked like.
6) Message boards : Number crunching : Memory requirements for LHC applications (Message 37318)
Posted 12 Nov 2018 by BITLab Argo
Post:
I can't, though, see an easy way to figure the (native Atlas) formula out...
Thinking about it, this may be harder than I thought, as part of what Condor does is summarise the local resources available and then fetch tasks that match, so

and maybe I was just lucky to hit a bunch of undemanding tasks).
was actually the whole system doing the right thing by sending smaller tasks to low-spec machines. So then the formula will depend on the "smallest" sub-task that the project is willing to issue ... maybe an Atlas person could comment?
I can envisage a minimalist formula of, say, 300 MB + (1024 MB * nCores) to get some native jobs onto low-spec machines, but then there might be a lot of idle time once the small sub-tasks run out.
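Tabulating that guess (shell arithmetic; the 300/1024 figures are mine, not from the project):

# hypothetical minimalist formula: 300 MB overhead + 1024 MB per core
for n in 1 2 4 8; do echo "$n cores: $(( 300 + 1024 * n )) MB"; done
# -> 1324, 2348, 4396 and 8492 MB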
7) Message boards : Number crunching : Memory requirements for LHC applications (Message 37307)
Posted 11 Nov 2018 by BITLab Argo
Post:
I think we're done. Can someone with the authority please take this last copy and post it as a pinned message?
I still think that the mentioned formula for native ATLAS tasks is wrong!
I agree - I've run 4-core native Atlas on (dedicated) 4 GB machines without significantly worse throughput than 8 GB boxes. I can't, though, see an easy way to figure the formula out...

Unfortunately I've left the institution, the BITlab project is gone, and the machines are no doubt heading ever-closer to the skip; otherwise I'd have tried running a 1-core Atlas on one of the 2GB boxes, just to see what happens.
(I will admit I was slightly surprised to be getting 4x athena.py on the 4 GB machines. But IIRC the "overhead" (BOINC client, CVMFS, other Atlas processes) was only a few hundred MB, and maybe I was just lucky to hit a bunch of undemanding tasks).
8) Message boards : LHCb Application : 207 (0x000000CF) EXIT_NO_SUB_TASKS (Message 37286)
Posted 8 Nov 2018 by BITLab Argo
Post:
...the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3...
Which loadavg do you refer to?
1 min, 5 min, 15 min?
All three - they weren't identical, but all sat around an intermediate value, neither idle nor running flat out (and I looked quite a few times). This was for single-core tasks that were the only thing running on a two-core machine.

This is different to CPU-time/wallclock.
I know these are measurements of different things, and the latter's poor efficiency values have caused lots of complaints on these boards. I'm puzzled because I would expect the obvious causes (e.g. slow payload downloads, limited availability of sub-tasks) to flip the load between 0 and ~100%. Ten minutes idle followed by 5 minutes flat out would give ~0.3 on the 15-minute number, but it should be rare to see it on the 1-minute number at the same time.
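Back-of-envelope (treating the 15-minute figure as a simple duty-cycle average; the real loadavg is exponentially damped, but over many repeated cycles it settles near the same value): 5 minutes busy in every 15 gives 5 / (10 + 5) ≈ 0.33, i.e. the ~0.3 I keep seeing.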

Hence my mentioning it: for me, it looks as if the poor efficiency is because the code never runs flat out for some reason (have they requested a 1 GHz virtual CPU on a 3 GHz machine? :-) ).

LHCb's scientific app appears in the process list as a python script and should run at 100 % (or close to).
I can't remember seeing that - I thought it was all wrapped up inside the VirtualBox process. Watchdogs, Condor et al. shouldn't be using significant CPU for any length of time.

Be aware that the 1st calculation phase of this script runs only for roughly 1 min.
I often noticed jobs that dropped to 0 % after that phase for 1-1.5h until the jobs were cancelled and the VM requested the next one.

My understanding - which may very well be wrong and need updating! - is that the VBox job workflow goes something like:

  • BOINC client wrapper starts the VM
  • something in the VM (I believe now Condor on all experiments) requests job(s)/sub-task(s) from the experiment
  • the experiment's data payload for each is pulled down within the VM
  • the experiment's software - on CVMFS mounted within the VM - is pulled down (for the first job) and run over the payload
  • the results file is uploaded
  • Condor requests another job/sub-task

until the VM has been running for 10 hrs, after which no new jobs/sub-tasks are requested and the VM stops when the final running one finishes (or is killed after 18 hrs by the wrapper).

Thus there is in any case a lot of potential CPU-idle time during the network transfers. Naively, in your example it looks as if the sub-task starts but fails to access something - either the payload or the software (CVMFS) - until Condor loses patience and gives up on it? Whereas mine seem to start but then crawl rather than run.
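To put that loop in shell-ish pseudocode (the function names are made-up stand-ins; in reality Condor and the CVMFS mount do this work):

# sketch of the in-VM loop described above - not real project code
fetch_job()      { :; }   # ask the experiment for a job/sub-task via Condor
get_payload()    { :; }   # pull the job's data down into the VM
run_payload()    { :; }   # run the CVMFS-hosted software over the payload
upload_results() { :; }

start=$(date +%s)
while [ $(( $(date +%s) - start )) -lt $(( 10 * 3600 )) ]; do
    fetch_job || break    # repeated failure here ends in EXIT_NO_SUB_TASKS
    get_payload
    run_payload
    upload_results
done
# ...and the wrapper kills the whole VM at 18 hrs regardless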

9) Message boards : LHCb Application : 207 (0x000000CF) EXIT_NO_SUB_TASKS (Message 37284)
Posted 8 Nov 2018 by BITLab Argo
Post:
And if the tasks get jobs, overall CPU usage is usually at most about 5-10% of the total runtime of the task.
Something has been very wrong with the LHCb tasks for quite some time, but no one at LHC seems to care :-(


I was seeing slightly better, but the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3, which was broadly in line with the reported task efficiency (CPU-time/wallclock-run-time); that seems too little CPU for actual number-crunching, but way too much for a VM idling or transferring data.
10) Message boards : LHCb Application : 207 (0x000000CF) EXIT_NO_SUB_TASKS (Message 37281)
Posted 8 Nov 2018 by BITLab Argo
Post:
Sorry, I got the lingo wrong. It should have been:

Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 2.05 (0.29 - 18.12)
... which suggests that a few people must be getting jobs/sub-tasks, else all LHCb tasks would fail after 20 minutes (207 (0x000000CF) EXIT_NO_SUB_TASKS)?

And today the page reports
Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 4.96 (0.37 - 18.12)
so the server sees more tasks running longer.
11) Message boards : LHCb Application : 207 (0x000000CF) EXIT_NO_SUB_TASKS (Message 37269)
Posted 7 Nov 2018 by BITLab Argo
Post:
Surely, if they run out of work because we've crunched it all, that's a good thing. It would be nice if someone said something, though!

What puzzles me is that the "Server status" page has
Runtime of last 100 tasks in hours: average, min, max
LHCb Simulation 2.05 (0.29 - 18.12)
which suggests that a few people must be getting tasks, else surely all the values would be around 0.3 h (20 min) by now?
12) Message boards : Number crunching : Memory requirements for LHC applications (Message 37226)
Posted 4 Nov 2018 by BITLab Argo
Post:
Corrections are incorporated.


I think your LHCb formula is wrong - it doesn't match what I posted in the other thread
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4867&postid=37168, and the latter matches what I see running.

Also, your list is for VirtualBox jobs, not native multi-threaded ones - multi-threaded native Atlas seems to have more frugal requirements.
13) Message boards : Number crunching : Local control of which subprojects run` (Message 37206)
Posted 3 Nov 2018 by BITLab Argo
Post:
I disagree:

Most helpful would be....


... not having to hand-hack XML to make things work, in the first place.
14) Message boards : Number crunching : Local control of which subprojects run` (Message 37168)
Posted 2 Nov 2018 by BITLab Argo
Post:
The memory requirements formulae are in a few posts ... Happy to pin it if anyone finds a definitive formula


I found the following for VirtualBox tasks fairly recently; apologies for not noting where:

Atlas: 3 GB + (0.9 GB * ncores)
3993 for 1
4915 for 2
6759 for 4
10545 for 8

LHCb: memory for 1 thread is 2048 MB, add 1300 MB for each additional thread
3348 for 2
5948 for 4
11148 for 8

Theory: memory for 1 thread is 730 MB, add 100 MB for each additional thread
830 for 2
1030 for 4
1430 for 8


(the numbers show the memory requirement for multiple threads within one VM; the LHCb and Theory figures match what I see boinc_client pass to the vboxwrapper, so a little more overhead will be needed on top)
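The table can be sanity-checked straight from the formulas (shell arithmetic, values in MB; note the 4- and 8-core Atlas rows come out at 6758 and 10444, slightly off the quoted 6759 and 10545, so the real constants are presumably a little different):

for n in 1 2 4 8; do
  atlas=$(( 3072 + 9216 * n / 10 ))    # 3 GB + 0.9 GB per core
  lhcb=$((  2048 + 1300 * (n - 1) ))   # 2048 MB + 1300 MB per extra thread
  theory=$(( 730 +  100 * (n - 1) ))   # 730 MB + 100 MB per extra thread
  echo "$n cores: Atlas ${atlas}, LHCb ${lhcb}, Theory ${theory}"
done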

I don't know what the requirements for Sixtrack are; certainly <= 512 MB per task.

I don't know what the requirements for native Atlas are; I have successfully (HITS files produced!) run 4-core native Atlas on 4-core 4 GB machines without drama. I did increase BOINC's allowed memory usage from the default 50% to 85%.
15) Message boards : Number crunching : Local control of which subprojects run` (Message 37157)
Posted 1 Nov 2018 by BITLab Argo
Post:
The various subprojects of LHC have really different characteristics, and the only existing way to control which ones are run is at the account level.
I'm wondering if this control is possible at the local machine level.


Why do you feel the "Location" feature is insufficient/inappropriate for you?

The only issue I've had with locations is that the website splits their configuration across two separate pages on different menus, which is a nuisance when making sure everything's consistent at setup.
After that it worked fine and it's trivial to flip a machine between subprojects when needed.
16) Message boards : ATLAS application : ATLAS native - Configure CVMFS to work with openhtc.io (Message 37031)
Posted 15 Oct 2018 by BITLab Argo
Post:
I use
singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/images/singularity/x86_64-slc6.img hostname

to test singularity (this also depends on CVMFS working, of course).

I install singularity from RPMs and haven't seen your issue; I would guess that you have installed singularity somewhere non-standard and so need to restart the boinc client and make sure it's picking the path up correctly.
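Two quick checks (standard commands; the 'boinc' account name is an assumption - it's what the usual Linux packages create):

# is singularity on your own PATH at all?
command -v singularity
# the client often runs as its own account, whose PATH may differ from yours
sudo -u boinc which singularity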

Otherwise, if you have a deeper problem, please start a new thread so others with the same problem can find it.


I'm trying to run the Atlas native application; however, I see the following error:
This is not SLC6, need to run with Singularity....
Checking Singularity...
sh: 1: singularity: not found
Singularity is not installed, aborting

Seems the system can't find singularity. But I've already installed singularity: ...
17) Message boards : LHCb Application : New version v1.05 (Message 36983)
Posted 9 Oct 2018 by BITLab Argo
Post:
the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-(
Would be great if someone looked into this.


My (single-core) LHCb jobs also have very poor efficiencies: 10 - 50%.

Random job's log snippet:
2018-10-09 01:39:15 (7650): Status Report: Job Duration: '64800.000000'
2018-10-09 01:39:15 (7650): Status Report: Elapsed Time: '6000.000000'
2018-10-09 01:39:15 (7650): Status Report: CPU Time: '954.280000'

That snippet works out at ~16% efficiency (954 s of CPU in 6000 s of elapsed time). It does look like the pilots are getting work from Condor, but I haven't poked around enough to work out whether the poor efficiency comes from waiting on subsequent downloads/uploads.
18) Message boards : CMS Application : no new WUs available (Message 36825)
Posted 22 Sep 2018 by BITLab Argo
Post:
on another PC, again, same problem:

2018-09-22 22:01:41 (3740): VM Completion Message: Condor exited after 44402s without running a job

stderr in total: https://lhcathome.cern.ch/lhcathome/result.php?resultid=207056703


But that log suggests that it actually did run 8 jobs, one of which used nearly 10 hrs of CPU time!
Looks like the Condor client is getting confused somehow.

Are these tasks not checked before they are released to the volunteers?


To be fair, I've not seen any announcement from CMS suggesting that the mass of us re-attach...
19) Message boards : Theory Application : New version 263.80 (Message 36803)
Posted 21 Sep 2018 by BITLab Argo
Post:
No idea why, but you're lucky: mine seem to have dropped by a factor of 100!

... and then this morning the credit rates have come back up again, but only by a factor of ten...
20) Message boards : Theory Application : New version 263.80 (Message 36800)
Posted 21 Sep 2018 by BITLab Argo
Post:
No idea why, but you're lucky: mine seem to have dropped by a factor of 100!
See e.g. hostid=10414406

