1)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40480)
Posted 16 Nov 2019 by BITLab Argo Post:
"The app_config.xml solution didn't work for me. If anyone knows how to get the client to request the native version over the VBox version, please let me know."
The Web GUI has a per-location "Run test applications" tickbox (which Atlas already requires for its native application); this allows opt-in/opt-out of the native app, but it has side-effects on other projects/sub-projects. Could you instead prefer the native app at the server but let users opt out, in which case they'd get the VBox app (if they have VBox installed)?
The thing is that the volunteers hit by my third bullet point are those for whom native apps won't run, e.g. because they can't get CVMFS to work. That might suggest that they are less technically able, so expecting them to hand-edit some obscure XML file seems a bit unfriendly!
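For context, the kind of app_config.xml being discussed above is a project-directory override like the sketch below. The plan-class name native_theory is an assumption, not confirmed by the project, and note that app_config.xml can only tune resource settings per plan class; it has no supported knob to make the scheduler prefer one app version over another, which is consistent with it "not working" for this purpose:

```xml
<!-- Sketch only: goes in the project's directory under the BOINC data dir.
     The plan_class value is a guess; check the task names on the website. -->
<app_config>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>native_theory</plan_class>
    <avg_ncpus>1</avg_ncpus>
  </app_version>
</app_config>
```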
2)
Message boards :
Theory Application :
Move TheoryN Back Into Theory.
(Message 40452)
Posted 14 Nov 2019 by BITLab Argo Post:
"I think you are asking about converting Theory to use the Atlas model. ... When VBox is installed it oscillates between the VBox and native app."
My experience from Atlas:
* If both VBox and "native" (i.e. its dependencies) are installed and working, the client will oscillate for a while, eventually preferring whichever option gets it the most credits.
* If VBox is missing but "native" is OK, the server won't issue VBox tasks (which is fine) and the client will run native tasks (which is fine).
* If VBox is OK but the (Linux) host is missing the "native" dependencies, the client will fetch a mixture of tasks, but the native ones will fail, wasting resources - not so fine. Worse (though this is based on experience with an older client version), the client seemed to then prioritise the failing native version instead of the successful VBox one.
So I think there needs to be a way for Linux users to opt out of the native app.
3)
Message boards :
Cafe LHC :
From LHC (27km) to FCC (100km)
(Message 37805)
Posted 22 Jan 2019 by BITLab Argo Post:
Depressing that there's no mention of trying to make it a muon collider, but instead going back to electrons/positrons and later protons/antiprotons. Given the fuss being made about the LHC failing to find anything beyond the LEP-validated Standard Model, I'd have thought smashing something else together - even in the present tunnel - would be a more useful next step. "Just keep banging the rocks together, guys!"
4)
Message boards :
Number crunching :
Memory requirements for LHC applications
(Message 37361)
Posted 16 Nov 2018 by BITLab Argo Post:
"Using singularity probably won't have big effects on the used/required memory (just a guess)."
I agree - the 4 GB boxes I mentioned below should have been using it because of the OS - but I was rather pointing out that the list of tasks that the user will see could look very different, in case anyone else is having a look.
5)
Message boards :
Number crunching :
Memory requirements for LHC applications
(Message 37350)
Posted 15 Nov 2018 by BITLab Argo Post:
(I've spotted a suitable machine to play with, but that's weeks rather than hours away.)
Looking at the output from ps aux, there was some obvious overhead (BOINC client, CVMFS, MemoryMonitor etc.), and then a set of nCPU athena.py processes doing the work, with resources (CPU, memory) split evenly between them. So my inclination would be to express the correlation as overhead + (memory * nCPU). NB this is from (human) memory and for native Atlas jobs (no Singularity); I can't recall what the Singularity ones looked like.
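The ps aux inspection described above can be sketched as follows. This is only an illustration of the method: the sample text mimics ps output, it is not a real capture, and real overhead processes (BOINC client, CVMFS, MemoryMonitor) would be summed separately.

```python
# Sketch: sum the resident memory of the athena.py workers from `ps aux`
# output, giving the "memory * nCPU" part of overhead + (memory * nCPU).
def athena_rss_kib(ps_output: str) -> int:
    """Sum the RSS column (KiB, 6th field of `ps aux`) over athena.py lines."""
    total = 0
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) > 10 and "athena.py" in line:
            total += int(fields[5])  # RSS in KiB
    return total

# Illustrative two-worker sample (not real output): 1 GiB RSS each.
sample = (
    "boinc  101  99.0  5.0  900000  1048576  ?  R  10:00  1:00  python athena.py\n"
    "boinc  102  99.0  5.0  900000  1048576  ?  R  10:00  1:00  python athena.py\n"
)
print(athena_rss_kib(sample))  # 2097152 KiB, i.e. 2 GiB for two workers
```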
6)
Message boards :
Number crunching :
Memory requirements for LHC applications
(Message 37318)
Posted 12 Nov 2018 by BITLab Argo Post:
"I can't though see an easy way to figure the (native Atlas) formula out..."
Thinking about it, this may be harder than I thought, as part of what Condor does is to summarise the local resources available and then fetch tasks that match. So maybe my earlier "... maybe I was just lucky to hit a bunch of undemanding tasks" was actually the whole system doing the right thing by sending smaller tasks to low-spec machines. So then the formula will depend on the "smallest" sub-task that the project is willing to issue ... maybe an Atlas person could comment? I can envisage a minimalist formula of, say, 300MB + (1024MB * nCores) to get some native jobs on low-spec machines, but then there might be a lot of idle time once the small sub-tasks run out.
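The minimalist formula floated above works out as follows. Both constants (300 MB overhead, 1024 MB per core) are the post's guesses, not project-confirmed figures:

```python
def min_native_atlas_mb(n_cores: int,
                        overhead_mb: int = 300,
                        per_core_mb: int = 1024) -> int:
    """Guessed lower bound on memory for an n-core native Atlas task (MB)."""
    return overhead_mb + per_core_mb * n_cores

print(min_native_atlas_mb(1))  # 1324 MB
print(min_native_atlas_mb(4))  # 4396 MB
```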
7)
Message boards :
Number crunching :
Memory requirements for LHC applications
(Message 37307)
Posted 11 Nov 2018 by BITLab Argo Post:
"I think we're done. Can someone with the authority please take this last copy and post it as a pinned message?"
"I still think that the mentioned formula for native ATLAS tasks is wrong!"
I agree - I've run 4-core native Atlas on (dedicated) 4 GB machines without significantly worse throughput than 8 GB boxes. I can't though see an easy way to figure the formula out... Unfortunately I've left the institution, the BITlab project is gone, and the machines are no doubt heading ever closer to the skip; otherwise I'd have tried running a 1-core Atlas on one of the 2 GB boxes, just to see what happens.
(I will admit I was slightly surprised to be getting 4x athena.py on the 4 GB machines. But IIRC the "overhead" (BOINC client, CVMFS, other Atlas processes) was only a few hundred MB, and maybe I was just lucky to hit a bunch of undemanding tasks.)
8)
Message boards :
LHCb Application :
207 (0x000000CF) EXIT_NO_SUB_TASKS
(Message 37286)
Posted 8 Nov 2018 by BITLab Argo Post:
All three - they weren't identical, but all around an intermediate value, neither idle nor running flat out (and I looked quite a few times). This was for single-core tasks that were the only thing running on a two-core machine.
"...the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3..."
"Which loadavg do you refer to? This is different to CPU-time/wallclock."
I know these are measurements of different things, and the latter's poor efficiency values have caused lots of complaints on these boards. I'm puzzled because I would expect the obvious causes (e.g. slow payload downloads, limited availability of sub-tasks) to make the load flip between 0 and ~100% values. Ten minutes idle followed by five minutes flat out would give ~0.3 on the 15-minute number, but it should be an exception to see it on the 1-minute number at the same time. Hence my mentioning it: it's as if - for me - the poor efficiency is because the code never runs flat out for some reason (have they requested a 1 GHz virtual CPU on a 3 GHz machine? :-) ).
"LHCb's scientific app appears in the process list as a python script and should run at 100% (or close to)."
I can't remember seeing that - I thought it was all wrapped up inside the VirtualBox process. Watchdogs, Condor et al. shouldn't be using significant CPU for any length of time.
"Be aware that the 1st calculation phase of this script runs only for roughly 1 min."
My understanding - which may very well be wrong and need updating! - is that the VBox job workflow goes something like:
until the VM has been running for 10 hrs, after which no new jobs/sub-tasks are requested and the VM stops when the final running one is finished (or is killed after 18 hrs by the wrapper).
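The duty-cycle arithmetic above (10 minutes idle, 5 minutes flat out giving a ~0.3 load) is just a steady-state approximation; real loadavg is an exponentially damped moving average, so this toy model only matches windows much longer than one idle/busy cycle:

```python
# Toy check of the duty-cycle claim: long-run average load of a task that
# alternates busy_min minutes flat out with idle_min minutes idle.
def duty_cycle_load(busy_min: float, idle_min: float) -> float:
    return busy_min / (busy_min + idle_min)

print(round(duty_cycle_load(5, 10), 2))  # 0.33
```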
9)
Message boards :
LHCb Application :
207 (0x000000CF) EXIT_NO_SUB_TASKS
(Message 37284)
Posted 8 Nov 2018 by BITLab Argo Post:
"And if the tasks get jobs, overall CPU usage usually is at most about 5-10% of the total runtime of the task."
I was seeing slightly better, but the interesting thing was that a single-core LHCb task would show a sustained loadavg of about 0.3, which was broadly in line with the reported task efficiency (CPU-time/wallclock-run-time); that seems too little CPU for actual number-crunching, but way too much for a VM idling or transferring data.
10)
Message boards :
LHCb Application :
207 (0x000000CF) EXIT_NO_SUB_TASKS
(Message 37281)
Posted 8 Nov 2018 by BITLab Argo Post:
Sorry, I got the lingo wrong. It should have been:
"Runtime of last 100 tasks in hours: average, min, max"
... which suggests that a few people must be getting jobs/sub-tasks, else all LHCb tasks would fail after 20 minutes (207 (0x000000CF) EXIT_NO_SUB_TASKS)? And today the page reports:
"Runtime of last 100 tasks in hours: average, min, max"
so the server sees more tasks running longer.
11)
Message boards :
LHCb Application :
207 (0x000000CF) EXIT_NO_SUB_TASKS
(Message 37269)
Posted 7 Nov 2018 by BITLab Argo Post:
Surely, if they run out of work because we've crunched it all, that's a good thing. Would be nice if someone said something, though! What puzzles me is that the "Server status" page has:
"Runtime of last 100 tasks in hours: average, min, max"
which suggests that a few people must be getting tasks, else surely all the values would be around 0.3 h (20 min.) by now?
12)
Message boards :
Number crunching :
Memory requirements for LHC applications
(Message 37226)
Posted 4 Nov 2018 by BITLab Argo Post:
"Corrections are incorporated."
I think your LHCb formula is wrong - it doesn't match what I posted in the other thread (https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4867&postid=37168), and the latter matches what I see running. Also, your list is for VirtualBox jobs, not multi-threaded native Atlas, which seems to have more frugal requirements.
13)
Message boards :
Number crunching :
Local control of which subprojects run
(Message 37206)
Posted 3 Nov 2018 by BITLab Argo Post:
I disagree:
"Most helpful would be..."
... not having to hand-hack XML to make things work, in the first place.
14)
Message boards :
Number crunching :
Local control of which subprojects run
(Message 37168)
Posted 2 Nov 2018 by BITLab Argo Post:
"The memory requirements formulae are in a few posts ... Happy to pin it if anyone finds a definitive formula"
I found the following for VirtualBox tasks fairly recently; apologies for not noting where:
Atlas: 3GB + (0.9GB * ncores)
(the numbers show the memory requirement for multiple threads within one VM; the LHCb and Theory figures match what I see boinc_client pass to the vboxwrapper, so there will be a little bit more overhead needed)
I don't know what the requirements for Sixtrack are; certainly <= 512MB per task. I don't know what the requirements for native Atlas are; I have successfully (HITS files produced!) run 4-core native Atlas on 4-core 4 GB machines without drama. I did increase BOINC's allowed memory usage from the default 50% to 85%.
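As a quick sanity check, the Atlas VirtualBox rule of thumb quoted above works out as follows (a sketch of the posted formula, not an official project figure):

```python
def atlas_vbox_mem_gb(n_cores: int) -> float:
    """Posted rule of thumb: 3 GB base + 0.9 GB per core inside the VM."""
    return 3.0 + 0.9 * n_cores

print(atlas_vbox_mem_gb(1))  # 3.9
print(atlas_vbox_mem_gb(4))  # 6.6
```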
15)
Message boards :
Number crunching :
Local control of which subprojects run
(Message 37157)
Posted 1 Nov 2018 by BITLab Argo Post:
"The various subprojects of LHC have really different characteristics, and the only existing way to control which ones are run is at the account level"
Why do you feel the "Location" feature is insufficient/inappropriate for you? The only issue I've had with locations is that the website splits their configuration across two separate pages on different menus, which is a nuisance when making sure everything is consistent at setup. After that it worked fine, and it's trivial to flip a machine between subprojects when needed.
16)
Message boards :
ATLAS application :
ATLAS native - Configure CVMFS to work with openhtc.io
(Message 37031)
Posted 15 Oct 2018 by BITLab Argo Post:
"I'm trying to run the Atlas native application, however I can see the following error:"
I use
singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/images/singularity/x86_64-slc6.img hostname
to test singularity (this also depends on CVMFS working, of course). I install singularity from RPMs and haven't seen your issue; I would guess that you have installed singularity somewhere non-standard and so need to restart the BOINC client and make sure it's picking the path up correctly. Else, if you have a deeper problem, please start a new thread so others with the same problem can find it.
17)
Message boards :
LHCb Application :
New version v1.05
(Message 36983)
Posted 9 Oct 2018 by BITLab Argo Post:
"the problem still persists; I started a LHCb task 3 hours ago, and the properties show a CPU time of 20 minutes :-("
My (single-core) LHCb jobs also have very poor efficiencies: 10 - 50%. A random job's log snippet:
2018-10-09 01:39:15 (7650): Status Report: Job Duration: '64800.000000'
2018-10-09 01:39:15 (7650): Status Report: Elapsed Time: '6000.000000'
2018-10-09 01:39:15 (7650): Status Report: CPU Time: '954.280000'
It does look like the pilots are getting work from Condor, but I haven't poked around enough to work out if the poor efficiency comes from waiting on subsequent downloads/uploads.
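The efficiency of the job in that log snippet can be read straight off the numbers (CPU time over elapsed time, the same ratio discussed elsewhere in the thread), and it does indeed fall inside the 10 - 50% range:

```python
# Efficiency of the job in the log snippet above: CPU time / elapsed time.
cpu_time = 954.28   # seconds, from "CPU Time"
elapsed = 6000.0    # seconds, from "Elapsed Time"
efficiency = cpu_time / elapsed
print(f"{efficiency:.1%}")  # 15.9%
```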
18)
Message boards :
CMS Application :
no new WUs available
(Message 36825)
Posted 22 Sep 2018 by BITLab Argo Post:
"on another PC, again, same problem:"
But that log suggests that it actually did run 8 jobs, one of which used nearly 10 hrs of CPU time! It looks like the Condor client is getting confused somehow.
"Are these tasks not checked before they are released to the volunteers?"
To be fair, I've not seen any announcement from CMS suggesting the mass of us re-attach...
19)
Message boards :
Theory Application :
New version 263.80
(Message 36803)
Posted 21 Sep 2018 by BITLab Argo Post:
"No idea why, but you're lucky: mine seem to have dropped by a factor of 100!"
... and then this morning the credit rates have come back up again, but only by a factor of ten...
20)
Message boards :
Theory Application :
New version 263.80
(Message 36800)
Posted 21 Sep 2018 by BITLab Argo Post:
No idea why, but you're lucky: mine seem to have dropped by a factor of 100! See e.g. hostid=10414406
©2024 CERN