1) Message boards : ATLAS application : Use existing Tier-2 cvmfs squid for boinc ATLAS@home hosts (Message 44333)
Posted 17 Feb 2021 by Henry Nebrensky
Post:
A separate squid instance on each worker box will reduce efficiency and increase maintenance effort.

As the OP claims to be close to a WLCG Tier2 site I figured network bandwidth wouldn't be an issue. If there are enough machines for maintenance to become a significant burden then you want to be using a configuration management system anyway, in which case I found that keeping the machines identical made things simpler! And on a corporate/enterprise network, a separate squid is likely to result in questions from network admins about what it's doing there, who can connect through it, and to where...

For CVMFS the biggest single gain is in maintaining the cache across successive VM initialisations; the local squid route gets that without affecting any other activity on the machine, and without exposing new services on the network.
If I were in the OP's situation - and I've been close - I'd start there. Once the OP has some experience and a track record with their networking team they can always take another step, whether that's the standalone BOINC squid or fudgerating the local squids to use the Tier2 one or whatever else. (It'll also depend on what else the machine is used for - the OP mentioned a separate batch system in another post.)
2) Message boards : ATLAS application : Use existing Tier-2 cvmfs squid for boinc ATLAS@home hosts (Message 44328)
Posted 17 Feb 2021 by Henry Nebrensky
Post:
We do already have a cvmfs squid for our local WLCG Tier2 cluster and the obvious plan is to make our boinc hosts connect to it.
Does anybody have a working configuration for this scenario and could share it?

Years ago, for Atlas-native I simply used the same basic configuration of CVMFS on the boinc nodes as on the Grid cluster nodes.
As you're going with VirtualBox I have no idea how it will work out: defining a proxy in BOINC will route all web traffic through it, including the VM's Frontier traffic as well as BOINC job requests. Also, AIUI the Atlas VMs here have CVMFS preconfigured to use the Cloudflare stratum one server instead of the CERN ones. In that case I don't think you'll get the expected efficiency either, as the squid would end up having to hold two copies of each file: one fetched via Cloudflare for the BOINC VMs and one from CERN for the Grid nodes. My inclination would be to just run a BOINC-specific squid on each machine to hold the local CVMFS cache across successive VM instances.
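For the native case mentioned above, pointing the CVMFS client at an existing site squid is a client-side setting. A minimal sketch, assuming the client reads /etc/cvmfs/default.local; the squid host name and port are placeholders, not a real service, and this does nothing for the VirtualBox VMs, which carry their own CVMFS configuration:

```shell
# /etc/cvmfs/default.local (fragment)
# squid.example.org:3128 is a made-up placeholder for the site's own squid.
CVMFS_HTTP_PROXY="http://squid.example.org:3128"
```

After changing it, something like "cvmfs_config reload" followed by "cvmfs_config probe" should show whether the repositories still mount through the proxy.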
3) Message boards : Theory Application : WUs stuck at 100% (Message 44055)
Posted 3 Jan 2021 by Henry Nebrensky
Post:
When my WUs reach 100% ...
Don't take BOINC's "percentages" too seriously. For native Theory you can see what the project code is up to with something like:
tail -n 20 /var/lib/boinc/slots/?/cernvm/shared/runRivet.log 
Most Theory apps will report progress as number of events processed, which you can compare with the total number of events requested (second to last number on the very first line).
As long as there's a steady rate of progress the task is fine. If the log file is stale (timestamp more than two hours ago) then I (personally) assume the code is stuck in an infinite loop, and abort the task.
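The two checks above can be scripted. A rough sketch, assuming the runRivet.log layout quoted in these posts (a first line ending in "... <events requested> <number>]" and progress lines of the form "N events processed") and GNU stat; the function names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: relies on the log layout described above and on GNU stat.

# Print "<events processed> / <events requested>" for one runRivet.log.
rivet_progress() {
    log=$1
    # Events requested: second-to-last number on the very first line.
    total=$(head -n 1 "$log" | tr -d ']' | awk '{print $(NF-1)}')
    # Most recent "N events processed" report, if there is one yet.
    processed=$(grep -o '[0-9][0-9]* events processed' "$log" | tail -n 1 | awk '{print $1}')
    echo "${processed:-0} / $total"
}

# Succeed if the log was last modified more than $2 seconds ago
# (default 7200, i.e. the two-hour rule of thumb above).
log_is_stale() {
    log=$1
    max_age=${2:-7200}
    now=$(date +%s)
    mtime=$(stat -c %Y "$log") || return 2
    [ $((now - mtime)) -gt "$max_age" ]
}

# Example use over all slots:
#   for f in /var/lib/boinc/slots/?/cernvm/shared/runRivet.log; do
#       echo "$f: $(rivet_progress "$f")"
#       log_is_stale "$f" && echo "  stale - possibly stuck"
#   done
```

This only reports; deciding whether to abort is still a judgment call.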

(Unfortunately my computers have all had to be moved to the great datacentre in the sky, so I've written this down where I might stand a chance of finding it if I ever get to pick this up again!)
4) Message boards : Number crunching : Cupboard is bare? (Message 44033)
Posted 31 Dec 2020 by Henry Nebrensky
Post:
Atlas is back, thank you.
Embarrassingly, that caught me by surprise as I wasn't expecting things to be fixed until next week. I'm having to shut my cluster down tonight as it must relocate to the great data-centre in the sky :(, and having spent ages aborting CMS tasks so that the work finishes at a sensible time, a load of Atlas tasks suddenly appears!
5) Message boards : ATLAS application : Change in Credit? (Message 43807)
Posted 9 Dec 2020 by Henry Nebrensky
Post:
Yes, this is when a new Application-Version is installed.
But this was in -native AND in Windows.
Now it seems back to normal for both.
Having drafted this off-line: mine went down:

ID         Run time (s)  CPU time (s)  Credit    Application
290583495  17,762.46     68,998.94     433.84    ATLAS Simulation v2.84 (native_mt) x86_64-pc-linux-gnu
290449999  17,692.59     69,222.89     473.77    ATLAS Simulation v2.83 (native_mt) x86_64-pc-linux-gnu
290381406  17,758.57     68,993.30     1,027.94  ATLAS Simulation v2.82 (native_mt) x86_64-pc-linux-gnu

There again, on the same machine I get
289610329 21,778.25 21,778.25 180.94 Theory Simulation v300.06 (native_theory) x86_64-pc-linux-gnu
290203308 32,569.85 32,259.48 298.90 SixTrack v502.05 (avx) x86_64-pc-linux-gnu
so the new rate of Atlas credits is much closer to that for Theory and SixTrack.

And having given it time to untangle itself, at last I got:
290642977 18,283.33 70,903.93 806.63 ATLAS Simulation v2.84 (native_mt) x86_64-pc-linux-gnu
290625445 17,840.59 69,256.80 315.02 ATLAS Simulation v2.84 (native_mt) x86_64-pc-linux-gnu
6) Message boards : ATLAS application : Confused (Message 43787)
Posted 4 Dec 2020 by Henry Nebrensky
Post:
The scheduler really has problems balancing multi-core WUs; if you want to run these, it may be necessary to help the scheduler.

I don't have too many problems with mixing multi-core Atlas with Theory or SixTrack on 10617965. Keeping a steady stream of jobs on hand helps. Making changes such as to #Cores will upset the client - I just grit my teeth and wait a couple of days while it sorts itself out.
I find that trying to micro-manage BOINC makes things worse, not better. :(
7) Message boards : Theory Application : Pythia8 looooooong runner! (Message 43776)
Posted 4 Dec 2020 by Henry Nebrensky
Post:
pythia8 8.301 dire-default is one you should probably abort since they always run for 10 days and fail (all the ones I have checked)
Mine's been updating the log file with believable, if slow, progress...

10:25:41 GMT +00:00 2020-11-29: cranky-0.0.32: [INFO] ===> [runRivet] Sun Nov 29 10:25:39 UTC 2020 [boinc pp jets 7000 150,-,1860 - pythia8 8.301 dire-default 59000 150]
17:34:27 GMT +00:00 2020-12-03: cranky-0.0.32: [INFO] Container 'runc' finished with status code 0.

Run time 4 days 7 hours 8 min 53 sec
CPU time 4 days 6 hours 37 min 23 sec
Credit 3,085.19
Peak working set size 293.17 MB
Peak swap size 600.68 MB
Peak disk usage 1.86 MB
8) Message boards : Theory Application : Pythia8 looooooong runner! (Message 43775)
Posted 3 Dec 2020 by Henry Nebrensky
Post:
pythia8 8.301 dire-default is one you should probably abort since they always run for 10 days and fail (all the ones I have checked)
Mine's been updating the log file with believable, if slow, progress:
58100 events processed
so I've let it run. Guess we find out this evening...

"dire" does indeed seem to be code for troublesome, though.
9) Message boards : Theory Application : Pythia8 looooooong runner! (Message 43763)
Posted 1 Dec 2020 by Henry Nebrensky
Post:
I spotted 289594409 as it was taking so long:

===> [runRivet] Sat Nov 28 23:15:38 UTC 2020 [boinc PbPb heavyion-mb 2760 - - pythia8 8.230 default 90000 150]

Run time 1 days 23 hours 53 min 55 sec
CPU time 1 days 23 hours 50 min 55 sec
Peak working set size 186.35 MB

At least this lead-lead task might actually have succeeded:
Container 'runc' finished with status code 0.


Meanwhile, elsewhere:
2688018 boinc     39  19   53544  20764   7116 R  96.3   0.3   3561:32 pythia8.exe

has reached "34300 events processed" after more than two days... luckily it's only going for 59k events.
===> [runRivet] Sun Nov 29 10:25:39 UTC 2020 [boinc pp jets 7000 150,-,1860 - pythia8 8.301 dire-default 59000 150]
10) Message boards : ATLAS application : How is Work-Distribution calculated ? (Message 43730)
Posted 28 Nov 2020 by Henry Nebrensky
Post:
Hi,
I would rather not remove the limits completely since many hosts will end up with tasks they will not be able to process before the deadline.
I'm not sure I understand this: a "maximum" is a limit, not a requirement that so many tasks be downloaded. We've seen the same issue with CMS.
There's already a mechanism in BOINC for requesting how much work to download: the local cache length ("Store at least" and "Store up to an additional" ... days of work). This works with SixTrack, and both CMS and Atlas have steady, repeatably-sized tasks compatible with this approach. Respecting this user configuration setting would let the project raise the Max # jobs without having to worry about the side-effect on smaller machines.
11) Message boards : CMS Application : Please check your task times and your IPv6 connectivity (Message 43639)
Posted 17 Nov 2020 by Henry Nebrensky
Post:
So it seems Max tasks is the critical figure. We could ask Laurence/Nils to add more options to the table if you want >8 but less than infinity.
The flood does seem to stop eventually - I think it's at about (60 * max_concurrent) - but in that situation I've always been distracted by "how do loops work in bash so as to abort most of these jobs" from analysing any significance of the number.
The point is it's many, many more than would be expected from the cache settings in the web preferences, which I think spooks people. If I specify I want to keep two days' work in hand, why is BOINC giving me a month's worth?
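For the record, the bash loop in question might look something like the sketch below. It assumes that "boinccmd --get_tasks" prints one indented "name: <task name>" line per task, and that the project is attached under the URL shown; check both against your own client before trusting it. The function names and the name pattern are made up for illustration.

```shell
#!/bin/bash
# Sketch only: the --get_tasks output format and the project URL are assumptions.
PROJECT_URL=${PROJECT_URL:-https://lhcathome.cern.ch/lhcathome/}
BOINCCMD=${BOINCCMD:-boinccmd}

# Read --get_tasks output on stdin, print the task names matching pattern $1.
# "WU name:" lines are skipped because they do not start with "name:".
task_names() {
    sed -n 's/^ *name: //p' | grep -- "$1"
}

# Abort every task whose name matches $1, except the first $2 in the list.
# No attempt is made to spare tasks that are already running.
abort_surplus() {
    pattern=$1
    keep=$2
    "$BOINCCMD" --get_tasks | task_names "$pattern" | tail -n +"$((keep + 1))" |
    while read -r name; do
        "$BOINCCMD" --task "$PROJECT_URL" "$name" abort </dev/null
    done
}

# Example (hypothetical name pattern): keep 8 matching tasks, abort the rest.
#   abort_surplus 'CMS' 8
```

Running the pipeline up to the tail first, without the loop, shows which tasks would be hit.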
12) Message boards : CMS Application : Please check your task times and your IPv6 connectivity (Message 43587)
Posted 8 Nov 2020 by Henry Nebrensky
Post:
My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks, and only if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Cache set to 0.5 days + 1.5 days; "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine I've never let it get that far.

My pet theory is that this is some side-effect of how Atlas and Theory artificially limit their task numbers based on core numbers (such that I can't get two days' worth of work between them). That, and the usual chaos one gets from trying to micro-manage the BOINC client (see above) rather than just letting it get on with it.

p.s. This discussion should probably be in this thread?
13) Message boards : Number crunching : Atlas/Theory native in prefs - Run native if available? (Message 43483)
Posted 8 Oct 2020 by Henry Nebrensky
Post:
I agree.
The server should check if the client is running Windows or Linux.
Couldn't this also be solved simply by making native tasks require Linux as the OS? Then there shouldn't be native tasks "available" to a Windows client in the first place...
14) Message boards : Theory Application : Slot-Number are growing (Message 43395)
Posted 23 Sep 2020 by Henry Nebrensky
Post:
Type
boinccmd --get_file_transfers
at the command line, while in a directory with a copy of the client's gui_rpc_auth.cfg file in it.
first an empty line and then
============== File Transfers ============
Doesn't look like there are any stuck transfers, so must be something else wrong.

This Version is included in CentOS77.
I wasn't aware they'd added it to CentOS itself, but then I haven't built a CentOS 7 machine from scratch in a while. I use the current version from EPEL (7.16.6-3):

~ > rpm -qi boinc-client
Name        : boinc-client
Version     : 7.16.6
Release     : 3.el7
Source RPM  : boinc-client-7.16.6-3.el7.src.rpm
Packager    : Fedora Project
15) Message boards : Theory Application : Slot-Number are growing (Message 43381)
Posted 20 Sep 2020 by Henry Nebrensky
Post:
(boinccmd --get_file_transfers) how is it possible to use?
Type
boinccmd --get_file_transfers
at the command line, while in a directory with a copy of the client's gui_rpc_auth.cfg file in it.
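If copying the file around is a nuisance, boinccmd also accepts the password directly via --passwd. A small sketch, assuming the client's data directory is /var/lib/boinc and that you have read permission on the file there; the function names and the BOINC_DIR variable are made up for illustration:

```shell
#!/bin/sh
# Sketch only: /var/lib/boinc as the data directory is an assumption.

# Print the path of a readable gui_rpc_auth.cfg: current directory first,
# then the client's data directory.
find_rpc_auth() {
    for dir in . "${BOINC_DIR:-/var/lib/boinc}"; do
        if [ -r "$dir/gui_rpc_auth.cfg" ]; then
            echo "$dir/gui_rpc_auth.cfg"
            return 0
        fi
    done
    return 1
}

# List file transfers, passing the RPC password explicitly.
list_transfers() {
    auth=$(find_rpc_auth) || { echo "no readable gui_rpc_auth.cfg found" >&2; return 1; }
    boinccmd --passwd "$(cat "$auth")" --get_file_transfers
}
```

Then list_transfers can be run from any directory.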
(Message options from Boinc?)
I wouldn't know - I don't use the GUI stuff.
16) Message boards : Theory Application : Theory Sherpa (Message 43376)
Posted 20 Sep 2020 by Henry Nebrensky
Post:
sherpa 2.2.4 default
pp winclusive 7000 20 - 0+20/20 - Lean back and enjoy since 24 hours?!?
I'd give up on it.
17) Message boards : Theory Application : Slot-Number are growing (Message 43375)
Posted 20 Sep 2020 by Henry Nebrensky
Post:
I checked my machine yesterday and today and don't see this (CentOS7 kernel 3.10.0-1127.19.1.el7.x86_64 running CMS and Theory Native via BOINC 7.16.6 from EPEL running as a daemon).

File output.tgz (Theory) is not deleted in slot-folder.
Are all the upload transfers succeeding (boinccmd --get_file_transfers)?
18) Message boards : Theory Application : MadGraph5 (Message 43273)
Posted 24 Aug 2020 by Henry Nebrensky
Post:
True - there's a sampling feature in that I only check in rarely and follow up on tasks that look to be misbehaving. I also wonder if madgraph behaves better within a VM where it can't see any other cores.
19) Message boards : Theory Application : MadGraph5 (Message 43271)
Posted 24 Aug 2020 by Henry Nebrensky
Post:
It also has significant stretches of not actually using CPU at all.
e.g. I recently killed task 281349801 precisely because it was holding two cores but idle - it's reported as using just 50 mins of CPU in 20 hours :(

We did have a thread (which maeax has kindly tracked down) about it some months back.
This does remind me that I was going to complain there that even hard-wiring the coreness to two isn't really good enough - it should either be one, or else the WUs should be submitted to BOINC with a consistent #cores requirement.
20) Message boards : Theory Application : MadGraph5 (Message 43270)
Posted 24 Aug 2020 by Henry Nebrensky
Post:
Have found this thread you wrote - Extreme overload:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5323#41736
Have native Linux with ONE CPU, but in the log there is an entry to use two CPUs (set nb_core 2)
How can this second CPU be used?

Same way as it used all 232 cores on computezrmle's machine! :(
It'll just chuck processes at the OS and see what happens - isn't there a rivetvm.exe as well, or is that idle while madgraph does its multiprocessing thing?

The running: is 2. Now 548 Completed and Idle: 50 (seems 600 is the max.)

Looking back at that thread you might want to do a
grep subprocess /var/lib/boinc/slots/?/cernvm/shared/runRivet.log
to check that 600 is indeed the correct number (edit: just in case "idle" doesn't mean what I think it does).



