Message boards : Theory Application : Compute Error with Native Theory jobs when Native Atlas and Sixtrack jobs work fine,

alee67

Joined: 2 Jan 19
Posts: 4
Credit: 1,661,241
RAC: 89
Message 44093 - Posted: 13 Jan 2021, 4:50:42 UTC

I have two Google Cloud instances running Ubuntu 20.04 (Google was giving me lots of free credits for publishing a couple of somewhat lame Google Assistant apps). They are set up to run Native Theory and Atlas jobs, including suspend/resume, following the instructions here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4971

The Native Atlas, as well as Sixtrack, jobs run fine, but the Native Theory jobs were all exiting quickly. The ones I've looked at all seem to have the same problem. This is an example: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4971

What seems to be happening is that when BOINC starts, /sbin/create-boinc-cgroup creates boinc subdirectories, and files within them, in several subdirectories of /sys/fs/cgroup, but this fails for /sys/fs/cgroup/blkio/boinc. Delaying the creation of /sys/fs/cgroup/blkio/boinc and the files within it seems to fix the problem. I'd guess that it needs to wait until something else has completed, but I don't know what that is or how to wait for it, so I just used a simple time delay. One way is to insert sleep commands into /sbin/create-boinc-cgroup, but I ended up creating the file /etc/systemd/system/boinc-client.timer:

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Timer]
OnBootSec=5min

[Install]
WantedBy=timers.target


I found that a 1 minute delay didn't work, and a 2 minute delay didn't always work, so I just set it to 5 minutes.

Is there a better way to do this?
ID: 44093
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2400
Credit: 225,142,212
RAC: 123,845
Message 44097 - Posted: 13 Jan 2021, 8:24:59 UTC - in response to Message 44093.  

Do you really need pause/resume for Theory?
I guess not, since you also run ATLAS, which always starts a task from scratch if it has been interrupted.


If you need it and the problem is caused by the creation order of the cgroup folders, you can split the folder creation into 2 services:
Service 1 creates the uncritical folders.
Service 2 creates the folders that require service 1 to have finished.

A fixed delay should be avoided since it may occasionally cause race conditions.
Instead, see the systemd manual for how to configure service dependencies using "Wants", "Requires", "Before", "After" and related options.
https://www.freedesktop.org/software/systemd/man/systemd.unit.html#%5BUnit%5D%20Section%20Options
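
The two-service split could look roughly like the sketch below. It is only an outline under assumptions: the unit names and the two script paths are hypothetical (they stand for /sbin/create-boinc-cgroup split into an uncritical part and a blkio part), and which step really has to finish first still needs to be found out on the affected system. The two units are separate files, shown together here.

```ini
# File 1: /etc/systemd/system/boinc-cgroup-base.service  (hypothetical name)
[Unit]
Description=Create the uncritical BOINC cgroup folders

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical script: the uncritical part of /sbin/create-boinc-cgroup
ExecStart=/sbin/create-boinc-cgroup-base

[Install]
WantedBy=multi-user.target

# File 2: /etc/systemd/system/boinc-cgroup-blkio.service  (hypothetical name)
[Unit]
Description=Create /sys/fs/cgroup/blkio/boinc
# Pull in service 1 and start only after it has finished
Requires=boinc-cgroup-base.service
After=boinc-cgroup-base.service
# Make sure the BOINC client is started after this unit
Before=boinc-client.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical script: the blkio part of /sbin/create-boinc-cgroup
ExecStart=/sbin/create-boinc-cgroup-blkio

[Install]
WantedBy=multi-user.target
```

"Requires" alone does not imply an order, which is why "After" is set as well. With Type=oneshot the unit counts as started only once its ExecStart command has exited, so the ordering waits for the script to complete rather than for a fixed time.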
ID: 44097
alee67

Joined: 2 Jan 19
Posts: 4
Credit: 1,661,241
RAC: 89
Message 44102 - Posted: 14 Jan 2021, 4:58:09 UTC - in response to Message 44097.  

These are preemptible Google Cloud instances (that cost about 1/4 as much as regular non-preemptible instances, so I can get a lot more computing power for my credits) that are only running BOINC projects, and I haven't installed anything on top of the Ubuntu 20.04 image that Google provides besides BOINC and CVMFS. I have no idea what needs to finish before /sys/fs/cgroup/blkio/boinc and the files within can be created, which keeps me from using anything besides a fixed time delay.
ID: 44102
alee67

Joined: 2 Jan 19
Posts: 4
Credit: 1,661,241
RAC: 89
Message 44120 - Posted: 16 Jan 2021, 6:04:21 UTC

It appears that the 5 minute fixed time delay for starting BOINC doesn't always prevent the problem that I've been having with Native Theory. Also, it appears that I'm not alone in having trouble with /sys/fs/cgroup/blkio/boinc. At least on this work unit, someone else seems to have the same problem: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=151827668
ID: 44120
alee67

Joined: 2 Jan 19
Posts: 4
Credit: 1,661,241
RAC: 89
Message 44528 - Posted: 22 Mar 2021, 2:57:23 UTC - in response to Message 44120.  

Someone elsewhere gave me an interesting fix for my problem. Add the following to /etc/systemd/system/boinc-client.service:

[Service]
. . .
MemoryAccounting=true
IOAccounting=true
BlockIOAccounting=true
CPUAccounting=true


This creates some empty files in some of the /sys/fs/cgroup/*/boinc directories, which appears to ensure that the directories are created and can't be deleted. My earlier workaround was to recreate the directories periodically when needed; this seems to work better.
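
The same settings can probably also go into a systemd drop-in file instead of the unit file itself, so they are kept apart from the rest of the service definition. A minimal sketch, assuming the unit is named boinc-client.service as above; the file name accounting.conf is just an example:

```ini
# /etc/systemd/system/boinc-client.service.d/accounting.conf
# (drop-in directory for boinc-client.service; the file name is an example)
[Service]
MemoryAccounting=true
IOAccounting=true
BlockIOAccounting=true
CPUAccounting=true
```

After adding the file, run "systemctl daemon-reload" and restart boinc-client so that systemd picks up the change.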
ID: 44528
