Compute Error with Native Theory jobs when Native Atlas and Sixtrack jobs work fine
Author | Message |
---|---|
Joined: 2 Jan 19 Posts: 4 Credit: 1,665,861 RAC: 1 |
I have two Google Cloud instances running Ubuntu 20.04 (Google was giving me lots of free credits for publishing a couple of somewhat lame Google Assistant apps). They are set up to run Native Theory and Atlas jobs, including suspend/resume, following the instructions here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4971 The Native Atlas and Sixtrack jobs run fine, but the Native Theory jobs all exit quickly. The ones I've looked at all seem to have the same problem. This is an example: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4971 What seems to be happening is that when BOINC starts, /sbin/create-boinc-cgroup creates boinc subdirectories, and files within them, in several subdirectories of /sys/fs/cgroup , but this fails for /sys/fs/cgroup/blkio/boinc . Delaying the creation of /sys/fs/cgroup/blkio/boinc and the files within it seems to fix the problem. I'd guess that it needs to be delayed until something else has completed, but I don't know what that is or how to wait for it, so I just used a simple time delay. One way is to insert sleep commands into /sbin/create-boinc-cgroup , but I ended up creating the file /etc/systemd/system/boinc-client.timer : [Unit] Description=Berkeley Open Infrastructure Network Computing Client Documentation=man:boinc(1) After=network-online.target [Timer] OnBootSec=5min [Install] WantedBy=timers.target I found that a 1 minute delay didn't work and a 2 minute delay didn't always work, so I set it to 5 minutes. Is there a better way to do this? |
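Laid out as a unit file, the timer described in this post might look like the sketch below. The explicit `Unit=` line is not in the original post; it only spells out the default, since a timer activates the service of the same name unless told otherwise.

```ini
# /etc/systemd/system/boinc-client.timer
[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Timer]
# Start the matching service 5 minutes after boot
OnBootSec=5min
# Not in the original post: makes the default target unit explicit
Unit=boinc-client.service

[Install]
WantedBy=timers.target
```

A timer like this only delays startup if boinc-client.service is not itself enabled to start at boot. Assuming the stock service name, `systemctl disable boinc-client.service` followed by `systemctl enable boinc-client.timer` would hand startup over to the timer.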
Joined: 15 Jun 08 Posts: 2534 Credit: 253,973,449 RAC: 44,024 |
Do you really need pause/resume for Theory? I guess not, since you also run ATLAS, which always restarts a task from scratch if it has been interrupted. If you do need it and the problem is caused by the creation order of the cgroup folders, you could split the folder creation into 2 services: Service 1 creates the uncritical folders. Service 2 creates the folders that require service 1 to have finished. A fixed delay should be avoided since it may occasionally cause race conditions. Instead, see the systemd manual for how to configure service dependencies using "Wants", "Requires", "Before", "After" and related options. https://www.freedesktop.org/software/systemd/man/systemd.unit.html#%5BUnit%5D%20Section%20Options |
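The two-service split suggested in this reply could be sketched roughly as follows. The unit names and the two script paths are hypothetical; they stand for /sbin/create-boinc-cgroup divided into the part that already works and the part that creates the blkio folder. The sketch shows only the ordering mechanics and does not identify what the blkio folder actually has to wait for.

```ini
# /etc/systemd/system/boinc-cgroup-base.service (hypothetical unit name)
[Unit]
Description=Create the uncritical BOINC cgroup folders
Before=boinc-client.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical script: the folder creation that already succeeds
ExecStart=/usr/local/sbin/create-boinc-cgroup-base

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/boinc-cgroup-blkio.service (hypothetical unit name)
[Unit]
Description=Create the blkio BOINC cgroup folder
# Pull in service 1 and do not start until it has finished
Requires=boinc-cgroup-base.service
After=boinc-cgroup-base.service
Before=boinc-client.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical script: only the /sys/fs/cgroup/blkio/boinc creation
ExecStart=/usr/local/sbin/create-boinc-cgroup-blkio

[Install]
WantedBy=multi-user.target
```

With `Type=oneshot`, a unit ordered `After=` another one waits until that unit's `ExecStart` command has exited, which is what turns the ordering into "service 1 has finished" rather than "service 1 has been started".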
Joined: 2 Jan 19 Posts: 4 Credit: 1,665,861 RAC: 1 |
These are preemptible Google Cloud instances (that cost about 1/4 as much as regular non-preemptible instances, so I can get a lot more computing power for my credits) that are only running BOINC projects, and I haven't installed anything on top of the Ubuntu 20.04 image that Google provides besides BOINC and CVMFS. I have no idea what needs to finish before /sys/fs/cgroup/blkio/boinc and the files within can be created, which keeps me from using anything besides a fixed time delay. |
Joined: 2 Jan 19 Posts: 4 Credit: 1,665,861 RAC: 1 |
It appears that the 5 minute fixed time delay for starting BOINC doesn't always prevent the problem that I've been having with Native Theory. Also, it appears that I'm not alone in having trouble with /sys/fs/cgroup/blkio/boinc . At least on this work unit, someone else seems to have the same problem: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=151827668 . |
Joined: 2 Jan 19 Posts: 4 Credit: 1,665,861 RAC: 1 |
Someone elsewhere gave me an interesting fix for my problem. Add the following to /etc/systemd/system/boinc-client.service: [Service] . . . MemoryAccounting=true IOAccounting=true BlockIOAccounting=true CPUAccounting=true This creates some empty files in some of the /sys/fs/cgroup/*/boinc directories, which appears to ensure that the directories are created and can't be deleted. My own fix had been to recreate the directories periodically when needed, and this seems to work better. |
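The same settings could also be kept in a drop-in file instead of a full copy of the unit. This is a sketch under that assumption: the drop-in path is the one `systemctl edit boinc-client.service` normally creates, and the four directives are exactly those quoted in the post.

```ini
# /etc/systemd/system/boinc-client.service.d/override.conf
[Service]
# Resource accounting options from the post above
MemoryAccounting=true
IOAccounting=true
BlockIOAccounting=true
CPUAccounting=true
```

After saving the file, `systemctl daemon-reload` followed by `systemctl restart boinc-client.service` should apply it. `BlockIOAccounting=` is the older cgroup v1 spelling and, as far as I know, is marked deprecated in newer systemd releases, so it may need to be dropped on later distributions.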
©2024 CERN