41) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 41422)
Posted 29 Jan 2020 by wujj123456
Post:
Finally finished: https://lhcathome.cern.ch/lhcathome/result.php?resultid=259641514

===> [runRivet] Mon Jan 20 15:24:51 UTC 2020 [boinc pp jets 8000 800 - sherpa 1.4.1 default 100000 16]

Run time 4 days 12 hours 57 min 5 sec
CPU time 4 days 12 hours 21 min 56 sec

It actually finishes? I have a few of these 1d+ or 2d+ WUs as well, sitting at 100% progress. I felt they would never finish...
42) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41410)
Posted 28 Jan 2020 by wujj123456
Post:
The BOINC data directory must be mounted inside the container, and with a default installation this is /var/lib/boinc-client/slots. If there are problems mounting /var you could try a different data directory or install BOINC in a different place. For example on my desktop I run boinc-client from my home directory because the root partition is too small.

Thanks for the reply. It looks like it's a bind mount, so I should be able to reproduce this easily without wasting WUs. However, it does seem to work locally, assuming that seeing the error message below means the container has been set up properly, remount included.

$ sudo su -l boinc -s /bin/bash -c '/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/32 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img sh ls'
INFO: Convert SIF file to sandbox...
/usr/bin/ls: /usr/bin/ls: cannot execute binary file
INFO: Cleaning up image...

Now I wonder if it's some setting in the default unit file that came with Ubuntu 19.10: https://pastebin.com/akEe8cyY. I am not that familiar with systemd unit files, but nothing looked suspicious after searching the man page. Clearly the symlink /var/lib/boinc must be resolving correctly, given that all WUs read and write /var/lib/boinc-client/ without a problem. Any ideas where I should look next?
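
One way to narrow down the unit file question is to list only the sandboxing directives it sets. This is a minimal sketch, not a diagnosis: the directive names are standard systemd.exec options, but whether any of them is behind the remount failure is an assumption, and the function name sandbox_directives is made up for illustration.

```python
# Sketch: list sandboxing-related directives found in a systemd unit file.
# Assumption: one of these options might interfere with the container's
# remount of /var; that has not been confirmed for this host.
SANDBOX_KEYS = {
    "ProtectSystem", "ProtectHome", "PrivateTmp", "PrivateDevices",
    "ReadOnlyPaths", "ReadWritePaths", "InaccessiblePaths",
    "NoNewPrivileges", "ProtectKernelTunables", "ProtectControlGroups",
}


def sandbox_directives(unit_text):
    """Return (key, value) pairs for sandboxing options set in unit_text."""
    found = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Service].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in SANDBOX_KEYS:
            found.append((key.strip(), value.strip()))
    return found
```

To use it, feed in the text of the unit file, for example the output of `systemctl cat boinc-client.service` (the service name is assumed from the package name), and look at what comes back.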
43) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41367)
Posted 27 Jan 2020 by wujj123456
Post:

A bit of research on the stderr error message may be significant.
"container creation failed: mount ->/var error: can't remount /var: operation not permitted"
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256777262
It seems to have something to do with how the local storage is mounted.
https://github.com/sylabs/singularity/issues/2282

I am running into the same error. Is this the /var on the host filesystem? I probably don't want singularity to remount /var on my host system, but if it's trying to remount because of some missing flags, I can check what they do and add them so that the remount becomes a no-op and succeeds.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=260003929

If I can't resolve this, is there a way to disable native ATLAS while still allowing native Theory, without refusing ATLAS work entirely?
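
On the mount flags question, a quick way to see which options /var is currently mounted with is to parse the mount table. A rough sketch, assuming the text comes from /proc/mounts; the function name mount_options_for is hypothetical, and it ignores the octal escapes (such as \040 for spaces) that the kernel uses in mount point names.

```python
import os


def mount_options_for(path, mounts_text):
    """Return (mount_point, options) for the mount that contains path.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, options = fields[1], fields[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on a tie the later entry wins,
            # since a later mount shadows an earlier one at the same place.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, options)
    return best
```

Called as `mount_options_for("/var/lib/boinc-client/slots", open("/proc/mounts").read())`, it returns the covering mount point and its option list, which can then be compared against whatever flags the remount asks for.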
44) Message boards : Theory Application : Unable to start VM on some WUs (Message 41258)
Posted 14 Jan 2020 by wujj123456
Post:
I configured automatic upgrades, but not automatic reboots. I thought the new kernel and components would only take effect after a reboot. Let me turn auto-update off so that vboxdrv always stays in sync with the kernel, and see if the results improve.
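
To check whether a vboxdrv module has actually been built for the kernel that is running right now, something like the following could help. It is only a sketch under assumptions: it searches the whole /lib/modules/<release> tree rather than relying on a specific DKMS install path, and the function name find_module_files is made up for illustration.

```python
import os


def find_module_files(release, name="vboxdrv", modules_root="/lib/modules"):
    """Return sorted paths of <name>.ko files (possibly compressed, e.g.
    .ko.xz) found under modules_root/release."""
    base = os.path.join(modules_root, release)
    matches = []
    # os.walk yields nothing if base does not exist, so a missing
    # kernel directory simply gives an empty list.
    for dirpath, _dirs, files in os.walk(base):
        for fname in files:
            if fname == name + ".ko" or fname.startswith(name + ".ko."):
                matches.append(os.path.join(dirpath, fname))
    return sorted(matches)
```

Running `find_module_files(os.uname().release)` after an unattended upgrade and getting an empty list back would suggest the module was not built for the running kernel.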
45) Message boards : Theory Application : Unable to start VM on some WUs (Message 41245)
Posted 14 Jan 2020 by wujj123456
Post:
I checked a few failed tasks and they all failed with messages like this.

2020-01-12 15:31:37 (2138):
Command: VBoxManage -q startvm "boinc_74513d880c5d6ae6" --type headless
Exit Code: 1
Output:
WARNING: The character device /dev/vboxdrv does not exist.
Please install the virtualbox-dkms package and the appropriate
headers, most likely linux-headers-generic.

You will not be able to start VMs until this problem is fixed.
VBoxManage: error: The virtual machine 'boinc_74513d880c5d6ae6' has terminated unexpectedly during startup with exit code 1 (0x1)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component MachineWrap, interface IMachine
Waiting for VM "boinc_74513d880c5d6ae6" to power on...\


Example failures:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259134333
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259134354

However, /dev/vboxdrv exists and the mentioned packages are installed on the host. The host also has valid results for the same application: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10595991&offset=0&show_names=0&state=0&appid=13

I couldn't find any smoking gun as to why it fails for some WUs but not others. Could /dev/vboxdrv temporarily become inaccessible for some reason I should be aware of?
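
To catch the device node disappearing for a while, one option is to sample it periodically and keep the timestamps. A minimal sketch, assuming a Linux host; the helper names is_char_device and sample_device are hypothetical, and it only checks that the node exists as a character device, not that the driver behind it is working.

```python
import os
import stat
import time


def is_char_device(path):
    """True if path exists and is a character device node."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing node or no permission to stat it.
        return False
    return stat.S_ISCHR(mode)


def sample_device(path, samples=5, interval_s=1.0):
    """Check path `samples` times, sleeping interval_s between checks.

    Returns a list of (unix_timestamp, present) tuples.
    """
    results = []
    for i in range(samples):
        results.append((time.time(), is_char_device(path)))
        if i < samples - 1:
            time.sleep(interval_s)
    return results
```

Leaving `sample_device("/dev/vboxdrv", samples=3600, interval_s=60)` running across an update window and comparing the False entries against the timestamps of the failed tasks would show whether the two line up.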



