1) Message boards : Sixtrack Application : please, remove non-optimized application SixTrack for 32 bit systems (Message 42642)
Posted 28 May 2020 by wujj123456
Post:
I would say just remove the non-optimized apps altogether... On my Ryzen 3, which is perfectly capable of AVX, I still get lots of non-optimized WUs, which take 50-100% longer to finish for the same credit. (I am using credit/hr as an approximation of efficiency since it's the same app. Feel free to correct that assumption if it's invalid.) I really doubt there are many systems that can't do SSE2 these days, and most should be able to do AVX too. It's also interesting that all the Linux apps are at least SSE2 builds, and apparently that's not a concern.
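
A minimal sketch of the credit-per-hour comparison, assuming (as the post does) that both builds get the same credit per task, so credit/hour scales with 1 / run time. The function name is hypothetical:

```python
def relative_credit_rate(slowdown: float) -> float:
    """Credit/hour of the slower build relative to the faster one.

    slowdown: extra run time as a fraction, e.g. 0.5 for "50% longer".
    With equal credit per task, the rate scales as 1 / (1 + slowdown).
    """
    return 1.0 / (1.0 + slowdown)


for slowdown in (0.5, 1.0):
    print(f"{slowdown:.0%} longer -> {relative_credit_rate(slowdown):.0%} of the optimized credit/hour")
```

So a 50-100% longer run time corresponds to roughly 67% down to 50% of the optimized build's credit/hour.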

I wonder if I could use app_info.xml to force the non-optimized app to map to the AVX application. Would that generate different results that fail validation? Has anyone tried it already?
2) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 41422)
Posted 29 Jan 2020 by wujj123456
Post:
Finally finished: https://lhcathome.cern.ch/lhcathome/result.php?resultid=259641514

===> [runRivet] Mon Jan 20 15:24:51 UTC 2020 [boinc pp jets 8000 800 - sherpa 1.4.1 default 100000 16]

Run time 4 days 12 hours 57 min 5 sec
CPU time 4 days 12 hours 21 min 56 sec

So it actually finishes? I have a few of these 1d+ or 2d+ WUs at 100% progress as well. I felt they would never finish...
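
The CPU time is within about 35 minutes of the run time, i.e. the task was computing for over 99% of its roughly 4.5 days rather than sitting idle. A small Python sketch of that arithmetic; `to_seconds` is a hypothetical helper that only understands the duration format shown above:

```python
import re

# Seconds per unit, for the unit words used on the BOINC result page.
UNITS = {"days": 86400, "day": 86400, "hours": 3600, "hour": 3600, "min": 60, "sec": 1}


def to_seconds(text: str) -> int:
    """Convert a duration like '4 days 12 hours 57 min 5 sec' to seconds."""
    return sum(int(n) * UNITS[unit] for n, unit in re.findall(r"(\d+)\s+([a-z]+)", text))


run = to_seconds("4 days 12 hours 57 min 5 sec")
cpu = to_seconds("4 days 12 hours 21 min 56 sec")
# CPU time as a share of wall-clock run time.
print(run, cpu, f"{cpu / run:.2%}")
```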
3) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41410)
Posted 28 Jan 2020 by wujj123456
Post:
The BOINC data directory must be mounted inside the container, and with a default installation this is /var/lib/boinc-client/slots. If there are problems mounting /var, you could try a different data directory or install BOINC in a different place. For example, on my desktop I run boinc-client from my home directory because the root partition is too small.

Thanks for the reply. It looks like it's a bind mount, so I should be able to reproduce this easily without wasting WUs. However, it does seem to work locally, assuming that seeing the error message below means the container was set up properly with the remount.

$ sudo su -l boinc -s /bin/bash -c '/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/32 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img sh ls'
INFO: Convert SIF file to sandbox...
/usr/bin/ls: /usr/bin/ls: cannot execute binary file
INFO: Cleaning up image...
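
A note on the test command: with `sh ls`, the container's shell looks up `ls` and tries to read /usr/bin/ls as a shell script, which is where "cannot execute binary file" comes from. So the container started and ran `sh` without the remount error, but the command itself did not run as intended. The sketch below (paths copied from the post; `build_cmd` is a hypothetical helper) prints the same invocation with `ls -la` run directly. It still goes through `su` in a login shell rather than through the boinc-client service, so it may not reproduce restrictions that the systemd unit applies.

```python
import shlex

# Paths copied from the command in the post above.
SINGULARITY = "/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity"
IMAGE = "/cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img"


def build_cmd(slot_dir: str, *inner: str) -> list:
    """Build the same su/singularity invocation as the post, running `inner` directly."""
    container = [SINGULARITY, "exec", "--pwd", slot_dir, "-B", "/cvmfs,/var", IMAGE, *inner]
    # shlex.join quotes every argument so the string survives `bash -c` unchanged.
    return ["sudo", "su", "-l", "boinc", "-s", "/bin/bash", "-c", shlex.join(container)]


# Print a command line to paste into a terminal; `ls -la` replaces `sh ls`.
print(shlex.join(build_cmd("/var/lib/boinc-client/slots/32", "ls", "-la")))
```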

Now I wonder if it's some setting in the default unit file that came with Ubuntu 19.10: https://pastebin.com/akEe8cyY. I am not that familiar with systemd unit files, but nothing looked suspicious after going through the man page. The symlink /var/lib/boinc should be resolved correctly, given that all WUs read and write /var/lib/boinc-client/ without a problem. Any ideas where I should look next?
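
One way to narrow this down is to ask systemd for the effective sandboxing settings of the service, since options such as ProtectSystem and ReadWritePaths make a service run in its own mount namespace with parts of the filesystem mounted read-only. Whether any of them is what blocks Singularity's remount of /var is an assumption to test, not a confirmed cause. A Python sketch; the property list and helper names are my own choice:

```python
import subprocess
from typing import Dict, Sequence

# systemd options that give a service its own mount namespace or restrict
# privileges. Which (if any) matters for Singularity here is an assumption.
PROPS = ("ProtectSystem", "ProtectHome", "ReadWritePaths", "ReadOnlyPaths",
         "PrivateTmp", "NoNewPrivileges")


def parse_show(text: str) -> Dict[str, str]:
    """Parse `systemctl show` output (one Key=Value per line) into a dict."""
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)


def sandbox_settings(unit: str = "boinc-client", props: Sequence[str] = PROPS) -> Dict[str, str]:
    """Ask systemd for the effective values of `props` on `unit`."""
    out = subprocess.run(["systemctl", "show", unit, "-p", ",".join(props)],
                         capture_output=True, text=True, check=True).stdout
    return parse_show(out)
```

On the host, `print(sandbox_settings())` shows the values; `systemctl cat boinc-client` shows the unit file together with any drop-in overrides.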
4) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41367)
Posted 27 Jan 2020 by wujj123456
Post:

A bit of research on the stderr error message, which may be significant:
"container creation failed: mount ->/var error: can't remount /var: operation not permitted"
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256777262
It seems to have something to do with how the local storage is mounted.
https://github.com/sylabs/singularity/issues/2282

I am running into the same issue. Is this /var on the host filesystem? I probably don't want Singularity to remount /var on my host system, but if it's trying to remount because of some missing flags, I can check what they do and add them so that the remount becomes a no-op and succeeds.
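
To check whether /var is a separate mount on the host, and how it looks from inside the BOINC client's own mount namespace (which can differ from a login shell's if the service is sandboxed), one can read /proc/<pid>/mountinfo, whose format is documented in proc(5). A Python sketch; the helper names and the default path are hypothetical:

```python
from pathlib import Path
from typing import Optional


def covering_mount(mountinfo: str, path: str) -> Optional[dict]:
    """Find the mount that contains `path`, given /proc/<pid>/mountinfo text.

    Line format (proc(5)): ID parent major:minor root MOUNTPOINT OPTIONS
    [optional fields, e.g. shared:N] - FSTYPE SOURCE SUPER_OPTIONS
    Octal escapes in paths (e.g. for spaces) are not decoded here.
    """
    best = None
    for line in mountinfo.splitlines():
        left, sep, right = line.partition(" - ")
        fields = left.split()
        if not sep or len(fields) < 6:
            continue
        mountpoint = fields[4]
        if path == mountpoint or path.startswith(mountpoint.rstrip("/") + "/"):
            # Longest match is most specific; on ties the later (stacked) mount wins.
            if best is None or len(mountpoint) >= len(best["mountpoint"]):
                best = {"mountpoint": mountpoint, "options": fields[5].split(","),
                        "optional": fields[6:], "fstype": right.split()[0]}
    return best


def show(path: str = "/var/lib/boinc-client/slots", pid: str = "self") -> None:
    """Print how `path` is mounted as seen by process `pid` (run as root for other PIDs)."""
    print(pid, covering_mount(Path(f"/proc/{pid}/mountinfo").read_text(), path))
```

As root, compare `show()` with `show(pid="1234")`, where 1234 stands for the PID reported by `pidof boinc`; a difference in `options` (e.g. `ro`) or in the propagation fields (e.g. `shared:N`) would be a lead.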

https://lhcathome.cern.ch/lhcathome/result.php?resultid=260003929

If I can't resolve this, is there a way to disable native ATLAS while still allowing native Theory, without refusing ATLAS work entirely?
5) Message boards : Theory Application : Unable to start VM on some WUs (Message 41258)
Posted 14 Jan 2020 by wujj123456
Post:
I configured automatic upgrades, but not automatic reboots. I thought the new kernel and components would only take effect after I reboot. I'll turn automatic updates off to ensure vboxdrv always stays in sync with the running kernel, and see if the results improve.
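
A quick way to confirm after an upgrade that a vboxdrv module exists for the running kernel (the same release string `uname -r` prints) is to look for it under /lib/modules; `dkms status` gives similar information. A Python sketch; the helper names are hypothetical, and the exact subdirectory DKMS installs into is an assumption, so the whole tree is searched:

```python
import os
from pathlib import Path
from typing import List


def vboxdrv_modules(release: str, modules_root: str = "/lib/modules") -> List[Path]:
    """List vboxdrv module files (e.g. built by DKMS) for kernel `release`."""
    root = Path(modules_root) / release
    # Match vboxdrv.ko as well as compressed variants such as vboxdrv.ko.xz.
    return sorted(root.rglob("vboxdrv.ko*")) if root.is_dir() else []


def check() -> None:
    """Report whether the running kernel has a vboxdrv module and the device node exists."""
    release = os.uname().release
    print("running kernel:", release)
    print("vboxdrv modules:", [str(p) for p in vboxdrv_modules(release)] or "none")
    print("/dev/vboxdrv present:", os.path.exists("/dev/vboxdrv"))
```

This only shows that a matching module exists on disk; `lsmod | grep vboxdrv` shows whether it is actually loaded.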
6) Message boards : Theory Application : Unable to start VM on some WUs (Message 41245)
Posted 14 Jan 2020 by wujj123456
Post:
I checked a few failed tasks, and they all failed with messages like this:

2020-01-12 15:31:37 (2138):
Command: VBoxManage -q startvm "boinc_74513d880c5d6ae6" --type headless
Exit Code: 1
Output:
WARNING: The character device /dev/vboxdrv does not exist.
Please install the virtualbox-dkms package and the appropriate
headers, most likely linux-headers-generic.

You will not be able to start VMs until this problem is fixed.
VBoxManage: error: The virtual machine 'boinc_74513d880c5d6ae6' has terminated unexpectedly during startup with exit code 1 (0x1)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component MachineWrap, interface IMachine
Waiting for VM "boinc_74513d880c5d6ae6" to power on...\


Example failures:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259134333
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259134354

However, /dev/vboxdrv exists, and the mentioned packages are installed on the host. The host also has valid results for the same application: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10595991&offset=0&show_names=0&state=0&appid=13

I couldn't find any smoking gun as to why it fails some WUs but not others. Could /dev/vboxdrv temporarily become inaccessible for some reason I should be aware of?
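
One way to answer this empirically is to poll /dev/vboxdrv and log each time it disappears or reappears, then line the timestamps up with the failed tasks and with the package-upgrade log (/var/log/apt/history.log on Ubuntu). A minimal Python sketch; the one-second interval and the function name are arbitrary choices, and gaps shorter than the interval can be missed:

```python
import os
import time
from datetime import datetime


def watch(path="/dev/vboxdrv", interval=1.0, checks=None):
    """Print and return every change in whether `path` exists.

    Runs forever when `checks` is None (stop with Ctrl-C); otherwise polls `checks` times.
    The first poll is always recorded, so the log starts with the initial state.
    """
    changes, last, n = [], None, 0
    while checks is None or n < checks:
        present = os.path.exists(path)
        if present != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(stamp, path, "present" if present else "MISSING", flush=True)
            changes.append((stamp, present))
            last = present
        n += 1
        time.sleep(interval)
    return changes
```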



©2020 CERN