Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED

wolfman1360
Joined: 17 Feb 17
Posts: 16
Credit: 119,862
RAC: 0
Message 41089 - Posted: 27 Dec 2019, 8:20:27 UTC
Last modified: 27 Dec 2019, 8:21:44 UTC

Hello,
I am getting this error on Atlas native tasks like https://lhcathome.cern.ch/lhcathome/result.php?resultid=256862538 or https://lhcathome.cern.ch/lhcathome/result.php?resultid=256862442

I'm pretty sure I have everything installed correctly as tasks like https://lhcathome.cern.ch/lhcathome/result.php?resultid=256735752 finish successfully.

I'm at a loss as to what to do here. All machines are running BOINC 7.9.3.
I am suspending ATLAS on these machines for the time being.
A quick search of the forum suggested the cause might be that CMS was not installed (it is).
Any help appreciated here.
ID: 41089

Jim1348
Joined: 15 Nov 14
Posts: 415
Credit: 11,880,818
RAC: 3,894
Message 41090 - Posted: 27 Dec 2019, 8:40:47 UTC - in response to Message 41089.  
Last modified: 27 Dec 2019, 8:51:24 UTC

I am getting this error on Atlas native tasks like https://lhcathome.cern.ch/lhcathome/result.php?resultid=256862538 or https://lhcathome.cern.ch/lhcathome/result.php?resultid=256862442

I'm pretty sure I have everything installed correctly as tasks like https://lhcathome.cern.ch/lhcathome/result.php?resultid=256735752 finish successfully.

I'm at a loss of what to do here. All machines running Boinc 7.9.3.

Been there, done that. The only way I can get native ATLAS to run is by creating a second BOINC account (BOINC2).
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5169&postid=40851#40851
That is a bit of a pain if you have never done it before, but easy enough thereafter.
It removes the "lock" on the files that is somehow preventing it from running.

After a bit of research, the following stderr error message may be significant:
"container creation failed: mount ->/var error: can't remount /var: operation not permitted"
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256777262
It seems to have something to do with how the local storage is mounted.
https://github.com/sylabs/singularity/issues/2282

EDIT: Even after all that, you have to do the following:
1. Attach to LHC and download native ATLAS.
2. Allow a native ATLAS task to start up.
3. Allow access by all (or ATLAS will fail):
sudo chmod -R 777 /var/lib/boinc-client
sudo chmod -R 777 /var/lib/boinc2

I am out at this point. The experts need to get back to work.
ID: 41090

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1422
Credit: 72,818,211
RAC: 107,612
Message 41091 - Posted: 27 Dec 2019, 8:49:56 UTC - in response to Message 41089.  

The stderr.txt clearly states what is missing - python2:
/usr/bin/env: ‘python2’: No such file or directory


Python2 is still required for ATLAS native but not for Theory native.
Hence this example succeeded:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256735752
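
A quick way to confirm this on a host is to check whether `python2` resolves on PATH, since that is how `/usr/bin/env python2` finds it. This is a minimal sketch, not part of the project's tooling; the helper name `have_command` is hypothetical, and the check is only meaningful when run as the same user that runs BOINC.

```shell
# Hypothetical helper: succeed if a command can be found on PATH,
# which is the same lookup `/usr/bin/env python2` performs.
have_command() {
    command -v "$1" > /dev/null 2>&1
}

if have_command python2; then
    echo "python2 found at $(command -v python2)"
else
    echo "python2 not found on PATH"
fi
```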
ID: 41091

Jim1348
Joined: 15 Nov 14
Posts: 415
Credit: 11,880,818
RAC: 3,894
Message 41092 - Posted: 27 Dec 2019, 8:53:51 UTC - in response to Message 41091.  

I have python, or it would not work at all. It seems to have problems accessing it.
ID: 41092

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1422
Credit: 72,818,211
RAC: 107,612
Message 41093 - Posted: 27 Dec 2019, 9:03:49 UTC - in response to Message 41092.  

David Cameron is working on a solution to make python2 obsolete.
See:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=507&postid=6893#6893
ID: 41093

Jim1348
Joined: 15 Nov 14
Posts: 415
Credit: 11,880,818
RAC: 3,894
Message 41094 - Posted: 27 Dec 2019, 14:36:55 UTC - in response to Message 41093.  

Good. I hope they can make it more like native Theory.
That works.
ID: 41094

wolfman1360
Joined: 17 Feb 17
Posts: 16
Credit: 119,862
RAC: 0
Message 41096 - Posted: 27 Dec 2019, 17:16:11 UTC

Thank you for catching that. Apparently trying to read things at 2 in the morning is hard.
I have installed Python 2.7. I have not set up a second BOINC instance, but I have also run
sudo chmod -R 777 /var/lib/boinc-client
Hopefully everything is good from now on. I did not see that Python was a prerequisite for ATLAS native either, but of course I may have missed that.
It will be fantastic when this is no longer needed. Hopefully that comes out of dev soon.
ID: 41096

Gunde
Joined: 9 Jan 15
Posts: 83
Credit: 330,840,617
RAC: 245,244
Message 41279 - Posted: 16 Jan 2020, 17:37:33 UTC
Last modified: 16 Jan 2020, 17:51:49 UTC

Does anyone have a solution for this?

2020-01-16 14:28:22,074: Checking singularity with cmd:/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname
2020-01-16 14:28:23,158: Singularity isnt working: INFO:    Convert SIF file to sandbox...
FATAL:   while extracting /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img: root filesystem extraction failed: failed to copy content in staging file: write /tmp/archive-846395633: no space left on device


Sometimes a host gets into trouble. This has happened before and resolved itself after a while. One host happens to be affected more than the others, but the setup is the same for all hosts.
This host uses the default values in its config. Would it help to increase the cache? Or is there any other parameter I could adjust?

Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=259466119

Device info storage
Cache	46080 KB
Swap space	4 GB
Total disk space	410.56 GB
Free Disk Space	393.62 GB


Another host same issue:
Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=258968040

I could try resizing swap.
ID: 41279

Jim1348
Joined: 15 Nov 14
Posts: 415
Credit: 11,880,818
RAC: 3,894
Message 41280 - Posted: 16 Jan 2020, 18:46:39 UTC - in response to Message 41279.  
Last modified: 16 Jan 2020, 18:54:42 UTC

Are you using all 72 cores? I normally provide at least 2 GB per core when running native ATLAS as single-core tasks.
So you may need more memory if you ever run only native ATLAS (running 2 cores per task would be the easy fix).
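
That rule of thumb can be turned into a quick estimate. This is a minimal sketch, assuming single-core tasks and the 2 GB per task figure mentioned above; the function name `required_gb` is hypothetical.

```shell
# Hypothetical helper: estimate RAM (in GB) for N concurrent single-core
# native ATLAS tasks, given a per-task allowance in GB.
required_gb() {
    tasks=$1
    gb_per_task=$2
    echo $(( tasks * gb_per_task ))
}

required_gb 72 2   # all 72 cores running single-core ATLAS: prints 144
required_gb 36 2   # half of the cores running ATLAS: prints 72
```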
ID: 41280

maeax
Joined: 2 May 07
Posts: 914
Credit: 33,698,037
RAC: 7,578
Message 41281 - Posted: 16 Jan 2020, 18:57:27 UTC - in response to Message 41279.  

ID: 41281

Gunde
Joined: 9 Jan 15
Posts: 83
Credit: 330,840,617
RAC: 245,244
Message 41282 - Posted: 16 Jan 2020, 19:14:06 UTC - in response to Message 41280.  

Yes, I use all 72 cores, but only half of them are used for ATLAS; the rest run SixTrack and other projects. At the highest I have seen, memory use went up to 70 GB on this system, and I only had an issue when Rosetta took 4 GB per task and when running yoyo ECM P2.
Right now, with ATLAS running on this host, it uses around 50 GB of RAM.

I noticed that /dev/cl/root was full; by default it was set to 50 GiB. I will update and restart now and see if that clears up some space. If not, I will try to resize it or reinstall the OS.
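
The "no space left on device" error was raised while writing to /tmp, which presumably sits on the root volume here rather than on the large disk that BOINC reports. A minimal sketch for checking the free space of the filesystems that matter; the helper name `free_kib` is hypothetical.

```shell
# Hypothetical helper: print the free space, in KiB, of the filesystem
# that holds the given path. `df -P` keeps each filesystem on one line.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

for path in / /tmp; do
    echo "$path: $(free_kib "$path") KiB free"
done
```

If /tmp shows far less free space than the BOINC data disk, the image extraction step is the likely point of failure.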
ID: 41282

Gunde
Joined: 9 Jan 15
Posts: 83
Credit: 330,840,617
RAC: 245,244
Message 41283 - Posted: 16 Jan 2020, 21:23:44 UTC - in response to Message 41281.  

Thanks for looking up SIF. It looks like SIF is read-only, but it would tell a lot about how Singularity works with ATLAS.

The root filesystem filled up after I tried to create a new swap file, and after updates and a reboot the host stalled. There was no rescuing it, so I reinstalled the OS and gave root and swap more space with a custom setup instead of the defaults.
I just hope that solves it, and then I will deal with all the other hosts.
ID: 41283

wujj123456
Joined: 14 Sep 08
Posts: 5
Credit: 3,178,915
RAC: 13,400
Message 41367 - Posted: 27 Jan 2020, 1:01:59 UTC - in response to Message 41090.  
Last modified: 27 Jan 2020, 1:03:13 UTC


A bit of research on the stderr error message may be significant.
"container creation failed: mount ->/var error: can't remount /var: operation not permitted"
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256777262
It seems to have something to do with how the local storage is mounted.
https://github.com/sylabs/singularity/issues/2282

I am running into the same. Is this /var on host filesystem? I probably don't want singularity to remount my /var on host system, but if it's trying to mount due to some missing flags, I can probably check what they do and add them so that remount becomes a noop and succeeds.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=260003929

If I couldn't resolve this, is there a way to disable native atlas while allowing native theory without refusing atlas work entirely?
ID: 41367

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 311
Credit: 10,222,747
RAC: 5,558
Message 41372 - Posted: 27 Jan 2020, 11:05:07 UTC - in response to Message 41367.  


A bit of research on the stderr error message may be significant.
"container creation failed: mount ->/var error: can't remount /var: operation not permitted"
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256777262
It seems to have something to do with how the local storage is mounted.
https://github.com/sylabs/singularity/issues/2282

I am running into the same. Is this /var on host filesystem? I probably don't want singularity to remount my /var on host system, but if it's trying to mount due to some missing flags, I can probably check what they do and add them so that remount becomes a noop and succeeds.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=260003929

If I couldn't resolve this, is there a way to disable native atlas while allowing native theory without refusing atlas work entirely?


The BOINC data directory must be mounted inside the container, and with a default installation this is /var/lib/boinc-client/slots. If there are problems mounting /var you could try a different data directory or install BOINC in a different place. For example on my desktop I run boinc-client from my home directory because the root partition is too small.
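
To see which filesystem would be involved in that bind mount, one can ask which mount point holds the BOINC data directory. This is a minimal sketch; the helper name `mount_point_of` is hypothetical, and the path is the default one named above.

```shell
# Hypothetical helper: print the mount point of the filesystem holding a path.
# Column 6 of `df -P` output is "Mounted on".
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $6 }'
}

data_dir=/var/lib/boinc-client   # default data directory named in the post
if [ -d "$data_dir" ]; then
    echo "$data_dir is on $(mount_point_of "$data_dir")"
else
    echo "$data_dir not found; adjust data_dir for this host"
fi
```

On Linux, the options of that mount point can then be looked up in /proc/mounts.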
ID: 41372

wujj123456
Joined: 14 Sep 08
Posts: 5
Credit: 3,178,915
RAC: 13,400
Message 41410 - Posted: 28 Jan 2020, 4:05:21 UTC - in response to Message 41372.  
Last modified: 28 Jan 2020, 4:06:49 UTC

The BOINC data directory must be mounted inside the container, and with a default installation this is /var/lib/boinc-client/slots. If there are problems mounting /var you could try a different data directory or install BOINC in a different place. For example on my desktop I run boinc-client from my home directory because the root partition is too small.

Thanks for the reply. It looks like it's a bind mount, so I should be able to reproduce this easily without wasting WUs. However, it does seem to work locally, assuming that seeing the error message below means the container has been set up properly with the remount.

$ sudo su -l boinc -s /bin/bash -c '/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/32 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img sh ls'
INFO: Convert SIF file to sandbox...
/usr/bin/ls: /usr/bin/ls: cannot execute binary file
INFO: Cleaning up image...
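
A note on the test above: `sh ls` asks the shell to read a file named `ls` as a script, and with bash as `sh` that lookup falls back to PATH, which likely explains the "cannot execute binary file" message even though the container itself started. A minimal sketch of the working form, runnable on the host without Singularity; the helper name `run_in_shell` is hypothetical.

```shell
# Hypothetical helper: run a command string through `sh -c`, which executes
# the string as a command line instead of treating its first word as a
# script file to read.
run_in_shell() {
    sh -c "$1"
}

# Inside the container the equivalent would be `... sh -c 'ls'`
# or simply `... ls` at the end of the singularity exec command.
run_in_shell 'ls /' > /dev/null && echo "sh -c works"
```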

Now I wonder if it's some setting in the default unit file that came with Ubuntu 19.10: https://pastebin.com/akEe8cyY. I am not that familiar with systemd unit files, but nothing looks suspicious after searching the man page. Clearly the symlink /var/lib/boinc should have been resolved, given that all WUs read and write /var/lib/boinc-client/ without a problem. Any ideas where I should look next?
ID: 41410

djoser
Joined: 30 Aug 14
Posts: 89
Credit: 7,748,724
RAC: 6,417
Message 42279 - Posted: 25 Apr 2020, 17:38:46 UTC - in response to Message 41410.  
Last modified: 25 Apr 2020, 17:50:54 UTC

Hello,

I just noticed that I had 2 WUs with that error today:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=271962666
and
https://lhcathome.cern.ch/lhcathome/result.php?resultid=271977499

The one running right now seems to have the same problem. Running, but low CPU usage.

I know what this error means, but does anyone know why there are no sub tasks?

Greetings, djoser.
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! www.gridcoin.us
ID: 42279

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1422
Credit: 72,818,211
RAC: 107,612
Message 42280 - Posted: 25 Apr 2020, 18:23:10 UTC - in response to Message 42279.  

No guarantee, but you could check whether the following options in /etc/cvmfs/default.local solve the issue (or slow down your CVMFS).
CVMFS_MAX_RETRIES=3
CVMFS_KCACHE_TIMEOUT=3

Run "cvmfs_config reload" after you saved the config file.
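
The two settings can be written so that repeated runs do not leave duplicate lines. This is a minimal sketch demonstrated on a scratch file; the helper name `set_option` is hypothetical, and the real file (/etc/cvmfs/default.local) needs root to edit.

```shell
# Hypothetical helper: set KEY=VALUE in a config file, replacing any
# existing assignment of the same key so reruns stay idempotent.
set_option() {
    opt_file=$1
    opt_key=$2
    opt_value=$3
    touch "$opt_file"
    # Copy every line except an old assignment of this key, then append.
    grep -v "^${opt_key}=" "$opt_file" > "${opt_file}.tmp" || true
    printf '%s=%s\n' "$opt_key" "$opt_value" >> "${opt_file}.tmp"
    mv "${opt_file}.tmp" "$opt_file"
}

# Scratch copy standing in for /etc/cvmfs/default.local.
conf=$(mktemp)
set_option "$conf" CVMFS_MAX_RETRIES 3
set_option "$conf" CVMFS_KCACHE_TIMEOUT 3
cat "$conf"
```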
ID: 42280

djoser
Joined: 30 Aug 14
Posts: 89
Credit: 7,748,724
RAC: 6,417
Message 42504 - Posted: 15 May 2020, 19:02:08 UTC

I set up a machine with Ubuntu 20.04 (ID: 10651915), which also fails to create the container due to permission problems. During the CVMFS setup procedure I noticed a message saying that the distribution is not supported and that the settings for 18.04 are being used.
Maybe the two are related?

BTW: The 'chmod -R 777 /var/lib/boinc-client' command did not resolve the issue for me.
ID: 42504

Gunde
Joined: 9 Jan 15
Posts: 83
Credit: 330,840,617
RAC: 245,244
Message 42505 - Posted: 15 May 2020, 21:44:03 UTC - in response to Message 42504.  
Last modified: 15 May 2020, 21:50:26 UTC

I am running 20.04 as well, and there are no packages for it. The setup suggests 18.04 instead, and that fails for me too.

We will need to wait, although some might be able to build it themselves or use a nightly. I have not tried that and use VirtualBox instead.

CERN, could we get focal added to dists?
http://cvmrepo.web.cern.ch/cvmrepo/apt/dists/
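
To compare a host against the directory names listed at that URL, the release codename (for example focal for 20.04) can be read from the os-release file. This is a minimal sketch; the helper name `codename_from` is hypothetical.

```shell
# Hypothetical helper: print VERSION_CODENAME from an os-release style file.
# The file is sourced in a subshell so its variables do not leak out.
codename_from() {
    ( . "$1" && printf '%s\n' "${VERSION_CODENAME:-unknown}" )
}

if [ -r /etc/os-release ]; then
    codename_from /etc/os-release
fi
```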
ID: 42505

djoser
Joined: 30 Aug 14
Posts: 89
Credit: 7,748,724
RAC: 6,417
Message 42527 - Posted: 17 May 2020, 18:54:06 UTC
Last modified: 17 May 2020, 19:13:52 UTC

I already downgraded to 18.04.
Now I'm running into serious problems with CVMFS, see here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=273707970

Also, all my jobs are running much longer and CPU efficiency has dropped significantly. I'm almost sure that's network related (CVMFS connections to CERN).
ID: 42527


©2020 CERN