Message boards : ATLAS application : Creation of container failed

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 44734 - Posted: 14 Apr 2021, 17:58:28 UTC
Last modified: 14 Apr 2021, 18:02:39 UTC

For the last few minutes I have been getting the following error on ALL of my ATLAS native tasks:

FATAL: container creation failed: hook function for tag prelayer returns error: failed to create /var/lib/condor directory: mkdir /var/lib/condor: permission denied

Since this only started today and affects more than one machine, I don't think this is a local problem. I didn't change anything on my machines, which had been working flawlessly for months.

Oh wait... there was an update for my OS today. Unfortunately I do not remember which package was updated. I hope this is not the problem...

Anyone else with this problem? Any hints?

Regards, djoser.
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us

[AF>Le_Pommier] Jerome_C2005
Joined: 12 Jul 11
Posts: 76
Credit: 1,064,445
RAC: 51
Message 44735 - Posted: 14 Apr 2021, 18:20:11 UTC
Last modified: 14 Apr 2021, 18:20:38 UTC

Hi

I have a different error with most of my recent ATLAS native tasks, with a rather curious "validate the error" status:

[2021-04-14 02:29:47] *** Error codes and diagnostics ***
[2021-04-14 02:29:47] "exeErrorCode": 65,
[2021-04-14 02:29:47] "exeErrorDiag": "Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: \"AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider\"",
[2021-04-14 02:29:47] "pilotErrorCode": 1165,
[2021-04-14 02:29:47] "pilotErrorDiag": "Local output file is missing",
[2021-04-14 02:29:47] *** Listing of results directory ***

See that one for example.

Jim1348
Joined: 15 Nov 14
Posts: 568
Credit: 17,800,348
RAC: 21,210
Message 44736 - Posted: 14 Apr 2021, 18:51:18 UTC - in response to Message 44735.  

I had 21 native ATLAS invalids yesterday, apparently due to segfaults.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=311786429
https://lhcathome.cern.ch/lhcathome/result.php?resultid=311785736
https://lhcathome.cern.ch/lhcathome/result.php?resultid=311783201

However, someone else was usually able to complete them successfully if they were running CentOS or Scientific Linux.

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 354
Credit: 12,041,535
RAC: 3,823
Message 44739 - Posted: 14 Apr 2021, 20:57:19 UTC - in response to Message 44734.  

A few hours ago the version of Singularity on CVMFS that is used to run ATLAS native tasks was updated from 3.2.1 to 3.7.2. This could explain the problem you see. It would be good to know if others are experiencing similar issues.

This new version was thoroughly tested by ATLAS but here on LHC@Home there are a lot of different platforms and environments so it's possible it might cause a problem.

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 44741 - Posted: 14 Apr 2021, 21:38:00 UTC - in response to Message 44739.  
Last modified: 14 Apr 2021, 21:38:55 UTC

This would explain the sudden problem of creating the singularity container on my machines.
I don't have singularity installed locally, so my machines use the CVMFS version, which was updated and doesn't seem to work on my machines. So I guess the simplest solution is to install singularity locally...

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 354
Credit: 12,041,535
RAC: 3,823
Message 44743 - Posted: 15 Apr 2021, 6:44:32 UTC - in response to Message 44741.  

I have asked our singularity expert to have a look at your errors. I am not sure if the /var/lib/condor dir is significant - are you running condor on your machines? It could be that this dir exists in the image and has different permissions to your local dir.

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 44744 - Posted: 15 Apr 2021, 7:24:43 UTC - in response to Message 44743.  
Last modified: 15 Apr 2021, 7:25:57 UTC

Thanks for looking into my problem.
No, I'm not using condor. The directory /var/lib/condor doesn't even exist on my machines.

On one of my machines I will try to run ATLAS native with a locally installed singularity, to see if this helps.

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 44745 - Posted: 15 Apr 2021, 8:35:53 UTC - in response to Message 44744.  
Last modified: 15 Apr 2021, 8:36:45 UTC

No, unfortunately even with a locally installed singularity ATLAS native doesn't work.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=312206373

The error in the logfile shouldn't occur because the user is a member of both groups "boinc" and "singularity" and therefore should have all permissions needed.
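
The group claim above can be double-checked by listing which of the expected groups an account is actually missing. This is only a minimal sketch: the group names "boinc" and "singularity" come from the post, while the helper names and the account name "boinc" in the usage lines are assumptions. Group membership may also turn out not to be the deciding factor for this error.

```python
def missing_groups(user_groups, required=("boinc", "singularity")):
    """Return the required groups that are absent from user_groups, sorted."""
    return sorted(set(required) - set(user_groups))


def groups_of(username):
    """Collect primary and supplementary group names of a local account (Unix only)."""
    import grp
    import pwd

    primary_gid = pwd.getpwnam(username).pw_gid
    names = {grp.getgrgid(primary_gid).gr_name}
    names.update(g.gr_name for g in grp.getgrall() if username in g.gr_mem)
    return names


if __name__ == "__main__":
    # Assumption: the tasks run under an account called "boinc"; adjust as needed.
    try:
        print(missing_groups(groups_of("boinc")))
    except (KeyError, ImportError):
        print("account not found, or not a Unix system")
```

An empty list means the account is in every required group, in which case the permission error probably has another cause.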

maeax
Joined: 2 May 07
Posts: 1301
Credit: 39,583,271
RAC: 11,364
Message 44746 - Posted: 15 Apr 2021, 9:40:52 UTC

These are the Singularity messages from your last successful ATLAS native tasks.
This is from your first PC:
[2021-04-13 20:53:28] Using singularity image /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
[2021-04-13 20:53:28] Checking for singularity binary...
[2021-04-13 20:53:28] which: no singularity in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
[2021-04-13 20:53:28] Singularity is not installed, using version from CVMFS
[2021-04-13 20:53:28] Checking singularity works with /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname
[2021-04-13 20:53:32] INFO:  Convert SIF file to sandbox... einstein INFO:  Cleaning up image...
[2021-04-13 20:53:32] Singularity works


This is from your second PC:
[2021-04-14 15:17:40] Checking for singularity binary...
[2021-04-14 15:17:40] which: no singularity in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
[2021-04-14 15:17:40] Singularity is not installed, using version from CVMFS
[2021-04-14 15:17:40] Checking singularity works with /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname
[2021-04-14 15:17:42] INFO:  Convert SIF file to sandbox... hawking INFO:  Cleaning up image...
[2021-04-14 15:17:42] Singularity works


I am using only CentOS 7 and CentOS 8, both with a locally installed Singularity:
[2021-04-14 07:07:54] Running /usr/bin/singularity --version
[2021-04-14 07:07:54] singularity version 3.7.1-1.el8
[2021-04-13 14:57:19] Running /usr/bin/singularity --version
[2021-04-13 14:57:19] singularity version 3.4.0-1.2.el7
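
When comparing hosts it helps to pull out of each task log whether the CVMFS or a local Singularity was used, and which version was reported. The sketch below assumes the wrapper messages look exactly like the excerpts quoted in this thread ("using version from CVMFS", "Using singularity found in PATH", "singularity version ..."); the function name is hypothetical and other wrapper versions may word these lines differently.

```python
import re


def singularity_source(log_lines):
    """Return (source, version): source is "cvmfs", "local" or "unknown"."""
    source, version = "unknown", None
    for line in log_lines:
        if "using version from CVMFS" in line:
            source = "cvmfs"
        elif "Using singularity found in PATH" in line:
            source = "local"
        # Matches e.g. "singularity version 3.7.1-1.el8", but not the
        # "Running ... singularity --version" command line itself.
        match = re.search(r"singularity version (\S+)", line)
        if match:
            version = match.group(1)
    return source, version


if __name__ == "__main__":
    sample = [
        "[2021-04-14 15:17:40] Checking for singularity binary...",
        "[2021-04-14 15:17:40] Singularity is not installed, using version from CVMFS",
    ]
    print(singularity_source(sample))
```

A version of None simply means the log excerpt contained no "singularity version" line, as is the case for the CVMFS excerpts above.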

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 44747 - Posted: 15 Apr 2021, 10:15:11 UTC - in response to Message 44746.  

Yes, those are tasks from BEFORE the singularity update David mentioned.
All tasks AFTER this update are failing.

maeax
Joined: 2 May 07
Posts: 1301
Credit: 39,583,271
RAC: 11,364
Message 44748 - Posted: 15 Apr 2021, 10:36:40 UTC - in response to Message 44747.  

What does this info message mean:
"Convert SIF file to sandbox..."?

[AF>Le_Pommier] Jerome_C2005
Joined: 12 Jul 11
Posts: 76
Credit: 1,064,445
RAC: 51
Message 44749 - Posted: 15 Apr 2021, 10:43:51 UTC
Last modified: 15 Apr 2021, 10:47:27 UTC

The latest task that failed for me first says

[2021-04-15 05:40:32] Using singularity image /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
[2021-04-15 05:40:32] Checking for singularity binary...
[2021-04-15 05:40:32] Using singularity found in PATH at /usr/local/bin/singularity
[2021-04-15 05:40:32] Running /usr/local/bin/singularity --version
[2021-04-15 05:40:32] singularity version 3.7.2+12-g1eba63670
[2021-04-15 05:40:32] Checking singularity works with /usr/local/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname
[2021-04-15 05:40:32] vps-3dca72ac
[2021-04-15 05:40:32] Singularity works
[2021-04-15 05:40:33] Set ATHENA_PROC_NUMBER=2
[2021-04-15 05:40:33] Starting ATLAS job with PandaID=5025257188

and then (about 15 minutes later)

[2021-04-15 05:54:39] *** Error codes and diagnostics ***
[2021-04-15 05:54:39] "exeErrorCode": 65,
[2021-04-15 05:54:39] "exeErrorDiag": "Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: \"AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider\"",
[2021-04-15 05:54:39] "pilotErrorCode": 1165,
[2021-04-15 05:54:39] "pilotErrorDiag": "Local output file is missing",
[2021-04-15 05:54:39] *** Listing of results directory ***


Shortly before that, another task failed, but its error log is completely different and so is its status ("error", whereas the previous one is "validate the error").

Before all this I had no problem with Atlas native.

maeax
Joined: 2 May 07
Posts: 1301
Credit: 39,583,271
RAC: 11,364
Message 44751 - Posted: 15 Apr 2021, 10:53:46 UTC

OK, we have to wait for the answer from David and the Singularity expert.

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 354
Credit: 12,041,535
RAC: 3,823
Message 44754 - Posted: 15 Apr 2021, 17:20:44 UTC - in response to Message 44749.  


[2021-04-15 05:54:39] "exeErrorDiag": "Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: \"AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider\"",


This error usually means that the host ran out of memory. I have seen it often with vbox tasks when the VM was not given enough memory, but not in native tasks. However your host has only 4GB of memory which is kind of on the limit for running ATLAS tasks. It could be the latest batch uses slightly more memory and you reached the limit.

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 354
Credit: 12,041,535
RAC: 3,823
Message 44755 - Posted: 15 Apr 2021, 19:00:23 UTC - in response to Message 44745.  

The feedback I got said there was a configuration change related to setuid in the latest singularity, and he pointed me to this page. However, from the last sentence of your post (the user being a member of both the "boinc" and "singularity" groups) it seems you have already done what is suggested there, so I'm not really sure what to do.

[AF>Le_Pommier] Jerome_C2005
Joined: 12 Jul 11
Posts: 76
Credit: 1,064,445
RAC: 51
Message 44756 - Posted: 15 Apr 2021, 19:57:24 UTC - in response to Message 44754.  
Last modified: 15 Apr 2021, 19:58:56 UTC

Thanks for the info. I thought "OK, never mind, let's remove ATLAS from the list", and then I realized that this machine is assigned to the "school" location, whose preferences specifically request not to run any ATLAS tasks. I do have the "if no other work available..." checkbox ticked, BUT there is always other work: SixTrack and native Theory are selected, and I have never seen the machine without other tasks or forced to fetch ATLAS. I have now unticked the box to see if that changes anything, but I am surprised.

(The machine is a small hosted command-line Linux VM and I won't even try to put VirtualBox on it; I don't even know if that's possible, and as you said it has very limited resources.)

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 354
Credit: 12,041,535
RAC: 3,823
Message 44760 - Posted: 16 Apr 2021, 11:51:44 UTC - in response to Message 44756.  

One thing you may try is adding some swap space to the VM. I remember some time ago testing ATLAS tasks on a similar 4GB VM, and they failed with this same error. Adding 1GB of swap space was enough to fix it, and the swap was only used very briefly at the start of the task.
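
To see how close a host is to that situation, RAM plus swap can be read from /proc/meminfo. The following is a rough sketch, not an official requirement check: the 5 GiB threshold is an assumption derived only from the "4GB VM plus 1GB swap" experience described above, and the function names are made up for illustration. Note that a VM sold as "4GB" usually reports slightly less than 4 GiB of MemTotal.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}, sizes being in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_headroom(meminfo, min_total_gib=5.0):
    """Return (RAM + swap in GiB, True if that reaches the assumed threshold)."""
    total_kib = meminfo.get("MemTotal", 0) + meminfo.get("SwapTotal", 0)
    total_gib = total_kib / (1024 * 1024)
    return round(total_gib, 2), total_gib >= min_total_gib


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            print(memory_headroom(parse_meminfo(handle.read())))
    except OSError:
        print("/proc/meminfo is not available on this system")
```

If the second value is False, adding a small swap file as suggested above is probably the cheapest thing to try first.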

[AF>Le_Pommier] Jerome_C2005
Joined: 12 Jul 11
Posts: 76
Credit: 1,064,445
RAC: 51
Message 44761 - Posted: 16 Apr 2021, 18:10:50 UTC

I have done this (added a 10 GB swap file) and enabled ATLAS again. I'll let you know.

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 44762 - Posted: 16 Apr 2021, 19:41:18 UTC - in response to Message 44755.  

That was the one machine I equipped with a local installation of singularity, and yes, I have done what the readme file suggested. Strange that it still doesn't work...

But what about my initial problem on the machine without locally installed singularity?
I tried one more task today and still have the same problem.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=312621601

Did the singularity expert have anything to say about that particular problem?

Thanks and regards!

tatayet64
Joined: 20 Aug 10
Posts: 4
Credit: 2,216,637
RAC: 1
Message 44766 - Posted: 18 Apr 2021, 22:57:40 UTC
Last modified: 18 Apr 2021, 22:59:22 UTC

Hi,

I'm having exactly the same issue as djoser.
All my ATLAS jobs have been failing since 14/04,
with the same error (but a different directory):
[2021-04-18 11:27:14] FATAL:   container creation failed: hook function for tag prelayer returns error: failed to create /var/lib/alternatives directory: mkdir /var/lib/alternatives: permission denied

For example, one of my latest tasks: https://lhcathome.cern.ch/lhcathome/result.php?resultid=313045776
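
Since the failing directory differs between hosts (/var/lib/condor in the first report, /var/lib/alternatives here), it may be useful to collect every directory named in these FATAL lines across several task logs. The sketch below assumes the message format is exactly the one quoted in this thread; the pattern and function name are illustrative only.

```python
import re

# Assumed message shape, copied from the two FATAL lines quoted in this thread.
MKDIR_DENIED = re.compile(
    r"container creation failed: .*?mkdir (?P<path>\S+): permission denied"
)


def denied_directories(log_lines):
    """Return the distinct directories whose creation was denied, in first-seen order."""
    seen = []
    for line in log_lines:
        match = MKDIR_DENIED.search(line)
        if match and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


if __name__ == "__main__":
    lines = [
        "FATAL: container creation failed: hook function for tag prelayer returns "
        "error: failed to create /var/lib/condor directory: mkdir /var/lib/condor: "
        "permission denied",
    ]
    print(denied_directories(lines))
```

A list of affected paths from several hosts might help whoever looks into the image or the Singularity configuration.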

Regards


©2021 CERN