Message boards : Theory Application : Issues Native Theory application

Previous · 1 · 2 · 3 · 4 · 5 · Next

AuthorMessage
Jim1348

Joined: 15 Nov 14
Posts: 454
Credit: 12,303,409
RAC: 3,335
Message 38367 - Posted: 21 Mar 2019, 22:05:29 UTC

It looks like I completed a sherpa, but it was only a 2.2.0 and not that long.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=219413872

Does it count?
ID: 38367
mmonnin

Joined: 22 Mar 17
Posts: 44
Credit: 3,801,950
RAC: 0
Message 38369 - Posted: 21 Mar 2019, 22:12:49 UTC - in response to Message 38343.  
Last modified: 21 Mar 2019, 22:13:18 UTC

I see that the maximum runtime of last 100 tasks is 25.06 hours.

Would be interesting to know what job and result-id it was.


I aborted this one. It had been running for 2 days, while the longest one I've completed took 8 hours.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=109410751
ID: 38369
pianoman

Joined: 29 Jun 18
Posts: 6
Credit: 3,288,185
RAC: 2,182
Message 38370 - Posted: 22 Mar 2019, 2:06:32 UTC
Last modified: 22 Mar 2019, 2:12:44 UTC

Seeing failed tasks on Debian testing (Buster). Yes, I know that's unstable, but it looks to be a container permission issue. Did I miss a step? I have 5 machines: two running Ubuntu 18.04 are running native Theory just fine (a third has its first native Theory job in its queue), but my two Debian Buster systems are throwing this error:

Edit: both systems run native ATLAS just fine.

machine 1:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=219687183
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:21:36 (21317): wrapper (7.15.26016): starting
13:21:36 (21317): wrapper (7.15.26016): starting
13:21:36 (21317): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Detected TheoryN App
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Checking CVMFS.
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Checking runc.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Creating the filesystem.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Creating cgroup for slot 8
mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/boinc’: Permission denied
mkdir: cannot create directory ‘/sys/fs/cgroup/hugetlb’: Read-only file system
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Updating config.json.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Running Container 'runc'.
container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/blkio/boinc: permission denied\""
17:21:38 2019-03-21: cranky-0.0.28: [ERROR] Container 'runc' terminated with status code 1.
13:21:38 (21317): cranky exited; CPU time 0.136671
13:21:38 (21317): app exit status: 0xce
13:21:38 (21317): called boinc_finish(195)

</stderr_txt>
]]>


And a similar message on machine 2:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=219590462
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
01:41:48 (31651): wrapper (7.15.26016): starting
01:41:48 (31651): wrapper (7.15.26016): starting
01:41:48 (31651): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Detected TheoryN App
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Checking CVMFS.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Checking runc.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Creating the filesystem.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Creating cgroup for slot 7
mkdir: cannot create directory ‘/sys/fs/cgroup/hugetlb’: Read-only file system
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Updating config.json.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Running Container 'runc'.
05:41:48 2019-03-20: cranky-0.0.28: [ERROR] Container 'runc' terminated with status code 139.
01:41:49 (31651): cranky exited; CPU time 0.065568
01:41:49 (31651): app exit status: 0xce
01:41:49 (31651): called boinc_finish(195)

</stderr_txt>
]]>
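For hosts hitting the `mkdir` errors above, a small probe along these lines can show which cgroup hierarchies are actually writable before a task starts. This is only a sketch: the controller names and the `boinc` directory come from the logs in this thread, so adjust both to your setup.

```shell
# Sketch: probe which cgroup v1 hierarchies allow creating the per-client
# 'boinc' directory that cranky tries to make. Controller names and the
# 'boinc' path are assumptions taken from the logs in this thread.
check_boinc_cgroups() {
    root="${1:-/sys/fs/cgroup}"
    for ctrl in blkio cpuset memory hugetlb net_cls net_prio; do
        d="$root/$ctrl"
        if [ ! -d "$d" ]; then
            echo "$ctrl: hierarchy not mounted"
        elif mkdir -p "$d/boinc" 2>/dev/null; then
            echo "$ctrl: boinc cgroup OK"
        else
            echo "$ctrl: cannot create (read-only or no permission)"
        fi
    done
}

# check_boinc_cgroups    # run as the user the BOINC client runs as
```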
ID: 38370
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 38374 - Posted: 22 Mar 2019, 10:01:32 UTC - in response to Message 38370.  

Your Debian machine may be handling cgroups differently. Did you do the steps for Suspend/Resume from the instructions?
ID: 38374
maeax

Joined: 2 May 07
Posts: 980
Credit: 34,577,004
RAC: 18,172
Message 38381 - Posted: 22 Mar 2019, 19:30:56 UTC - in response to Message 38344.  

No result so far. Sherpa 1.4.3 has now been running for 25 hours, and I'm hoping it comes to a good end.
runRivet.log is growing every minute (255 KByte), nevts=1.000.
I'll let you know the result.

It was knocked out of the database because of the deadline:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=109355962
But... it is still running, 80 hours so far, with a 777.8 KByte runRivet.log :-))
@Bronco wrote: Go Sherpa go!
ID: 38381
pianoman

Joined: 29 Jun 18
Posts: 6
Credit: 3,288,185
RAC: 2,182
Message 38383 - Posted: 23 Mar 2019, 1:51:00 UTC - in response to Message 38374.  

Your Debian machine may be handling cgroups differently. Did you do the steps for Suspend/Resume from the instructions?


I did, but I'll dig a little deeper into the logs to make sure those scripts are getting run, thanks for the pointer.
ID: 38383
pianoman

Joined: 29 Jun 18
Posts: 6
Credit: 3,288,185
RAC: 2,182
Message 38384 - Posted: 23 Mar 2019, 2:08:13 UTC - in response to Message 38383.  

Ahh, well, one reason: the Debian Buster 4.19 kernel doesn't have CONFIG_CGROUP_HUGETLB set, so there is no /sys/fs/cgroup/hugetlb directory. I wonder why...
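A quick way to check for this is to grep the kernel config. This is a sketch: CONFIG_CGROUP_HUGETLB comes from this thread, while CONFIG_USER_NS and CONFIG_FUSE_FS are my assumed names for the namespace and FUSE options mentioned elsewhere in the thread.

```shell
# Sketch: report whether a kernel config file enables the options discussed
# here. CONFIG_CGROUP_HUGETLB is from this thread; the other names are
# assumptions, so extend the list as needed.
check_kconfig() {
    cfg="$1"    # e.g. /boot/config-$(uname -r), or a copy of /proc/config.gz
    for opt in CONFIG_CGROUP_HUGETLB CONFIG_USER_NS CONFIG_FUSE_FS; do
        if grep -Eq "^$opt=(y|m)" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
        fi
    done
}

# check_kconfig "/boot/config-$(uname -r)"
```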
ID: 38384
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38386 - Posted: 23 Mar 2019, 10:57:05 UTC - in response to Message 38381.  

No result so far. Sherpa 1.4.3 has now been running for 25 hours, and I'm hoping it comes to a good end.
runRivet.log is growing every minute (255 KByte), nevts=1.000.
I'll let you know the result.

It was knocked out of the database because of the deadline:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=109355962
But... it is still running, 80 hours so far, with a 777.8 KByte runRivet.log :-))
@Bronco wrote: Go Sherpa go!

Now I see I was wrong :-(
Die, sherpa, die!!
Setting my watchdog script to immediately abort any sherpa less than version 5.0.0.
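The version check in such a watchdog can be done with `sort -V`. The helper below is a hypothetical sketch (the actual watchdog script isn't posted in this thread), and the `boinccmd` usage in the comment is only an illustration.

```shell
# Hypothetical helper for a watchdog like the one described: succeeds when
# version $1 sorts strictly before version $2 under GNU version ordering.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Illustration only: abort a task whose Sherpa version is below 5.0.0, e.g.
#   if version_lt "$sherpa_ver" 5.0.0; then
#       boinccmd --task "$project_url" "$task_name" abort
#   fi
```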
ID: 38386
Jim1348

Joined: 15 Nov 14
Posts: 454
Credit: 12,303,409
RAC: 3,335
Message 38402 - Posted: 24 Mar 2019, 4:24:19 UTC

I am now getting both Native Theory and Native ATLAS with the appropriate preferences selected.
Everything is working well.
ID: 38402
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38403 - Posted: 24 Mar 2019, 9:49:50 UTC - in response to Message 38402.  

I am now getting both Native Theory and Native ATLAS with the appropriate preferences selected.
Everything is working well.

Yes, very well. I have a native ATLAS and a native Theory running concurrently on a host with only 4 GB RAM :)
Hoping to add native CMS to the mix someday.
ID: 38403
Jim1348

Joined: 15 Nov 14
Posts: 454
Credit: 12,303,409
RAC: 3,335
Message 38405 - Posted: 24 Mar 2019, 13:25:41 UTC - in response to Message 38403.  

Hoping to add native CMS to the mix someday.

Exactly, but I have to keep my health up to live that long.
ID: 38405
G_UK
Joined: 21 Nov 10
Posts: 5
Credit: 1,513,745
RAC: 77
Message 38417 - Posted: 24 Mar 2019, 21:47:59 UTC

OK, I have managed to get these Native tasks running on Debian Buster although it has several dependencies.

1) You will need to build a custom kernel. I built a stock 4.19 kernel from kernel.org with the following additional options enabled:
- All controllers in "General Setup -> Control Group Support"
- All namespaces in "General Setup -> Namespaces support"
- FUSE filesystem in "Filesystems -> FUSE"
(Note: you will need to enable additional drivers etc. depending on your machine)

2) Set up the CERN repo and install CVMFS.
Follow the instructions here: https://cvmfs.readthedocs.io/en/stable/cpt-quickstart.html

For Debian you will need to edit the apt repo before installing CVMFS, as it defaults to Ubuntu.
"sudo nano /etc/apt/sources.list.d/cernvm.list"
deb http://cvmrepo.web.cern.ch/cvmrepo/apt stretch-prod main

"sudo apt update"

3) Singularity (for ATLAS) can be installed directly from the Debian repo: "sudo apt install singularity-container"

4) Set up CVMFS as per the pinned instructions. As we have enabled namespaces in the kernel, you can skip the "Enabling user namespace" step.

I think that is everything I had to do. Along the way I ran into most of the error messages that have been posted so far.
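The repo fix in step 2 can be scripted. This is a sketch that writes the corrected source line (the stretch-prod distribution is taken from the post above; pick the one matching your release), after which you would update apt and install CVMFS.

```shell
# Sketch: point the CVMFS apt source at the Debian distribution instead of
# the Ubuntu default that the cvmfs-release package sets up.
write_cernvm_list() {
    # $1: target file, normally /etc/apt/sources.list.d/cernvm.list (root needed)
    printf 'deb http://cvmrepo.web.cern.ch/cvmrepo/apt stretch-prod main\n' > "$1"
}

# As root:
#   write_cernvm_list /etc/apt/sources.list.d/cernvm.list
#   apt update && apt install cvmfs
```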
https://gridcoin.ddns.net
Rx7Qvpvc7qEZbrHeJ8ukknJo4Gwrc3unBg
Bitshares: g-uk https://wallet.bitshares.org
ID: 38417
G_UK
Joined: 21 Nov 10
Posts: 5
Credit: 1,513,745
RAC: 77
Message 38420 - Posted: 24 Mar 2019, 23:20:38 UTC

My tasks are now completing successfully; however, they still have errors in the log.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
22:36:06 (28989): wrapper (7.15.26016): starting
22:36:06 (28989): wrapper (7.15.26016): starting
22:36:06 (28989): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Detected TheoryN App
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Checking CVMFS.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Checking runc.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Creating the filesystem.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Creating cgroup for slot 8
mkdir: cannot create directory ‘/sys/fs/cgroup/net_cls’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/net_prio’: Read-only file system
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Updating config.json.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Running Container 'runc'.
23:03:50 2019-03-24: cranky-0.0.28: [INFO] Container 'runc' finished with status code 0.
===> [runRivet] Sun Mar 24 22:36:06 UTC 2019 [boinc pp jets 7000 - - pythia8 8.226 tune-CUETP8S1 100000 34]
23:03:50 2019-03-24: cranky-0.0.28: [INFO] Preparing output.
23:03:50 (28989): cranky exited; CPU time 2011.657096
23:03:50 (28989): called boinc_finish(0)

</stderr_txt>
]]>

ID: 38420
pianoman

Joined: 29 Jun 18
Posts: 6
Credit: 3,288,185
RAC: 2,182
Message 38424 - Posted: 25 Mar 2019, 0:40:49 UTC - in response to Message 38420.  

I think the cgroup sysfs error messages are a red herring. I mean, yes, they're errors, but I think runc is designed to keep working with those present. I'm still running the stock Buster kernel, and at least one of my two machines that was experiencing errors started to work fine even though I didn't really change anything.

I don't think the other one has received any more TheoryN tasks, so I don't know whether it magically solved itself or not.

And thank you G_UK, I forgot to mention I had to change the repo to stretch by hand as well; I expect that comes from running testing.
ID: 38424
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 38428 - Posted: 25 Mar 2019, 11:17:21 UTC - in response to Message 38417.  

OK, I have managed to get these Native tasks running on Debian Buster although it has several dependencies.

1) You will need to build a custom kernel. I built a stock 4.19 kernel from kernel.org with the following additional options enabled:
- All controllers in "General Setup -> Control Group Support"
- All namespaces in "General Setup -> Namespaces support"
- FUSE filesystem in "Filesystems -> FUSE"
(Note: you will need to enable additional drivers etc. depending on your machine)


Why did you need to do this? Will the default Buster installation not support rootless containers?

2) Set up the CERN repo and install CVMFS.
Follow the instructions here: https://cvmfs.readthedocs.io/en/stable/cpt-quickstart.html

For Debian you will need to edit the apt repo before installing CVMFS, as it defaults to Ubuntu.
"sudo nano /etc/apt/sources.list.d/cernvm.list"
deb http://cvmrepo.web.cern.ch/cvmrepo/apt stretch-prod main

"sudo apt update"

Do we need to update the instructions?

3) Singularity (for ATLAS) can be installed directly from the Debian repo: "sudo apt install singularity-container"

4) Set up CVMFS as per the pinned instructions. As we have enabled namespaces in the kernel, you can skip the "Enabling user namespace" step.

I think that is everything I had to do. Along the way I ran into most of the error messages that have been posted so far.
ID: 38428
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 38429 - Posted: 25 Mar 2019, 11:18:53 UTC - in response to Message 38420.  

My tasks are now completing successfully; however, they still have errors in the log.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
22:36:06 (28989): wrapper (7.15.26016): starting
22:36:06 (28989): wrapper (7.15.26016): starting
22:36:06 (28989): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Detected TheoryN App
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Checking CVMFS.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Checking runc.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Creating the filesystem.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Creating cgroup for slot 8
mkdir: cannot create directory ‘/sys/fs/cgroup/net_cls’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/net_prio’: Read-only file system
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Updating config.json.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Running Container 'runc'.
23:03:50 2019-03-24: cranky-0.0.28: [INFO] Container 'runc' finished with status code 0.
===> [runRivet] Sun Mar 24 22:36:06 UTC 2019 [boinc pp jets 7000 - - pythia8 8.226 tune-CUETP8S1 100000 34]
23:03:50 2019-03-24: cranky-0.0.28: [INFO] Preparing output.
23:03:50 (28989): cranky exited; CPU time 2011.657096
23:03:50 (28989): called boinc_finish(0)

</stderr_txt>
]]>


Runc will ignore this but some features (which we may or may not need) will not be available. I will suppress those error messages in a future release.
ID: 38429
G_UK
Joined: 21 Nov 10
Posts: 5
Credit: 1,513,745
RAC: 77
Message 38431 - Posted: 25 Mar 2019, 20:07:42 UTC - in response to Message 38428.  
Last modified: 25 Mar 2019, 20:12:43 UTC


Why did you need to do this? Will the default Buster installation not support rootless containers?


I couldn't get the work units to run with the stock Buster kernel, and I saw the post by @pianoman mentioning that CONFIG_CGROUP_HUGETLB was not set in the kernel, so I thought I would try enabling it.

After building a new kernel with the mentioned features enabled, the work units could run, so I posted what I had done to get it working. Since then, however, @pianoman has posted to say his machine suddenly started working, so it may have been something else that resolved itself and just coincided with my rebuilding the kernel. When I get back home later I will switch one machine back to the stock Buster kernel and test again.


Do we need to update the instructions?


When following these instructions for Debian/Ubuntu, I found that the cvmfs-release-latest_all.deb package sets up an apt repository for Ubuntu Precise rather than Debian Stretch.

This is a problem as there is a difference in version numbers for some cvmfs dependencies between Ubuntu and Debian. To fix it you need to edit the apt sources file that the package creates to point at the correct distribution.


Runc will ignore this but some features (which we may or may not need) will not be available. I will suppress those error messages in a future release


I think I was getting those because I didn't have net-cls or net-prio enabled in the kernel options. I did rebuild the kernel last night but have not had a chance to check it yet as I have been at work since. As mentioned above though I need to switch one machine back to the stock Debian Buster kernel tonight to double check.

Edit: Just checked one machine's completed work units, and everything after 8 AM this morning is no longer getting the errors, so the rebuild with net-cls and net-prio worked.

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10586088
ID: 38431
G_UK
Joined: 21 Nov 10
Posts: 5
Credit: 1,513,745
RAC: 77
Message 38432 - Posted: 25 Mar 2019, 22:02:17 UTC

Right, I've got home and downgraded one machine back to the default Debian Buster kernel:

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10586087

As you can see, the jobs started failing again, so something is wrong with this set-up. Next I'll try rebuilding the kernel with the Debian config applied to a vanilla kernel.org 4.19 (with and without CONFIG_CGROUP_HUGETLB enabled); this should hopefully tell us whether a Debian-Buster-specific patch is throwing it off.
ID: 38432
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38443 - Posted: 26 Mar 2019, 15:55:19 UTC - in response to Message 38405.  

Hoping to add native CMS to the mix someday.

Exactly, but I have to keep my health up to live that long.

Optimism promotes longevity and it tastes better than liver, kale and fat free ice cream.
If you haven't already done so, snag some sixtrack and see how nicely they play with native ATLAS and native Theory. With "switch between tasks every..." set to 2080 minutes and a sane task cache, nothing gets preempted, which avoids ATLAS restarting from 0 events. Very nice!!
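For reference, that "switch between tasks every..." setting can also be pinned in BOINC's global_prefs_override.xml. A minimal sketch, assuming the standard `cpu_scheduling_period_minutes` preference tag (check your client's documentation for the exact file location):

```xml
<!-- global_prefs_override.xml in the BOINC data directory (location varies
     by install); sets "switch between tasks every" to 2080 minutes -->
<global_preferences>
   <cpu_scheduling_period_minutes>2080.0</cpu_scheduling_period_minutes>
</global_preferences>
```

After editing, re-read the config from the BOINC Manager or with `boinccmd --read_global_prefs_override`.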
ID: 38443
Jim1348

Joined: 15 Nov 14
Posts: 454
Credit: 12,303,409
RAC: 3,335
Message 38445 - Posted: 26 Mar 2019, 16:10:52 UTC - in response to Message 38443.  
Last modified: 26 Mar 2019, 16:21:43 UTC

If you haven't already done so, snag some sixtrack and see how nicely they play with native ATLAS and native Theory. With "switch between tasks every..." set to 2080 minutes and a sane task cache, nothing gets preempted, which avoids ATLAS restarting from 0 events. Very nice!!

I avoid sixtrack. It is too easy, and requires no special software. Anyone can run it, so I let them. But I did pick up a few inadvertently when I had "allow other work" selected. The problem is that the BOINC scheduler does not deal with single-thread and multi-thread work units very well, and often runs both for a while on more cores than allotted. (It is possible that the mt tasks aren't really using all the cores all the time, and the BOINC scheduler may be doing the right thing, but it looks strange.)

And with all the support from the large computer centers, LHC doesn't really need me anyway. It is only all the strange software that makes it worthwhile.
ID: 38445


©2020 CERN