Message boards :
Theory Application :
Issues Native Theory application
Author | Message |
---|---|
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
It looks like I completed a sherpa, but it was only a 2.2.0 and not that long. https://lhcathome.cern.ch/lhcathome/result.php?resultid=219413872 Does it count? |
Joined: 22 Mar 17 Posts: 66 Credit: 25,047,948 RAC: 42,717 |
I see that the maximum runtime of the last 100 tasks is 25.06 hours. I aborted this one: it had been running for 2 days, while the longest one I've completed took 8 hours. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=109410751 |
Joined: 29 Jun 18 Posts: 6 Credit: 5,314,428 RAC: 0 |
Seeing failed tasks on Debian testing (Buster). Yes, I know that's unstable, but it looks to be a container permission issue. Did I miss a step? I have 5 machines: two running Ubuntu 18.04 run Theory native just fine (a third has its first native Theory job in its queue), but my two Debian Buster systems are throwing this error:
Edit: both systems run native ATLAS just fine.
machine 1: https://lhcathome.cern.ch/lhcathome/result.php?resultid=219687183
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message> process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:21:36 (21317): wrapper (7.15.26016): starting
13:21:36 (21317): wrapper (7.15.26016): starting
13:21:36 (21317): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Detected TheoryN App
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Checking CVMFS.
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Checking runc.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Creating the filesystem.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Creating cgroup for slot 8
mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/boinc’: Permission denied
mkdir: cannot create directory ‘/sys/fs/cgroup/hugetlb’: Read-only file system
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Updating config.json.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Running Container 'runc'.
container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/blkio/boinc: permission denied\""
17:21:38 2019-03-21: cranky-0.0.28: [ERROR] Container 'runc' terminated with status code 1.
13:21:38 (21317): cranky exited; CPU time 0.136671
13:21:38 (21317): app exit status: 0xce
13:21:38 (21317): called boinc_finish(195)
</stderr_txt>
]]>
And a similar message on machine 2: https://lhcathome.cern.ch/lhcathome/result.php?resultid=219590462
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message> process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
01:41:48 (31651): wrapper (7.15.26016): starting
01:41:48 (31651): wrapper (7.15.26016): starting
01:41:48 (31651): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Detected TheoryN App
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Checking CVMFS.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Checking runc.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Creating the filesystem.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Creating cgroup for slot 7
mkdir: cannot create directory ‘/sys/fs/cgroup/hugetlb’: Read-only file system
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Updating config.json.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Running Container 'runc'.
05:41:48 2019-03-20: cranky-0.0.28: [ERROR] Container 'runc' terminated with status code 139.
01:41:49 (31651): cranky exited; CPU time 0.065568
01:41:49 (31651): app exit status: 0xce
01:41:49 (31651): called boinc_finish(195)
</stderr_txt>
]]> |
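When comparing failures across several hosts, it can help to pull just the cgroup `mkdir` failures and the runc exit status out of each stderr dump. The following is a minimal sketch, not part of cranky or BOINC; the function name `summarize_cranky_log` is hypothetical, and the patterns are modelled only on the log lines quoted in this thread, so other cranky versions may word things differently.

```python
import re

# Patterns modelled on the log lines quoted in this thread (assumption:
# other cranky versions may format these messages differently).
MKDIR_FAILURE = re.compile(
    r"mkdir: cannot create directory [‘'](?P<path>[^’']+)[’']: "
    r"(?P<reason>Permission denied|Read-only file system)"
)
CONTAINER_EXIT = re.compile(
    r"Container 'runc' (?:terminated|finished) with status code (?P<code>\d+)"
)


def summarize_cranky_log(stderr_text):
    """Collect cgroup mkdir failures and the runc exit status from a stderr dump."""
    failures = [
        (match.group("path"), match.group("reason"))
        for match in MKDIR_FAILURE.finditer(stderr_text)
    ]
    exit_match = CONTAINER_EXIT.search(stderr_text)
    exit_code = int(exit_match.group("code")) if exit_match else None
    return {"mkdir_failures": failures, "runc_exit_code": exit_code}
```

A "Permission denied" on a controller directory together with a non-zero exit code matches the failing case on machine 1, while "Read-only file system" messages also appear in a later log in this thread from a task that finished with status code 0.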
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0 |
Your Debian machine may be handling cgroups differently. Did you do the steps for Suspend/Resume from the instructions? |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 11,545 |
No result so far. Sherpa 1.4.3 for now 25 hours running and hoping to get a good end therefore. According to the database, it has been knocked out because of the deadline: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=109355962 But... it is still running, 80 hours so far, with a 777.8 KByte runRivet.log :-)) @Bronco wrote: Go Sherpa go! |
Joined: 29 Jun 18 Posts: 6 Credit: 5,314,428 RAC: 0 |
"Your Debian machine may be handling cgroups differently. Did you do the steps for Suspend/Resume from the instructions?" I did, but I'll dig a little deeper into the logs to make sure those scripts are getting run. Thanks for the pointer. |
Joined: 29 Jun 18 Posts: 6 Credit: 5,314,428 RAC: 0 |
Ahh, well, for one thing, the Debian Buster 4.19 kernel doesn't have CONFIG_CGROUP_HUGETLB set, so there is no /sys/fs/cgroup/hugetlb directory. I wonder why... |
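A quick way to confirm which options a kernel was built with is to scan its config file. The sketch below assumes the usual kernel config text format (`CONFIG_X=y`, `CONFIG_X=m`, or `# CONFIG_X is not set`); the helper name `missing_kernel_options` is hypothetical, and only CONFIG_CGROUP_HUGETLB is taken from this thread.

```python
def missing_kernel_options(config_text, required):
    """Return the options from `required` that are not built in (y) or modular (m)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Unset options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [option for option in required if option not in enabled]
```

On Debian the config for the running kernel is normally found at /boot/config-<kernel release>, so reading that file and passing its text with `["CONFIG_CGROUP_HUGETLB"]` should show whether the option is missing on a given host.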
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"No result so far. Sherpa 1.4.3 for now 25 hours running and hoping to get a good end therefore." Now I see I was wrong :-( Die, sherpa, die!! Setting my watchdog script to immediately abort any sherpa less than version 5.0.0. |
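The decision step of such a watchdog could look roughly like the sketch below. This is not the poster's script: the function names are hypothetical, and the field layout of the runRivet header is an assumption taken from the single header line quoted later in this thread (`[boinc pp jets 7000 - - pythia8 8.226 tune-CUETP8S1 100000 34]`), where the generator and its version appear as the sixth and seventh fields after `boinc`.

```python
import re

# Matches the bracketed job description that follows "===> [runRivet] <date>".
RUNRIVET_HEADER = re.compile(r"\[boinc (?P<fields>[^\]]+)\]")


def parse_generator(log_text):
    """Return (generator, version) from the first runRivet header, or None."""
    match = RUNRIVET_HEADER.search(log_text)
    if match is None:
        return None
    fields = match.group("fields").split()
    if len(fields) < 7:
        return None
    # Assumed layout: beam process energy param param generator version ...
    return fields[5], fields[6]


def version_tuple(version):
    """Turn '1.4.3' into (1, 4, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def should_abort(log_text, generator="sherpa", min_version="5.0.0"):
    """True when the task runs `generator` at a version below `min_version`."""
    parsed = parse_generator(log_text)
    if parsed is None:
        return False
    name, version = parsed
    try:
        return name == generator and version_tuple(version) < version_tuple(min_version)
    except ValueError:
        # Non-numeric version strings are left alone rather than aborted.
        return False
```

The sketch only makes the decision; actually aborting the task (for example through the BOINC client's own tools) is left out on purpose.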
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
I am now getting both Native Theory and Native ATLAS with the appropriate preferences selected. Everything is working well. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"I am now getting both Native Theory and Native ATLAS with the appropriate preferences selected." Yes, very well. I have a native ATLAS and a native Theory running concurrently on a host with only 4 GB RAM :) Hoping to add native CMS to the mix someday. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"Hoping to add native CMS to the mix someday." Exactly, but I have to keep my health up to live that long. |
Joined: 21 Nov 10 Posts: 5 Credit: 2,007,500 RAC: 0 |
OK, I have managed to get these Native tasks running on Debian Buster, although it has several dependencies.
1) You will need to build a custom kernel. I built a stock 4.19 kernel from kernel.org with the following additional options enabled:
- All Controllers in "General Setup -> Control Group Support"
- All Namespaces in "General Setup -> Namespaces support"
- FUSE filesystem in "Filesystems -> FUSE"
(Note: you will need to enable additional drivers etc. depending on your machine)
2) Set up the CERN repo and install CVMFS. Follow the instructions here: https://cvmfs.readthedocs.io/en/stable/cpt-quickstart.html
For Debian you will need to edit the apt repo before installing CVMFS, as it defaults to Ubuntu:
"sudo nano /etc/apt/sources.list.d/cernvm.list"
deb http://cvmrepo.web.cern.ch/cvmrepo/apt stretch-prod main
"sudo apt update"
3) Singularity (for ATLAS) can be installed direct from the Debian repo:
"sudo apt install singularity-container"
4) Set up CVMFS as per the pinned instructions. As we have enabled namespaces in the kernel, you can skip the "Enabling user namespace" step.
I think that is everything I had to do. I ended up getting most of the error messages that have been posted so far at various stages of getting this running.
https://gridcoin.ddns.net Rx7Qvpvc7qEZbrHeJ8ukknJo4Gwrc3unBg Bitshares: g-uk https://wallet.bitshares.org |
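The manual edit in step 2 can also be expressed as a small text transformation. This is only a sketch under stated assumptions: the helper name `retarget_apt_suite` is hypothetical, it handles plain one-line `deb URI suite component` entries only, and the `precise-prod` suite in the usage example is a guess at what the package writes (the thread only says it targets Ubuntu Precise).

```python
def retarget_apt_suite(sources_text, suite="stretch-prod"):
    """Rewrite the suite field of every active one-line 'deb' entry."""
    rewritten = []
    for line in sources_text.splitlines():
        parts = line.split()
        # Active entry: deb|deb-src URI suite component [component ...].
        # Entries carrying [options] and commented-out lines are left untouched.
        if (
            len(parts) >= 4
            and parts[0] in ("deb", "deb-src")
            and not parts[1].startswith("[")
        ):
            parts[2] = suite
            line = " ".join(parts)
        rewritten.append(line)
    trailing_newline = "\n" if sources_text.endswith("\n") else ""
    return "\n".join(rewritten) + trailing_newline
```

Applied to the text of /etc/apt/sources.list.d/cernvm.list, a line such as `deb http://cvmrepo.web.cern.ch/cvmrepo/apt precise-prod main` would come out as the `stretch-prod` line shown in step 2; the result still has to be written back with root privileges and followed by "sudo apt update".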
Joined: 21 Nov 10 Posts: 5 Credit: 2,007,500 RAC: 0 |
My tasks are now completing successfully; however, they still have errors in the log.
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
22:36:06 (28989): wrapper (7.15.26016): starting
22:36:06 (28989): wrapper (7.15.26016): starting
22:36:06 (28989): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Detected TheoryN App
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Checking CVMFS.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Checking runc.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Creating the filesystem.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Creating cgroup for slot 8
mkdir: cannot create directory ‘/sys/fs/cgroup/net_cls’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/net_prio’: Read-only file system
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Updating config.json.
22:36:06 2019-03-24: cranky-0.0.28: [INFO] Running Container 'runc'.
23:03:50 2019-03-24: cranky-0.0.28: [INFO] Container 'runc' finished with status code 0.
===> [runRivet] Sun Mar 24 22:36:06 UTC 2019 [boinc pp jets 7000 - - pythia8 8.226 tune-CUETP8S1 100000 34]
23:03:50 2019-03-24: cranky-0.0.28: [INFO] Preparing output.
23:03:50 (28989): cranky exited; CPU time 2011.657096
23:03:50 (28989): called boinc_finish(0)
</stderr_txt>
]]> |
Joined: 29 Jun 18 Posts: 6 Credit: 5,314,428 RAC: 0 |
I think the cgroup sysfs error messages are a red herring... I mean, yes, they're errors, but I think runc is designed to still work with those present. I'm still running the stock Buster kernel, and at least one of my two machines that was experiencing errors started to work fine when I really didn't change anything. I don't think the other one has received any more TheoryN tasks, so I don't know whether that one magically solved itself or not. And thank you G_UK, I forgot to mention I had to change the repo to stretch by hand as well; I expect that when running testing. |
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0 |
"OK, I have managed to get these Native tasks running on Debian Buster although it has several dependencies." Why did you need to do this? Will the default Buster installation not support rootless containers?
Do we need to update the instructions?
|
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0 |
"My tasks are now completing successfully however they still have errors in the log." Runc will ignore this, but some features (which we may or may not need) will not be available. I will suppress those error messages in a future release. |
Joined: 21 Nov 10 Posts: 5 Credit: 2,007,500 RAC: 0 |
I couldn't get the work units to run with the stock Buster kernel, and I saw the post by @pianoman mentioning that CONFIG_CGROUP_HUGETLB was not set in the kernel, so I thought I would try enabling it. After I built a new kernel with the mentioned features enabled, the work units could run, so I posted what I had done to get it working. Since then, however, @pianoman has posted to say his has suddenly started working, so it may have been something else that resolved itself and just coincided with me rebuilding the kernel. When I get back home later I will switch one machine back to the stock Buster kernel and test again.
When following these instructions for Debian/Ubuntu, I found that the cvmfs-release-latest_all.deb package sets up an apt repository for Ubuntu Precise rather than Debian Stretch. This is a problem, as some cvmfs dependencies have different version numbers on Ubuntu and Debian. To fix it you need to edit the apt sources file that the package creates to point at the correct distribution.
I think I was getting those errors because I didn't have net-cls or net-prio enabled in the kernel options. I did rebuild the kernel last night but have not had a chance to check it yet, as I have been at work since. As mentioned above, though, I need to switch one machine back to the stock Debian Buster kernel tonight to double-check. Edit: Just checked one machine's completed work units, and everything after 8 AM this morning is no longer getting the errors, so the rebuild with net-cls and net-prio worked. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10586088 |
Joined: 21 Nov 10 Posts: 5 Credit: 2,007,500 RAC: 0 |
Right, I've got home and downgraded one machine back to the default Debian Buster kernel: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10586087 As you can see, the jobs started failing again, so something is wrong with this set-up. Next I'll try rebuilding the kernel with the Debian config applied to a vanilla kernel.org 4.19 (with and without CONFIG_CGROUP_HUGETLB enabled). This should hopefully tell us whether it is a Debian Buster specific patch that is throwing it off. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"Hoping to add native CMS to the mix someday." Optimism promotes longevity and it tastes better than liver, kale and fat-free ice cream. If you haven't already done so, snag some sixtrack and see how nicely they play with native ATLAS and native Theory. With "switch between tasks every..." set to 2080 minutes and a sane task cache, nothing gets preempted, which avoids ATLAS restarting from 0 events. Very nice!! |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"If you haven't already done so, snag some sixtrack and see how nicely they play with native ATLAS and native Theory. With "switch between tasks every..." set to 2080 minutes and a sane task cache nothing gets preempted which avoids ATLAS restarting from 0 events. Very nice!!" I avoid sixtrack. It is too easy, and requires no special software. Anyone can run it, so I let them. But I did pick up a few inadvertently when I had "allow other work" selected. The problem is that the BOINC scheduler does not deal with single-thread and multi-thread work units very well, and often runs both for a while on more cores than allotted. (It is possible that the mt tasks aren't really using all the cores all the time, and the BOINC scheduler may be doing the right thing, but it looks strange.) And with all the support from the large computer centers, LHC doesn't really need me anyway. It is only all the strange software that makes it worthwhile. |
©2025 CERN