1) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 38433)
Posted 26 Mar 2019 by pianoman
Post:
I have what appears to be my first native Sherpa job. It has been running at 100% on one core for 1d 20h and is 11h from its deadline.
I can't find a running.log on my system, but runRivet.log shows:

===> [runRivet] Sun Mar 24 06:22:54 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 2.1.1 default 4000 34]


the last lines of that file are:
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).


and the last lines before those are
...
Initialized the Beam_Remnant_Handler.
Hadron_Decay_Map::Read:   Initializing HadronDecays.dat. This may take some time.
Initialized the Hadron_Decay_Handler, Decay model = Hadrons
Initialized the Soft_Photon_Handler.
Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__e-__nu_eb' (Internal)
Starting the calculation at 06:23:26. Lean back and enjoy ... .
Updating display...
Display update finished (0 histograms, 0 events).
...


I don't know anything about Sherpa jobs. Is it working or not?
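
For anyone wanting to check a log like this without scrolling, here is a minimal sketch that counts how many of the most recent "Display update finished" lines still report 0 events. The function name and the sample list are made up for illustration; only the line format is taken from the runRivet.log excerpt above. A high count only shows that no events have been reported yet. It does not prove the job is stuck, since the log shows it is still in the cross-section calculation.

```python
import re

# Matches the progress lines seen in the runRivet.log excerpt above.
UPDATE_RE = re.compile(r"Display update finished \((\d+) histograms, (\d+) events\)\.")

def trailing_zero_updates(lines):
    """Count the most recent update lines that report 0 events (hypothetical helper)."""
    count = 0
    for line in reversed(lines):
        m = UPDATE_RE.search(line)
        if m is None:
            continue  # skip "Updating display..." and any other lines
        if int(m.group(2)) != 0:
            break  # an update with events ends the run of zero-event updates
        count += 1
    return count

# Sample data copied from the log format above, not read from a real file.
sample = [
    "Starting the calculation at 06:23:26. Lean back and enjoy ... .",
    "Updating display...",
    "Display update finished (0 histograms, 0 events).",
    "Updating display...",
    "Display update finished (0 histograms, 0 events).",
]
print(trailing_zero_updates(sample))
```

To use it on a real file, pass `open("runRivet.log").read().splitlines()` instead of `sample`.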
2) Message boards : Theory Application : Issues Native Theory application (Message 38424)
Posted 25 Mar 2019 by pianoman
Post:
I think the cgroup sysfs error messages are a red herring.. I mean, yes, they're errors, but I think runc is designed to still work with those present. I'm still running the stock Buster kernel, and at least one of my two machines that was experiencing errors started to work fine when I really didn't change anything.

The other one I don't think has received any more TheoryN tasks so I don't know if that one magically solved itself or not.

And thank you G_UK, I forgot to mention I had to change the repo to stretch by hand as well; I expect that when running testing.
3) Message boards : Theory Application : Issues Native Theory application (Message 38384)
Posted 23 Mar 2019 by pianoman
Post:
Ahh, well, for one reason, the debian buster 4.19 kernel doesn't have CONFIG_CGROUP_HUGETLB set, so there is no /sys/fs/cgroup/hugetlb directory. I wonder why..
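
A quick way to confirm this on another machine is to look the option up in the installed kernel config. This is a rough sketch, assuming the distribution ships the config as /boot/config-<kernel release> (Debian and Ubuntu normally do); the helper name is made up for illustration.

```python
import platform

def option_enabled(config_text, option):
    """Return True if `option` is set to y or m in kernel config text."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1] in ("y", "m")
    # Option absent, or present only as "# CONFIG_X is not set".
    return False

# Assumption: the kernel config lives at /boot/config-<release>.
path = "/boot/config-" + platform.release()
try:
    with open(path) as f:
        print(option_enabled(f.read(), "CONFIG_CGROUP_HUGETLB"))
except OSError:
    print("could not read kernel config at", path)
```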
4) Message boards : Theory Application : Issues Native Theory application (Message 38383)
Posted 23 Mar 2019 by pianoman
Post:
Your Debian machine may be handling cgroups differently. Did you do the steps for Suspend/Resume from the instructions?


I did, but I'll dig a little deeper into the logs to make sure those scripts are getting run, thanks for the pointer.
5) Message boards : Theory Application : Issues Native Theory application (Message 38370)
Posted 22 Mar 2019 by pianoman
Post:
Seeing failed tasks on Debian testing (buster). Yes, I know that's unstable, but it looks to be a container permission issue. Did I miss a step? I have 5 machines: two running Ubuntu 18.04 run Theory native just fine (a third has its first native Theory job in its queue), but my two Debian buster systems are throwing this error:

Edit: both systems run native ATLAS just fine.

machine 1:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=219687183
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:21:36 (21317): wrapper (7.15.26016): starting
13:21:36 (21317): wrapper (7.15.26016): starting
13:21:36 (21317): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Detected TheoryN App
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Checking CVMFS.
17:21:36 2019-03-21: cranky-0.0.28: [INFO] Checking runc.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Creating the filesystem.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Creating cgroup for slot 8
mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/boinc’: Permission denied
mkdir: cannot create directory ‘/sys/fs/cgroup/hugetlb’: Read-only file system
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Updating config.json.
17:21:38 2019-03-21: cranky-0.0.28: [INFO] Running Container 'runc'.
container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/blkio/boinc: permission denied\""
17:21:38 2019-03-21: cranky-0.0.28: [ERROR] Container 'runc' terminated with status code 1.
13:21:38 (21317): cranky exited; CPU time 0.136671
13:21:38 (21317): app exit status: 0xce
13:21:38 (21317): called boinc_finish(195)

</stderr_txt>
]]>


And a similar message on machine 2:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=219590462
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
01:41:48 (31651): wrapper (7.15.26016): starting
01:41:48 (31651): wrapper (7.15.26016): starting
01:41:48 (31651): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.28 ()
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Detected TheoryN App
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Checking CVMFS.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Checking runc.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Creating the filesystem.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Creating cgroup for slot 7
mkdir: cannot create directory ‘/sys/fs/cgroup/hugetlb’: Read-only file system
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Updating config.json.
05:41:48 2019-03-20: cranky-0.0.28: [INFO] Running Container 'runc'.
05:41:48 2019-03-20: cranky-0.0.28: [ERROR] Container 'runc' terminated with status code 139.
01:41:49 (31651): cranky exited; CPU time 0.065568
01:41:49 (31651): app exit status: 0xce
01:41:49 (31651): called boinc_finish(195)

</stderr_txt>
]]>
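
To compare machines, a small sketch like the one below can report which cgroup controller directories exist and whether the current user can write to them. The function name is hypothetical, and the controller list is only the two names that appear in the mkdir errors above. It is not what cranky itself does; it just inspects the same paths.

```python
import os

def cgroup_report(root, controllers):
    """Map each controller name to 'missing', 'writable', or 'not writable'."""
    report = {}
    for name in controllers:
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            report[name] = "missing"
        elif os.access(path, os.W_OK):
            report[name] = "writable"
        else:
            report[name] = "not writable"
    return report

# Controller names taken from the error messages above; run this as the
# same user that runs BOINC, since the result depends on that user's permissions.
print(cgroup_report("/sys/fs/cgroup", ["blkio", "hugetlb"]))
```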
6) Message boards : Theory Application : Issues Native Theory application (Message 38331)
Posted 20 Mar 2019 by pianoman
Post:
Took me some fits and starts to set up on my small set of 5 computers. Looks to be working on at least some of them now, but I've had several failed tasks and it looks like my machines are being throttled to 1 task a day. I'm assuming as I start completing tasks without error, that throttling will gradually resolve itself?

By the way, boinc is a 'fire and forget' thing for me, I'm glad someone thought to reach out over email and let me know. I wouldn't have noticed for a long time. I didn't even know native apps without virtualbox were an option (edit: and atlas too!)! I'm gleefully removing virtualbox from all my systems now...


