Message boards : Theory Application : Issues Native Theory application

bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,257,683
RAC: 9,559
Message 38447 - Posted: 26 Mar 2019, 17:09:04 UTC - in response to Message 38445.  

I avoid sixtrack. It is too easy, and requires no special software. Anyone can run it, so I let them.
Good point. It illustrates the lost opportunity cost concept very well... when you crunch sixtrack you lose the opportunity to crunch a task that many other volunteers cannot crunch.
ID: 38447
Jim1348
Joined: 15 Nov 14
Posts: 330
Credit: 10,824,283
RAC: 18,660
Message 38448 - Posted: 26 Mar 2019, 17:31:19 UTC - in response to Message 38447.  

Good point. It illustrates the lost opportunity cost concept very well... when you crunch sixtrack you lose the opportunity to crunch a task that many other volunteers cannot crunch.

Good economics. I try to optimize the total return, while having fun at the same time (that is an economic benefit also).
ID: 38448
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 755
Credit: 6,034,572
RAC: 1,118
Message 38503 - Posted: 30 Mar 2019, 20:32:39 UTC

It rarely happens, but sometimes an error shows up among all the valid results.

After 1.5 hours runtime: Exit status 195 (0x000000C3) EXIT_CHILD_FAILED

https://lhcathome.cern.ch/lhcathome/result.php?resultid=220279871
ID: 38503
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 755
Credit: 6,034,572
RAC: 1,118
Message 38525 - Posted: 3 Apr 2019, 11:24:53 UTC

Another error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=220489764
Exit status 195 (0x000000C3) EXIT_CHILD_FAILED

Job description: ===> [runRivet] Wed Apr 3 08:48:44 UTC 2019 [boinc pp jets 8000 25 - pythia8 8.230 tune-monashstar 100000 38]

In BOINC's Event log:
Wed 03 Apr 2019 12:19:44 PM CEST | LHC@home | Computation for task TheoryN_2279-789428-38_0 finished
Wed 03 Apr 2019 12:19:44 PM CEST | LHC@home | Output file TheoryN_2279-789428-38_0_r937641198_result for task TheoryN_2279-789428-38_0 absent


Last lines of the runRivet.log after 100000 events processed:

Generator run finished successfully
100000 events processed
dumping histograms...
Rivet.Analysis.Handler: INFO Finalising analyses
terminate called after throwing an instance of 'YODA::LowStatsError'
what(): Requested variance of a distribution with only one effective entry
./runRivet.sh: line 376: 263 Aborted (core dumped) $rivetExecString (wd: /shared/tmp/tmp.jIgxbeAbd0)
INFO: waiting for jobs completion timeout=49
[1] 262 Done env $origEnv $generatorExecString
[3]+ 264 Running display_service $tmpd_dump "$beam $process $energy $params $generator $version $tune" &

Processing histograms...
input = /shared/tmp/tmp.jIgxbeAbd0/flat
output = /shared
./runRivet.sh: line 850: 264 Killed display_service $tmpd_dump "$beam $process $energy $params $generator $version $tune" (wd: /shared)
ERROR: following histograms should be produced according to run parameters,
but missing from Rivet output:

ATLAS_2015_I1393758_d01-x01-y01
ATLAS_2015_I1393758_d02-x01-y01
ATLAS_2015_I1393758_d03-x01-y01
ATLAS_2015_I1393758_d04-x01-y01
ATLAS_2015_I1393758_d05-x01-y01
ATLAS_2015_I1393758_d06-x01-y01
ATLAS_2015_I1393758_d07-x01-y01
ATLAS_2015_I1393758_d08-x01-y01
ATLAS_2015_I1393758_d09-x01-y01
ATLAS_2015_I1393758_d10-x01-y01
ATLAS_2015_I1393758_d11-x01-y01
ATLAS_2015_I1393758_d12-x01-y01
ATLAS_2016_I1419070_d01-x01-y01
ATLAS_2016_I1419070_d02-x01-y01
ATLAS_2016_I1419070_d03-x01-y01
ATLAS_2016_I1419070_d04-x01-y01
ATLAS_2016_I1419070_d05-x01-y01
ATLAS_2016_I1419070_d06-x01-y01
ATLAS_2016_I1419070_d07-x01-y01
ATLAS_2016_I1419070_d08-x01-y01
ATLAS_2016_I1419070_d09-x01-y01
ATLAS_2016_I1419070_d10-x01-y01
ATLAS_2016_I1419070_d11-x01-y01
ATLAS_2016_I1419070_d12-x01-y01

check mapping of above histograms in configuration file:
configuration/rivet-histograms.map
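
For anyone puzzling over the LowStatsError in the log above: an unbiased variance of weighted fills divides by (sum of weights)^2 minus (sum of squared weights), and that denominator is zero when a single fill carries all the weight. The Python below is only a sketch of that arithmetic, not YODA's code. The class and function names are made up for illustration, and using N_eff = (sum w)^2 / (sum w^2) as the quantity being checked is an assumption.

```python
class LowStatsError(Exception):
    """Hypothetical Python stand-in for the C++ YODA::LowStatsError."""


def weighted_variance(values, weights):
    """Unbiased variance of weighted fills; refuses when N_eff <= 1."""
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    sum_wx = sum(w * x for x, w in zip(values, weights))
    sum_wx2 = sum(w * x * x for x, w in zip(values, weights))
    if sum_w2 == 0:
        raise LowStatsError("no fills, variance undefined")
    # Effective number of entries (assumed definition, see the note above).
    n_eff = (sum_w * sum_w) / sum_w2
    if n_eff <= 1.0 + 1e-12:
        raise LowStatsError("only one effective entry, variance undefined")
    return (sum_wx2 * sum_w - sum_wx * sum_wx) / (sum_w * sum_w - sum_w2)


if __name__ == "__main__":
    print(weighted_variance([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
    try:
        weighted_variance([5.0], [1.0])
    except LowStatsError as err:
        print("refused:", err)
```

If that reading is right, a histogram that received a single fill in the whole run would be enough to abort the finalisation step even though the generator itself finished successfully.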
ID: 38525
gyllic
Joined: 9 Dec 14
Posts: 201
Credit: 2,500,279
RAC: 354
Message 38608 - Posted: 23 Apr 2019, 3:12:17 UTC - in response to Message 38503.  

It rarely happens, but sometimes an error shows up among all the valid results.

After 1.5 hours runtime: Exit status 195 (0x000000C3) EXIT_CHILD_FAILED

https://lhcathome.cern.ch/lhcathome/result.php?resultid=220279871
Same here. So far, 3 out of ~100 tasks have failed with the same error as mentioned above:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=221596097
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221500071
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221484725

Any idea why that happens?
ID: 38608
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 755
Credit: 6,034,572
RAC: 1,118
Message 38611 - Posted: 23 Apr 2019, 7:25:02 UTC - in response to Message 38608.  

Any idea why that happens?
From the previous post: mismatch between run parameters and Rivet output. The project has to solve this.
I'm not sure whether your errors have the same cause.
ID: 38611
Jim1348
Joined: 15 Nov 14
Posts: 330
Credit: 10,824,283
RAC: 18,660
Message 38655 - Posted: 27 Apr 2019, 14:38:37 UTC

FWIW, I have always wondered whether you could run VBox and Native work units on the same machine if you used two separate BOINC instances. It turns out that you can, provided you put the VBox ones in the original BOINC instance (VBox work does not run in the second instance, at least under Ubuntu 16.04.6).

I am running CMS in the first BOINC instance, and Native Theory in the second, with four cores each on an i7-4790. My ultimate goal is to remove VirtualBox entirely, and run native ATLAS in the first instance. That way, if Native Theory hangs up, I can at least limit the number of cores affected (until bronco comes up with a fix).
ID: 38655
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,257,683
RAC: 9,559
Message 38671 - Posted: 30 Apr 2019, 16:41:57 UTC - in response to Message 38655.  

That way, if Native Theory hangs up, I can at least limit the number of cores affected (until bronco comes up with a fix).

What do you mean by "if Native Theory hangs up"? If you mean the problem where the task runs into the deadline and doesn't stop, the latest version of my watchdog handles that by aborting the task 1 hour before deadline. It would be nice if there were a way to do a graceful shutdown, but native Theory doesn't have that facility. Aborting the task isn't really what most volunteers will regard as a solution, but it's better than just letting the task run until the server cancels it (which the server doesn't seem to be doing ATM).
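
To make the "abort 1 hour before deadline" rule concrete, here is a minimal Python sketch. It is not the actual watchdog: the function names are invented for illustration, and the deadline is assumed to be already available as a timezone-aware datetime. The abort itself is expressed as a `boinccmd --task <project_url> <task_name> abort` command line, which is only built here, not executed.

```python
from datetime import datetime, timedelta, timezone

# Safety margin described in the post: abort one hour before the deadline.
ABORT_MARGIN = timedelta(hours=1)


def should_abort(deadline, now, margin=ABORT_MARGIN):
    """True once the task is inside the safety margin before its deadline."""
    return now >= deadline - margin


def abort_command(project_url, task_name):
    """Argument list for boinccmd's task abort operation."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


if __name__ == "__main__":
    # Hypothetical deadline, for illustration only.
    deadline = datetime(2019, 5, 10, 12, 0, tzinfo=timezone.utc)
    now = datetime.now(timezone.utc)
    if should_abort(deadline, now):
        # The project URL is an assumption; the task name is taken from the log quoted earlier in the thread.
        print(" ".join(abort_command("https://lhcathome.cern.ch/lhcathome/",
                                     "TheoryN_2279-789428-38_0")))
```

Keeping the decision separate from the command makes the rule easy to test without a running BOINC client.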

The only other problem I have noticed with native Theory is tasks ending with the 195 EXIT_CHILD_FAILED error which I don't understand and don't have a way to handle, yet.

The 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED problem doesn't seem to affect native Theory which is fortunate because I see no way for a watchdog running on the user's account to detect the condition.

So I think my watchdog does everything it can possibly do for both native and VBox Theory. Unless somebody has any further suggestions, I believe it's ready for beta test :)
ID: 38671
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 755
Credit: 6,034,572
RAC: 1,118
Message 38676 - Posted: 30 Apr 2019, 19:45:20 UTC - in response to Message 38671.  

The 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED problem doesn't seem to affect native Theory which is fortunate because I see no way for a watchdog running on the user's account to detect the condition.
I have had a native Theory with that condition and reported that Feb 19th at the dev-project.

Extracted:

In BOINC Manager: Aborting task Theory_2279-790023-18_2: exceeded disk limit: 3038.16MB > 1907.35MB
runRivet.log 3184721191 bytes.

Job: [boinc pp jets 8000 250,-,4160 - sherpa 1.2.3 default 31000 18]
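
A quick check of the numbers above, as a small Python sketch. It assumes the "MB" in the BOINC message means bytes divided by 1024*1024, and that the task's disk bound is 2,000,000,000 bytes; the second figure is only inferred from the 1907.35MB in the message, not read from the task's settings.

```python
MIB = 1024 * 1024


def to_mib(n_bytes):
    """Convert bytes to the 'MB' unit the BOINC message appears to use."""
    return n_bytes / MIB


log_bytes = 3_184_721_191            # runRivet.log size reported above
assumed_bound_bytes = 2_000_000_000  # assumption, inferred from 1907.35MB

print(f"runRivet.log: {to_mib(log_bytes):.2f} MB")            # 3037.19 MB
print(f"disk bound:   {to_mib(assumed_bound_bytes):.2f} MB")  # 1907.35 MB
print("log alone exceeds bound:", log_bytes > assumed_bound_bytes)
```

Under those assumptions the log file by itself accounts for about 3037 of the 3038.16 MB reported, so the runaway runRivet.log is essentially the whole overrun.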
ID: 38676
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,257,683
RAC: 9,559
Message 38677 - Posted: 30 Apr 2019, 20:52:59 UTC - in response to Message 38676.  

The 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED problem doesn't seem to affect native Theory which is fortunate because I see no way for a watchdog running on the user's account to detect the condition.
I have had a native Theory with that condition and reported that Feb 19th at the dev-project.

Hmm. Don't know if my watchdog can detect that. Or perhaps I should say I don't know how to make it detect that. For Theory VBox it's easy. The script simply recursively walks the directory rooted at the task's slot dir and sums the sizes of all the files it finds. Running the script as root (or making the user a member of the boinc group) ensures the script has read permission for all pathnames encountered. It encounters < 100 files.

For native Theory it's not so easy. Walking the slot folder causes thousands of no read permission exceptions which of course are trapped and handled in the script. The problem is it finds either:
1) thousands of files and the total of the file sizes is ~10 X <rsc_disk_bound> which triggers task abort
2) just a few files that never total more than 0.01 X <rsc_disk_bound>
Sometimes it just hangs on certain paths, as if it's waiting for a response from the OS's stat function. Sometimes the response comes; sometimes it doesn't, in which case the script hangs forever.

I assume the problem walking the slot dir is because native Theory runs in a runc container owned by user boinc-client. Sometimes the walk recurses into directories that appear to belong to CVMFS, and that seems to be where it throws exceptions or hangs.
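
One possible way to keep such a walk out of CVMFS, sketched in Python. This is not the watchdog's code and the function name is invented. It assumes the CVMFS directories are separate mounts, so they can be recognised by a device number that differs from the slot dir's own, and it counts unreadable paths instead of raising. It does not protect against a stat call that blocks inside the kernel.

```python
import os
import stat


def slot_usage_bytes(slot_dir):
    """Sum regular-file sizes under slot_dir, staying on one filesystem.

    Returns (total_bytes, skipped), where skipped counts unreadable paths.
    """
    root_dev = os.lstat(slot_dir).st_dev
    total = 0
    skipped = 0

    def on_error(err):
        # os.walk calls this when a directory cannot be listed.
        nonlocal skipped
        skipped += 1

    for dirpath, dirnames, filenames in os.walk(slot_dir, onerror=on_error):
        keep = []
        for name in dirnames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                skipped += 1
                continue
            # Descend only into real directories on the same device.
            if stat.S_ISDIR(st.st_mode) and st.st_dev == root_dev:
                keep.append(name)
        dirnames[:] = keep  # prune in place so os.walk skips the rest
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                skipped += 1
                continue
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total, skipped
```

A large `skipped` count would be a hint that the total is an undercount and should not be trusted for an abort decision.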
ID: 38677
computezrmle
Joined: 15 Jun 08
Posts: 1144
Credit: 56,277,770
RAC: 97,712
Message 38678 - Posted: 30 Apr 2019, 21:08:37 UTC - in response to Message 38677.  

Ever heard of "du"?
:-)
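
A minimal sketch of how a script might lean on du, written in Python and assuming GNU coreutils du (the `-s`, `-x` and `-b` options: summary only, stay on one filesystem, apparent size in bytes). The function name and the 30 second timeout are made up for illustration. A du process stuck in an uninterruptible stat may still not die on timeout, so this is a mitigation rather than a guarantee.

```python
import subprocess


def du_bytes(path, timeout_s=30):
    """Return du's byte total for path, or None on timeout or empty output."""
    try:
        result = subprocess.run(
            ["du", "-s", "-x", "-b", path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    # du exits non-zero on permission errors but still prints a total,
    # so the exit status is deliberately not checked here.
    if not result.stdout.strip():
        return None
    return int(result.stdout.split()[0])
```

The result can then be compared against the task's disk bound in the same way as a total from a directory walk.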
ID: 38678
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,257,683
RAC: 9,559
Message 38679 - Posted: 1 May 2019, 1:26:29 UTC - in response to Message 38678.  

Nope. Apparently neither did Bill Gates until he bought Sysinternals.
Thanks for that :-))
ID: 38679

©2019 CERN