21) Message boards : Theory Application : Could not get X509 credentials (Message 39203)
Posted 27 Jun 2019 by bronco
Post:
Please let me know if these problems are still occurring.
There was another one a short time ago:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=234571121

Edit: Now I am seeing this problem on another of my machines, too. So my initial suspicion that something might be wrong with computer ID: 10555784 is definitely wrong.


Or there is something wrong with computer ID: 10555784 and it has now spread to the other machine.
22) Questions and Answers : Windows : Recommended Virtualbox version (Message 39196)
Posted 26 Jun 2019 by bronco
Post:
If I cap the number of cores or processing time in BOINC on 6.0.8, the unmanageable message does not appear. Unfortunately it doesn't seem to calculate at all then. Progress is VERY slow, and Task Manager states CPU usage of about 1 to 10 percent.


How are you measuring progress? The "% complete" in BOINC Manager is not a reliable indicator for these tasks.
ATLAS tasks are very slow to start up. CPU usage will be low for at least 10 minutes after task startup. You'll notice other short periods of low CPU usage as the task progresses.
23) Message boards : Theory Application : New Native Theory Version 1.1 (Message 39138)
Posted 17 Jun 2019 by bronco
Post:
So... watchdog script reconfigured to reject any and all Sherpa and limit task duration to 20 hours. Bad for the science? Maybe I'll care about that when LHC devs care about wasting my CPU time.

I will look into this.
Don't bother. The solution is nigh.

Please continue to post links to tasks and log snippets regarding long-running or looping jobs to Theory's endless looping thread.
What for? Crystal Pellet and a few others have been posting links and log snippets for 10 years, and all that's gotten us is devs turning long runners into forever runners. That's not progress. That's regress.
24) Message boards : Theory Application : New Native Theory Version 1.1 (Message 39133)
Posted 15 Jun 2019 by bronco
Post:

finished after 2 days and 22 Hours - not all are bad:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=231900388

True, not all are bad. The problem, at least for me, is that the forever runners pile up in a big log jam and block the long runners that have a chance at succeeding. That can't be good for the science.

I guess if I were a "serious" cruncher I would have time to manually pick the log jam apart (with assistance from MC Production), but cherry-picking is boring as well as tedious.

If I were a "serious" programmer I guess I would automate the "cherry-picking via MC Production" with some script. Maybe one day.

Depending on how lucky they feel, watchdog users will be able to opt to accept Sherpas but cap task duration at however many days they want, or let them run forever. Maybe having such options available will get the number of TheoryN users up into the 50s; ~35 is abysmal.
25) Message boards : Theory Application : New Native Theory Version 1.1 (Message 39131)
Posted 15 Jun 2019 by bronco
Post:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=230970330
Three days after the server marked the above task, a Sherpa, as "abandoned", it was still running on my host.
It's not the first one, I've had several others in the past month.
Congrats LHC devs, you've taken Sherpa long runners and made them into forever runners.

So... watchdog script reconfigured to reject any and all Sherpa and limit task duration to 20 hours. Bad for the science? Maybe I'll care about that when LHC devs care about wasting my CPU time.
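
For anyone curious what that reconfiguration boils down to, the loop is roughly the sketch below. It is not my actual script: the slot path, the task name and the runRivet.log file name are assumptions about how things are laid out on my hosts, so adjust them for yours.

import os
import subprocess
import time

# Hedged sketch only: reject Sherpa jobs and cap task duration at 20 hours.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"
TASK_NAME   = "TheoryN_placeholder"                  # hypothetical task name
SLOT_DIR    = "/var/lib/boinc-client/slots/0"        # hypothetical slot dir
MAX_RUNTIME = 20 * 3600                              # 20 hour cap, in seconds
started     = time.time()                            # approximation: watchdog start, not task start

def abort_task():
    # boinccmd's "--task <url> <name> abort" is a standard client command
    subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_NAME, "abort"])

while True:
    try:
        with open(os.path.join(SLOT_DIR, "runRivet.log")) as log:   # assumed log location
            if "sherpa" in log.read().lower():
                abort_task()            # rule 1: reject any and all Sherpa
                break
    except OSError:
        pass                            # log not written yet, or no read permission
    if time.time() - started > MAX_RUNTIME:
        abort_task()                    # rule 2: limit task duration to 20 hours
        break
    time.sleep(600)                     # re-check every 10 minutes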
26) Message boards : Number crunching : Setting up a local squid cache for a home cluster - old comments (Message 39121)
Posted 13 Jun 2019 by bronco
Post:
Thanks to everybody who took the time to set this up, especially computezrmie for the example config file and groundwork, and also Purple Hat, Darrell and others for hints and suggestions. It's running nicely here on Lubuntu, serving native ATLAS and Theory to 3 BOINC clients and Firefox.
27) Questions and Answers : Unix/Linux : Setting up cvmfs (necessary for Cern experiments)(Linux)(mac also) Also if you have a boinc Ubuntu VM or docker VM (Have at least 8gb of dynamic ram available & swap if possible Set in the VM) (Message 39016)
Posted 1 Jun 2019 by bronco
Post:
I am getting work units for TheoryN and they are completing.

Congrats!

On a slightly different tack: I am running ATLAS via VirtualBox; would it be better to switch to native Linux now that cvmfs and TheoryN seem to be working?

Yes. The fact that BOINC won't start with systemctl is disconcerting, but if native ATLAS works for you then by all means enjoy the benefits.
28) Message boards : Theory Application : 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come? (Message 39013)
Posted 31 May 2019 by bronco
Post:
Not true. I have had problems while using Windows. VBox even shows the infamous message "can't handle job"...
More:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5015

See forums, plenty of users complaining :)

Meh, just a few complainers who, like you, aren't "serious" about crunching LHC, just serious about complaining.
Again, take a look at the result reports from the many hundreds of users who are returning tasks that were paused/resumed, Linux as well as Windows. You'll see that their tasks validate. Your claim that pause/resume doesn't work is complete BS. If pause/resume doesn't work on your hosts it's because your hosts are misconfigured or you don't follow the necessary procedure.
If you ever decide to get "serious" about it you'll be able to crunch LHC too. Until then you should stick to the easy projects.
29) Message boards : Theory Application : 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come? (Message 38996)
Posted 29 May 2019 by bronco
Post:
VBox is simply "not working" for a normal user (can't stop & resume jobs without errors)

I used to think that too. I was wrong. They stop and resume just fine now that I've fixed my VBox installation. In fact, if you take the time to look through other users' result reports, you'll see (in the stderr text for successful tasks) that their tasks pause/resume several times.

and we all have seen all kind of faulty tasks,

Not really. Just 1 kind... Sherpa... and only a small percentage of those fail.
30) Message boards : Theory Application : Out of BOINC-workunits for Theory Native (Message 38987)
Posted 28 May 2019 by bronco
Post:
Again.
31) Message boards : Theory Application : 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED (Message 38936)
Posted 22 May 2019 by bronco
Post:
I got a number of these too. They occurred on tasks that I had pushed beyond the 18-hour limit. It's interesting that I did NOT get the error on numerous other tasks I pushed well beyond 18 hours before the pentathlon. This started happening around the time of the "adjustments" that occurred to accommodate the pentathlon. I'm guessing a config file got accidentally altered during those "adjustments" and now the <rsc_fpops_bound> is far too low.
32) Message boards : News : 2019 BOINC Pentathlon is over - a big thank you from the SixTrack team! (Message 38920)
Posted 20 May 2019 by bronco
Post:
I can think of only one way to give the BOINC Pentathlon the recognition it deserves... establish it as a regular millennial event to be held in the Spring of the 19th year of every millennium.
33) Message boards : News : BOINC Pentathlon - Sixtrack sprint (Message 38867)
Posted 16 May 2019 by bronco
Post:
It might never happen. It appears hackers have broken into your account and are adding garbage to your posts. Can you see it? Looks like a pic of the mess street people leave under a bridge after a spray paint party. You might wanna change your password or something.
34) Message boards : Theory Application : New Native Theory Version 1.1 (Message 38803)
Posted 13 May 2019 by bronco
Post:
what surprises me: as per info from the Server Status page, Theory Native shows between 16 and 20 crunchers in the "users in past 24 hours" column.

I just assumed that they have tried it, gotten hung up with a long runner, and abandoned it. The remaining people are either willing to babysit their machine, or haven't run into the problem yet.


Native Theory won't gain wider acceptance until it has a mechanism for graceful shutdown. My watchdog script circumvents the "run past deadline" and the "disk limit exceeded" problems, but it does so by aborting the task. That's not user-friendly to volunteers who expect credits. Hopefully the new native Theory allows a graceful shutdown. Or maybe crunchers can add it by putting a <completion_trigger_file> line in one of the config files.
35) Message boards : Theory Application : New Native Theory Version 1.1 (Message 38794)
Posted 13 May 2019 by bronco
Post:
Mon 13 May 2019 02:29:21 AM MDT | LHC@home | Started upload of TheoryN_2279-800506-55_0_r1972795353_result
Mon 13 May 2019 02:29:24 AM MDT | LHC@home | Finished upload of TheoryN_2279-800506-55_0_r1972795353_result
Mon 13 May 2019 02:32:19 AM MDT | LHC@home | Sending scheduler request: To report completed tasks.
Mon 13 May 2019 02:32:19 AM MDT | LHC@home | Reporting 2 completed tasks
Mon 13 May 2019 02:32:19 AM MDT | LHC@home | Requesting new tasks for CPU
Mon 13 May 2019 02:34:21 AM MDT | | Project communication failed: attempting access to reference site
Mon 13 May 2019 02:34:21 AM MDT | LHC@home | Scheduler request failed: Timeout was reached
Mon 13 May 2019 02:34:23 AM MDT | | Internet access OK - project servers may be temporarily down.
Mon 13 May 2019 02:41:41 AM MDT | LHC@home | Sending scheduler request: To report completed tasks.
Mon 13 May 2019 02:41:41 AM MDT | LHC@home | Reporting 2 completed tasks
Mon 13 May 2019 02:41:41 AM MDT | LHC@home | Requesting new tasks for CPU
Mon 13 May 2019 02:41:43 AM MDT | LHC@home | Scheduler request completed: got 0 new tasks
Mon 13 May 2019 02:41:43 AM MDT | LHC@home | Server can't open database
Mon 13 May 2019 02:59:49 AM MDT | LHC@home | Computation for task Theory_3365477_1557644839.539115_0 finished
Mon 13 May 2019 03:41:48 AM MDT | LHC@home | Sending scheduler request: To report completed tasks.
Mon 13 May 2019 03:41:48 AM MDT | LHC@home | Reporting 3 completed tasks
Mon 13 May 2019 03:41:48 AM MDT | LHC@home | Requesting new tasks for CPU
Mon 13 May 2019 03:42:01 AM MDT | LHC@home | Scheduler request completed: got 0 new tasks
Mon 13 May 2019 03:42:01 AM MDT | LHC@home | Server error: feeder not running
Mon 13 May 2019 04:42:05 AM MDT | LHC@home | Sending scheduler request: To report completed tasks.
Mon 13 May 2019 04:42:05 AM MDT | LHC@home | Reporting 3 completed tasks
Mon 13 May 2019 04:42:05 AM MDT | LHC@home | Requesting new tasks for CPU
Mon 13 May 2019 04:42:17 AM MDT | LHC@home | Scheduler request completed: got 0 new tasks
Mon 13 May 2019 04:42:17 AM MDT | LHC@home | Server error: feeder not running
36) Message boards : Theory Application : Theory's endless looping (Message 38694)
Posted 4 May 2019 by bronco
Post:
Again a long runner where the last job failed after just a bit more than 36 h of runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569

OK, I'm convinced Condor imposes a 36.6-hour limit. If suspending the task causes Condor to reset the elapsed time counter, then it would seem a watchdog could extend the job beyond 36.6 hours by:
1) suspending the task when the current job reaches 36 hours
2) waiting until VBox reports (via VBoxManage) that the VM has properly suspended
3) resuming the task

I modified the above 3 steps to something easier to code and observe. The steps are now:
1) suspend the task at integer multiples of 35 hours of elapsed task time (so 35, 70, 105...)
2) the user manually resumes the task

It seems to work. I have:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=222336270 with tet (task elapsed time) 2 days 9 hours crunching a Sherpa 2.2.4 with jet (job elapsed time) 50 hours
https://lhcathome.cern.ch/lhcathome/result.php?resultid=222334188 with tet 2 days 8 hours crunching a Sherpa 1.2.3 with jet 41 hours.
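
For the record, the suspend/resume cycle is roughly the sketch below. It's a simplification of what I actually run: the task name is a placeholder, and it uses a plain wall-clock timer instead of reading the task's elapsed time from the client.

import subprocess
import time

# Rough sketch of the suspend/resume cycle described above, not the real watchdog.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"
TASK_NAME   = "Theory_placeholder"        # hypothetical task name
INTERVAL    = 35 * 3600                   # suspend roughly every 35 hours
PAUSE       = 120                         # give VirtualBox time to save the VM state

def boinccmd_task(op):
    # boinccmd's "--task <url> <name> {suspend|resume|abort}" is a standard client command
    subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_NAME, op], check=True)

while True:
    time.sleep(INTERVAL)                  # crude stand-in for "35 h of elapsed task time"
    boinccmd_task("suspend")              # step 1: suspend the task
    time.sleep(PAUSE)                     # wait for the VM to be properly suspended
    boinccmd_task("resume")               # step 2 was manual in my test; automated here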
37) Message boards : Theory Application : Theory's endless looping (Message 38682)
Posted 1 May 2019 by bronco
Post:
Again a long runner where the last job failed after just a bit more than 36 h of runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=221925569

OK, I'm convinced Condor imposes a 36.6-hour limit. If suspending the task causes Condor to reset the elapsed time counter, then it would seem a watchdog could extend the job beyond 36.6 hours by:
1) suspending the task when the current job reaches 36 hours
2) waiting until VBox reports (via VBoxManage) that the VM has properly suspended
3) resuming the task
38) Message boards : Theory Application : Issues Native Theory application (Message 38679)
Posted 1 May 2019 by bronco
Post:
Nope. Apparently neither did Bill Gates until he bought Sysinternals.
Thanks for that :-))
39) Message boards : Theory Application : Issues Native Theory application (Message 38677)
Posted 30 Apr 2019 by bronco
Post:
The 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED problem doesn't seem to affect native Theory which is fortunate because I see no way for a watchdog running on the user's account to detect the condition.
I have had a native Theory task with that condition and reported it on Feb 19th at the dev project.

Hmm. I don't know if my watchdog can detect that. Or perhaps I should say I don't know how to make it detect that. For Theory VBox it's easy. The script simply walks the directory tree rooted at the task's slot dir recursively and sums the sizes of all the files it finds. Running the script as root (or making the user a member of the boinc group) ensures the script has read permission for all pathnames encountered. It encounters < 100 files.
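
That walk is nothing fancy; a minimal sketch of it is below. The slot path and the disk bound are hard-coded placeholders here (in practice the limit would come from the task's <rsc_disk_bound>).

import os

# Minimal sketch of the slot-dir size check; SLOT_DIR and DISK_BOUND are placeholders.
SLOT_DIR   = "/var/lib/boinc-client/slots/0"   # hypothetical slot dir
DISK_BOUND = 8 * 1024**3                        # hypothetical <rsc_disk_bound>, 8 GB

def slot_dir_usage(root):
    """Recursively walk the slot dir and sum the sizes of every file found."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass   # no permission, vanished file, broken symlink, etc.
    return total

if slot_dir_usage(SLOT_DIR) > DISK_BOUND:
    print("disk bound exceeded - the watchdog would abort the task here")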

For native Theory it's not so easy. Walking the slot folder raises thousands of no-read-permission exceptions, which are of course trapped and handled in the script. The problem is that it finds either:
1) thousands of files whose sizes total ~10 X <rsc_disk_bound>, which triggers a task abort, or
2) just a few files that never total more than 0.01 X <rsc_disk_bound>
Sometimes it just hangs on certain paths as if it's waiting for a response from the OS's stat function. Sometimes the response comes; sometimes it doesn't, in which case the script hangs forever.

I assume the problem walking the slot dir is because native Theory runs in a runc container owned by user boinc-client. Sometimes the walk recurses into directories that appear to belong to CVMFS, and that seems to be where it throws exceptions or hangs.
40) Message boards : Theory Application : Theory's endless looping (Message 38673)
Posted 30 Apr 2019 by bronco
Post:
If there is such an additional watchdog then it appears to be inconsistent.

Not necessarily.
Your example shows a "VM state change" from running to paused and later back to running.
This may have reset the shutdown timer.

I missed that line, and yes, it may have reset the timer if there is one (yours may shut down for some reason other than a Condor-imposed limit).

