21) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 41023)
Posted 20 Dec 2019 by Brummig
Post:
This Sherpa has now been finished by another volunteer in half an hour?

I've seen that happen with my VirtualBox long-runners; tasks that I aborted after days or weeks of running were completed in a fraction of the time by another host.
22) Message boards : Theory Application : New version 300.00 (Message 40979)
Posted 16 Dec 2019 by Brummig
Post:
OK, thanks. I just aborted it, as it doesn't seem to be making any obvious progress. I don't have time for this hand-holding of tasks, and long-runners stop my host from running LHC tasks that play nicely and complete in a reasonable time. I'll just go back to aborting tasks if they switch to displaying an impossible ETA in BOINC manager.
23) Message boards : Theory Application : New version 300.00 (Message 40973)
Posted 16 Dec 2019 by Brummig
Post:
How am I supposed to know from the following VM console output if this long-runner is doing anything useful?
0.0918751 pb  +- ( 0.000644023 pb = 0.700976 % ) 130000 ( 477274 -> 28.6 % )
full optimization:  ( 1h 35m 44s elapsed / 2h 23m 36s left ) [09:07:23]
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
0.0920208 pb  +- ( 0.000606447 pb = 0.659032 % ) 140000 ( 511623 -> 29.1 % )
full optimization:  ( 1h 43m 11s elapsed / 2h 16m 21s left ) [09:14:58]
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).

The part that says "2h 16m 21s left" doesn't appear to represent the end point of the entire task, but rather some sub-task, and I've even less idea what the rest of the text means (other than the time in square brackets).
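One rough way to judge progress from output like this is to pull the elapsed and remaining times out of each "full optimization" line and check whether the estimated total stays stable while the elapsed time grows. The Python sketch below does that for the two progress lines quoted above (timestamps omitted). The regular expression is fitted only to those two samples, so other Sherpa versions or phases may print something different, and `to_seconds` and `parse_progress` are illustrative names, not part of Sherpa or BOINC. It says nothing about how many such phases a whole task contains.

```python
import re

# Pattern fitted to the two sample lines in the post (assumption: other
# Sherpa versions or phases may lay the line out differently).
PROGRESS_RE = re.compile(
    r"\(\s*(?P<elapsed>(?:\d+[dhms]\s*)+)elapsed\s*/\s*"
    r"(?P<left>(?:\d+[dhms]\s*)+)left\s*\)"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '1h 35m 44s' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def parse_progress(line):
    """Return (elapsed_s, left_s) for a progress line, or None if no match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return to_seconds(match.group("elapsed")), to_seconds(match.group("left"))


samples = [
    "full optimization:  ( 1h 35m 44s elapsed / 2h 23m 36s left )",
    "full optimization:  ( 1h 43m 11s elapsed / 2h 16m 21s left )",
]
for line in samples:
    elapsed, left = parse_progress(line)
    total = elapsed + left
    print(f"{elapsed / total:.1%} of this phase done, estimated total {total} s")
```

For the two samples the estimated total stays near four hours (14360 s and 14372 s) while the completed share rises from 40.0% to about 43.1%, which suggests that this particular phase was advancing steadily, whatever comes after it.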
24) Message boards : Theory Application : New version 300.00 (Message 40945)
Posted 13 Dec 2019 by Brummig
Post:
OK, thanks, I'll try that. I have a Sherpa task running now whose VM console shows "integration time: (3h 4m 22s elapsed / 1h 5m 18s left)", but BOINC Manager says the remaining time is nearly 4 days. I'll leave it to run and see which matches reality.
25) Message boards : Theory Application : New version 300.00 (Message 40926)
Posted 12 Dec 2019 by Brummig
Post:
I always shut down BOINC gracefully before shutting down the PC, but not when I hibernate the PC (every night). I've not had any problems with Theory tasks until now.

When I ran out of time on the first long-runner, I did take a look with the console. It was Sherpa and it did appear to be hard at work doing something. But it was already well out of time and still had (from memory) well over a day (of CPU time) to complete, which is why I aborted it. That seems to be the only option to avoid the risk of weeks of wasted crunching. As best I can tell from the task report not all those I have aborted were Sherpa, but all of them had an estimated completion time of about four days.
26) Message boards : Theory Application : New version 300.00 (Message 40917)
Posted 12 Dec 2019 by Brummig
Post:
Since the new version I've been getting quite a few (VirtualBox) tasks that start out normally enough, but then (after about a day) suddenly jump to an estimated time to completion of about four days. Given that my PC only crunches during office hours (i.e. how BOINC was intended to be used), then assuming a working day of eight hours that would equate to running for an additional 12 days, i.e. 13 days in total. Add in a couple of weekends (when the PC is off), and the total number of days required to complete the task will be around 17 days. This is much greater than the time allowed to complete the task. That of course means the task gets farmed out to another host, which may or may not beat my host to the finish, depending on how that host is operated. An additional concern is that since the task is behaving oddly, that might indicate that something has gone wrong and the task will eventually fail (or run forever). Consequently I've been aborting these tasks. I aborted the first on 9 December, and I see the host to which it was resent is still chewing on it.
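The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch of the reasoning in the paragraph above: the eight-hour crunching day, the single day already run and the two weekends are the post's own figures, and `working_days_needed` is an illustrative helper, not a BOINC function.

```python
import math


def working_days_needed(cpu_hours_remaining, crunch_hours_per_day):
    """Number of working days needed to supply the remaining CPU hours."""
    return math.ceil(cpu_hours_remaining / crunch_hours_per_day)


remaining_cpu_hours = 4 * 24   # "about four days" shown as time to completion
crunch_hours_per_day = 8       # the PC only crunches during office hours
days_already_run = 1           # the task ran "about a day" before the jump
weekend_days = 2 * 2           # "a couple of weekends" with the PC off

extra_days = working_days_needed(remaining_cpu_hours, crunch_hours_per_day)
crunching_days = days_already_run + extra_days
calendar_days = crunching_days + weekend_days

print(extra_days, crunching_days, calendar_days)  # prints: 12 13 17
```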

Are these tasks broken? If they are doing useful work I'm prepared to let them run, but only if the allowed time is greatly extended (I would suggest a month). However, I much preferred the shorter running tasks from the last version, as there is less risk of weeks of crunching being wasted on a task that fails (for whatever reason). These long-runners have an estimated time to completion that is greater than a typical CPDN task, but at least CPDN uses trickles, ensuring that work isn't completely wasted if a task fails. And CPDN allows plenty of time to complete tasks.
27) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33405)
Posted 16 Dec 2017 by Brummig
Post:
We'll do our best to make that fall over ASAP, I'm sure :)
Thank you for coming out on a Saturday morning with the drain cleaning rods.
It is clear we have reached the limits of the current infrastructure (thank you all for getting us to these limits :) , but early in the new year we will move to different filesystems which will handle this load much better.
28) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33403)
Posted 16 Dec 2017 by Brummig
Post:
I have been patient as requested, but I still can't upload:
16/12/2017 08:29:55 | LHC@home | Started upload of 4FYKDm9M7irnSu7Ccp2YYBZmABFKDmABFKDmStGKDmABFKDm3GgJWn_1_r494246443_ATLAS_result
16/12/2017 08:29:58 | LHC@home | [error] Error reported by file upload server: Server is out of disk space
16/12/2017 08:29:58 | LHC@home | Temporarily failed upload of 4FYKDm9M7irnSu7Ccp2YYBZmABFKDmABFKDmStGKDmABFKDm3GgJWn_1_r494246443_ATLAS_result: transient upload error
16/12/2017 08:29:58 | LHC@home | Backing off 04:46:25 on upload of 4FYKDm9M7irnSu7Ccp2YYBZmABFKDmABFKDmStGKDmABFKDm3GgJWn_1_r494246443_ATLAS_result
29) Message boards : Number crunching : I am sent just ATLAS tasks (Message 33355)
Posted 14 Dec 2017 by Brummig
Post:
I'm currently receiving a CMS VDI file, but I note that something else has happened in the last few days - Atlas has gone awry. So maybe it only appears to be fixed.
30) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33354)
Posted 14 Dec 2017 by Brummig
Post:
I have the same problem. However, what alerted me to there being a problem is that on BoincStats my LHC rank has dropped by a whopping 330 in one day. Sometimes I drop back a few places in any one day, but on average I creep forward each day. Also, I notice everyone around me in the ranking tables has dropped back about 300 places, as have large numbers of users well above my position in the table. I notice too that some users near the top of the table are listed as "new" with a very large number of points, but a tiny RAC and no activity over the past month.
31) Message boards : Number crunching : I am sent just ATLAS tasks (Message 33072)
Posted 17 Nov 2017 by Brummig
Post:
For some time now I too have only had ATLAS tasks. I'm not sure when I stopped getting other LHC tasks, but it was a few months ago.
32) Message boards : ATLAS application : Atlas task running over 45 hours, 100% complete (Message 32850)
Posted 17 Oct 2017 by Brummig
Post:
So far Yeti's suggestion of switching to three-core has done the trick.

I used to use an app_config.xml file, but no matter which calculation I used, no matter whose "pet" app_config.xml I used, it always seemed to be wrong (for me).
33) Message boards : ATLAS application : Atlas task running over 45 hours, 100% complete (Message 32653)
Posted 6 Oct 2017 by Brummig
Post:
Ok, thanks. I'll try 3 core first, as I got really fed up with the endless fiddling about with app_config, and dumped it when I saw I could set the number of cores in the LHC settings.
34) Message boards : ATLAS application : Atlas task running over 45 hours, 100% complete (Message 32647)
Posted 6 Oct 2017 by Brummig
Post:
Actually I do both. On the machine that crunches for LHC, there are 8 CPUs on the processor, and I let BOINC use four of them, running at 50%. This keeps the machine responsive for me, and ensures the fans don't run with excessive noise (I use non-dedicated machines for BOINC, as per the original intention). I let Atlas use two of the processors, and non-Atlas tasks use the other two (or all four if there's no Atlas task). I did try running Atlas with one CPU, but then I had even more tasks that ended in a slow car crash. With Atlas using two processors, fewer Atlas tasks fail this way, but it's only Atlas tasks that are (routinely) failing, and this has only been happening recently.
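The core split described above can be written down as a small Python sketch. It reads "four of them" as a cap of four cores for BOINC and leaves the 50% figure out, since the post does not say whether that is a share of cores or a CPU-time throttle. `split_cores` is an illustrative helper, not a BOINC setting or API.

```python
def split_cores(boinc_cores, atlas_cores, atlas_running):
    """Return (cores for ATLAS, cores for other tasks) under a BOINC core cap."""
    atlas = min(atlas_cores, boinc_cores) if atlas_running else 0
    return atlas, boinc_cores - atlas


print(split_cores(4, 2, True))   # prints: (2, 2)
print(split_cores(4, 2, False))  # prints: (0, 4)
```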
35) Message boards : ATLAS application : Atlas task running over 45 hours, 100% complete (Message 32643)
Posted 6 Oct 2017 by Brummig
Post:
I'm seeing this too. Most tasks complete normally, but a significant number go slower and slower and slower, and (usually) eventually fail. The information revealed by the Properties button indicates they are working, and the VM console confirms this (Alt-F3 shows two athena tasks working away like crazy as expected, and Alt-F2 shows events happening).

I've aborted most of these tasks, but I have let two run to the bitter conclusion:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=157950256
https://lhcathome.cern.ch/lhcathome/result.php?resultid=158351522

The wingman completed these tasks OK, but that doesn't mean there isn't some problem that appears randomly (an improperly initialised pointer, say) that sometimes sends tasks out into the wilderness, bumbling around until they crash. This wastes a terrific amount of CPU time, and it's impossible to see for sure that it has happened. I have recently had one task that ran slower and slower to the point where it had almost stopped, but eventually it completed, and with lots of brownie points.
36) Message boards : Number crunching : troubles with new windows 10 "creators update" (Message 30638)
Posted 5 Jun 2017 by Brummig
Post:
In addition to finding that the "Creators Update" had broken VirtualBox, I also found that the standard Windows time and date controls no longer displayed correctly (making them unusable). I removed the "Creators Update" using the following instructions:

https://betanews.com/2017/04/11/how-to-rollback-and-uninstall-windows-10-creators-update/

and now everything is working again. Removal takes a fraction of the time of installation.
37) Message boards : CMS Application : CMS Simulation 47.60 (Message 30614)
Posted 3 Jun 2017 by Brummig
Post:
Does this answer your question:

http://lhcathome.web.cern.ch/
38) Message boards : CMS Application : CMS Tasks Failing (Message 29914)
Posted 12 Apr 2017 by Brummig
Post:
Well of course the glitch could have been out on the net somewhere, and glitches can be very short.

Why did the task give up so quickly and easily when trying to connect to the server? It's not like it was hard up against the deadline.

(That URL is public, BTW).
39) Message boards : CMS Application : CMS Tasks Failing (Message 29901)
Posted 11 Apr 2017 by Brummig
Post:
More problems connecting to the mother ship, this time on a Theory task:

2017-04-11 13:51:19 (11052): VM Completion Message: Could not connect to lhchomeproxy.cern.ch on port 3125


(https://lhcathome.cern.ch/lhcathome/result.php?resultid=132873626)

Given that that followed 6 hours 11 min 46 sec of CPU work, it would have been nice if it had tried again.

No evidence of a network connectivity problem at my end (i.e. no problems with the Radio Paradise stream).
40) Message boards : ATLAS application : Some Validate errors (Message 29827)
Posted 5 Apr 2017 by Brummig
Post:
OK, thanks, I'll try that. So the recommended value of 1.6 + 1 * ncores is wrong?



