1) Message boards : Theory Application : New Version v300.05 (Message 41412)
Posted 28 Jan 2020 by Brummig
Post:
I've now confirmed Theory doesn't survive an overnight hibernation either, even with "Leave non-GPU tasks in memory while suspended" not selected (I've never had it selected). That explains the tasks that never complete on my host but then get completed in a fraction of the time by another. A task that doesn't complete by the time the host is put into hibernation will restart the following morning, and if it can complete by the end of the working day it does so. But if it can't complete by the end of the working day, it will just run and run, never completing. I've not yet tried suspending VM tasks before hibernating; I've never had to do that with Theory tasks in the past.
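Suspending VM tasks before hibernation can be scripted rather than done by hand in BOINC Manager. A minimal sketch using `boinccmd`, assuming it is on PATH and that Theory task names contain "Theory" (the name filter and helper names are my assumptions, not anything from the project):

```python
# Sketch: suspend running Theory VM tasks before hibernating the host.
# Assumes boinccmd is installed and on PATH; the "Theory" substring
# filter on task names is an assumption about task naming.
import subprocess

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"

def build_suspend_cmd(task_name, project_url=PROJECT_URL):
    """Build the boinccmd invocation that suspends one named task."""
    return ["boinccmd", "--task", project_url, task_name, "suspend"]

def suspend_theory_tasks():
    """Parse `boinccmd --get_tasks` output and suspend Theory tasks."""
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        line = line.strip()
        # Task names appear on "name: ..." lines in --get_tasks output.
        if line.startswith("name:") and "Theory" in line:
            name = line.split("name:", 1)[1].strip()
            subprocess.run(build_suspend_cmd(name), check=True)
```

Running `suspend_theory_tasks()` from a pre-hibernation hook (and a matching `resume` call on wake) would make the suspend explicit instead of relying on the client's own checkpointing.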
2) Message boards : Theory Application : New Version v300.05 (Message 41370)
Posted 27 Jan 2020 by Brummig
Post:
Following resume from hibernation over the weekend, this long-runner briefly continued on to something over 57,000 events, and then it reset itself and started again from zero:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=259725584

It's possible Theory tasks don't survive hibernation over a weekend. However, I also caught it last week throwing errors/warnings:

PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.10978) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.34534) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 6.61948) for g to b
The decay Xi(1690)- -> Sigma- Kbar0 2.10871 500 is too inefficient for the particle 816 Xi(1690)- 13312 [601]

0.935 2.078 25.560 25.718 5 «bs 9
vetoing the decay
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.05218) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 5.63208) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 3.54622) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.06896) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.98784) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.04204) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.83883) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.85025) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 12.6764) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 3.83015) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.55048) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.53167) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.04879) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.41224) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.92092) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.52194) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.07241) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 4.16827) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.85123) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.09399) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.30685) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.31057) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.38341) for g to b
An event exception of type ThePEG::Exception occurred while generating event number 28880:
Remnant extraction failed in ShowerHandler::cascade() from primary interaction
The event will be discarded.
28900 events processed
29000 events processed
dumping histograms...
3) Message boards : Theory Application : New Version v300.05 (Message 41312)
Posted 20 Jan 2020 by Brummig
Post:
@Crystal Pellet:
Yes, I monitored one for some time. There was no evidence of any progress, and after switching back and forth between displaying different information, it settled on saying it had processed zero of zero events. I aborted it, and it went to another host that completed it in a fraction of the time my host had been chewing on it. Curiously, whilst that task ran frantically doing nothing, two other tasks reported zero run time and zero CPU time after being aborted. For example, task 259168280 has a start timestamp of 2020-01-13 15:31:52. I aborted it at 16 Jan 2020, 8:32:03 UTC because it jumped to an extreme estimated completion time, but apparently it did absolutely nothing during the couple of hours it was supposedly running. Task 259230427 was sent 14 Jan 2020, 12:55:32 UTC, and aborted 15 Jan 2020, 8:59:17 UTC. That second task has just this in the stderr output:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
aborted by user</message>
]]>
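The gap between those two timestamps is easy to check; a few lines of Python, using the sent and aborted times quoted above:

```python
# Elapsed wall-clock time for task 259230427 between being sent
# (14 Jan 2020, 12:55:32 UTC) and being aborted (15 Jan 2020, 8:59:17 UTC).
from datetime import datetime, timezone

sent = datetime(2020, 1, 14, 12, 55, 32, tzinfo=timezone.utc)
aborted = datetime(2020, 1, 15, 8, 59, 17, tzinfo=timezone.utc)

elapsed = aborted - sent
print(elapsed)  # 20:03:45
```

So the task sat on the host for about 20 hours of wall-clock time, yet its report shows zero run time and zero CPU time.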
4) Message boards : Theory Application : New Version v300.05 (Message 41293)
Posted 17 Jan 2020 by Brummig
Post:
Well, you can call it what you like, but the fact remains that those tasks that say they will complete in a reasonable time (a few hours), and stay that way, do complete within a few hours of computer run time. Those that suddenly jump from displaying a few hours to four days will run and run. Some people have let them run and run, only to find they fail, but I just abort them because I don't want to waste time and electricity (at my expense) on them. Typically, when resent, the receiving host completes them in a fraction of the time my host spent on them, and I'm not the only person seeing that behaviour. Since this wasn't previously a problem, it strongly suggests a bug has been introduced in the latest Theory tasks.
5) Message boards : Theory Application : New Version v300.05 (Message 41291)
Posted 17 Jan 2020 by Brummig
Post:
No, it's not enough time. One Theory task I have at present says it will take over four days of CPU time. The host it is running on is powered up during work hours, ie around eight hours a day, five days a week (using spare CPU cycles is the intention behind BOINC). So four days of CPU time will take 12 working days plus two weekends, ie 16 days. The deadline is in ten days. Ten is quite a bit less than 16, and I haven't even taken into account doing CPU intensive tasks as part of my work. I don't mind if a task genuinely takes four days of CPU time (like CPDN tasks typically do), but the deadline needs to be suitably distant in the future.
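The 16-day figure follows from simple arithmetic. A sketch, assuming 8-hour working days, a Monday start, and the PC off at weekends (the helper name is mine):

```python
import math

def calendar_days_needed(cpu_hours, hours_per_day=8.0):
    """Calendar days to finish a task, crunching only on weekdays.

    Assumes the run starts on a Monday and the PC is off at weekends.
    """
    working_days = math.ceil(cpu_hours / hours_per_day)
    full_weekends = (working_days - 1) // 5  # weekends spanned by the run
    return working_days + 2 * full_weekends

# Four days of CPU time at eight hours of crunching per day:
print(calendar_days_needed(4 * 24))  # 16 -- well past a 10-day deadline
```

Twelve working days plus the two weekends they span gives 16 calendar days, against a 10-day deadline, before any CPU-intensive work of my own is accounted for.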

I did once leave one of these tasks to run beyond the deadline, but even once past the deadline it was still a couple of days away from completing, so I aborted it. Others on this forum have let these long-runner tasks run and run, only to have them fail. That suggests the solution is to fix the bug, rather than to extend the deadline.
6) Message boards : Theory Application : New Version v300.05 (Message 41289)
Posted 17 Jan 2020 by Brummig
Post:
The problem is they typically don't finish by the deadline, but when passed to another host that host may complete the task quickly, making leaving them to run a waste of host time and electricity. The tasks that say they will complete in a reasonable time do complete in a reasonable time.
7) Message boards : Theory Application : New Version v300.05 (Message 41287)
Posted 17 Jan 2020 by Brummig
Post:
I've just had two of these, and (once again) both tasks reported they required significantly more CPU time than was available before the deadline. I've aborted them. Is this problem going to be addressed (please)?
8) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 41023)
Posted 20 Dec 2019 by Brummig
Post:
"This Sherpa was finished by another volunteer in half an hour?"

I've seen that happen with my VirtualBox long-runners; tasks that I aborted after days or weeks of running were completed in a fraction of the time by another host.
9) Message boards : Theory Application : New version 300.00 (Message 40979)
Posted 16 Dec 2019 by Brummig
Post:
OK, thanks. I just aborted it, as it doesn't seem to be making any obvious progress. I don't have time for this hand-holding of tasks, and long-runners stop my host from running LHC tasks that play nicely and complete in a reasonable time. I'll just go back to aborting tasks if they switch to displaying an impossible ETA in BOINC manager.
10) Message boards : Theory Application : New version 300.00 (Message 40973)
Posted 16 Dec 2019 by Brummig
Post:
How am I supposed to know from the following VM console output if this long-runner is doing anything useful?
0.0918751 pb  +- ( 0.000644023 pb = 0.700976 % ) 130000 ( 477274 -> 28.6 % )
full optimization:  ( 1h 35m 44s elapsed / 2h 23m 36s left ) [09:07:23]
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).
0.0920208 pb  +- ( 0.000606447 pb = 0.659032 % ) 140000 ( 511623 -> 29.1 % )
full optimization:  ( 1h 43m 11s elapsed / 2h 16m 21s left ) [09:14:58]
Updating display.
Display update finished (0 histograms, 0 events).
Updating display.
Display update finished (0 histograms, 0 events).

The part that says "2h 16m 21s left" doesn't appear to represent the end point of the entire task, but rather some sub-task, and I've even less idea what the rest of the text means (other than the time in square brackets).
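For what it's worth, the elapsed/left figures can at least be pulled out of that console text mechanically. A sketch, with the line format taken from the output above (the function and pattern names are mine):

```python
import re

# Matches Sherpa's "( 1h 43m 11s elapsed / 2h 16m 21s left )" progress
# text; the hours field is optional in case a phase is under an hour.
PROGRESS = re.compile(
    r"\(\s*(?:(\d+)h\s+)?(\d+)m\s+(\d+)s elapsed\s*/\s*"
    r"(?:(\d+)h\s+)?(\d+)m\s+(\d+)s left"
)

def parse_progress(line):
    """Return (elapsed_seconds, left_seconds), or None if no match."""
    m = PROGRESS.search(line)
    if not m:
        return None
    h1, m1, s1, h2, m2, s2 = (int(g) if g else 0 for g in m.groups())
    return h1 * 3600 + m1 * 60 + s1, h2 * 3600 + m2 * 60 + s2

line = "full optimization:  ( 1h 43m 11s elapsed / 2h 16m 21s left ) [09:14:58]"
print(parse_progress(line))  # (6191, 8181)
```

That still only tracks the current phase ("full optimization"), not the whole task, which is exactly the problem.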
11) Message boards : Theory Application : New version 300.00 (Message 40945)
Posted 13 Dec 2019 by Brummig
Post:
OK, thanks, I'll try that. I have a Sherpa task running now that says in the VM console integration time: (3h 4m 22s elapsed / 1h 5m 18s left), but BOINC manager says the remaining time is nearly 4 days. I'll leave it to run and see which matches reality.
12) Message boards : Theory Application : New version 300.00 (Message 40926)
Posted 12 Dec 2019 by Brummig
Post:
I always shut down BOINC gracefully before shutting down the PC, but not when I hibernate the PC (every night). I've not had any problems with Theory tasks until now.

When I ran out of time on the first long-runner, I did take a look with the console. It was Sherpa and it did appear to be hard at work doing something. But it was already well out of time and still had (from memory) well over a day (of CPU time) to complete, which is why I aborted it. That seems to be the only option to avoid the risk of weeks of wasted crunching. As best I can tell from the task report not all those I have aborted were Sherpa, but all of them had an estimated completion time of about four days.
13) Message boards : Theory Application : New version 300.00 (Message 40917)
Posted 12 Dec 2019 by Brummig
Post:
Since the new version I've been getting quite a few (VirtualBox) tasks that start out normally enough, but then (after about a day) suddenly jump to an estimated time to completion of about four days. Given that my PC only crunches during office hours (ie how BOINC was intended to be used), then assuming a working day of eight hours that would equate to running for an additional 12 days, ie 13 days in total. Add in a couple of weekends (when the PC is off), and the total number of days required to complete the task will be around 17 days. This is much greater than the time allowed to complete the task. That of course means the task gets farmed out to another host, which may or may not beat my host to the finish, depending on how that host is operated. An additional concern is that since the task is behaving oddly, something may have gone wrong and the task will eventually fail (or run forever). Consequently I've been aborting these tasks. I aborted the first on the 9th December, and I see the host to which it was resent is still chewing on it.

Are these tasks broken? If they are doing useful work I'm prepared to let them run, but only if the allowed time is greatly extended (I would suggest a month). However, I much preferred the shorter running tasks from the last version, as there is less risk of weeks of crunching being wasted on a task that fails (for whatever reason). These long-runners have an estimated time to completion that is greater than a typical CPDN task, but at least CPDN uses trickles, ensuring that work isn't completely wasted if a task fails. And CPDN allows plenty of time to complete tasks.
14) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33405)
Posted 16 Dec 2017 by Brummig
Post:
"It is clear we have reached the limits of the current infrastructure (thank you all for getting us to these limits :) ), but early in the new year we will move to different filesystems which will handle this load much better."

We'll do our best to make that fall over ASAP, I'm sure :)
Thank you for coming out on a Saturday morning with the drain cleaning rods.
15) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33403)
Posted 16 Dec 2017 by Brummig
Post:
I have been patient as requested, but I still can't upload:
16/12/2017 08:29:55 | LHC@home | Started upload of 4FYKDm9M7irnSu7Ccp2YYBZmABFKDmABFKDmStGKDmABFKDm3GgJWn_1_r494246443_ATLAS_result
16/12/2017 08:29:58 | LHC@home | [error] Error reported by file upload server: Server is out of disk space
16/12/2017 08:29:58 | LHC@home | Temporarily failed upload of 4FYKDm9M7irnSu7Ccp2YYBZmABFKDmABFKDmStGKDmABFKDm3GgJWn_1_r494246443_ATLAS_result: transient upload error
16/12/2017 08:29:58 | LHC@home | Backing off 04:46:25 on upload of 4FYKDm9M7irnSu7Ccp2YYBZmABFKDmABFKDmStGKDmABFKDm3GgJWn_1_r494246443_ATLAS_result
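Those backoff intervals add up quickly. A sketch for turning the client-log "Backing off" lines into seconds (the log format is as pasted above; the helper name is mine):

```python
import re

def backoff_seconds(log_line):
    """Extract the hh:mm:ss backoff from a 'Backing off' client-log line."""
    m = re.search(r"Backing off (\d+):(\d{2}):(\d{2})", log_line)
    if not m:
        return None
    h, mins, s = map(int, m.groups())
    return h * 3600 + mins * 60 + s

line = ("16/12/2017 08:29:58 | LHC@home | Backing off 04:46:25 on upload of "
        "4FYKDm9M7irnSu7Ccp2YYBZmABFKDmABFKDmStGKDmABFKDm3GgJWn_1_"
        "r494246443_ATLAS_result")
print(backoff_seconds(line))  # 17185 -- nearly five hours until the next retry
```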
16) Message boards : Number crunching : I am sent just ATLAS tasks (Message 33355)
Posted 14 Dec 2017 by Brummig
Post:
I'm currently receiving a CMS VDI file, but I note that something else has happened in the last few days - ATLAS has gone awry. So maybe it only appears to be fixed.
17) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33354)
Posted 14 Dec 2017 by Brummig
Post:
I have the same problem. However, what alerted me to there being a problem is that on BoincStats my LHC rank has dropped by a whopping 330 in one day. Sometimes I drop back a few places in any one day, but on average I creep forward each day. Also, I notice everyone around me in the ranking tables has dropped back about 300 places, as have large numbers of users well above my position in the table. I notice too that some users near the top of the table are listed as "new" with a very large number of points, but a tiny RAC and no activity over the past month.
18) Message boards : Number crunching : I am sent just ATLAS tasks (Message 33072)
Posted 17 Nov 2017 by Brummig
Post:
For some time now I too have only had ATLAS tasks. I'm not sure when I stopped getting other LHC tasks, but it was a few months ago.
19) Message boards : ATLAS application : Atlas task running over 45 hours, 100% complete (Message 32850)
Posted 17 Oct 2017 by Brummig
Post:
So far Yeti's suggestion of switching to three-core has done the trick.

I used to use an app_config.xml file, but no matter which calculation I used, no matter whose "pet" app_config.xml I used, it always seemed to be wrong (for me).
20) Message boards : ATLAS application : Atlas task running over 45 hours, 100% complete (Message 32653)
Posted 6 Oct 2017 by Brummig
Post:
Ok, thanks. I'll try 3 core first, as I got really fed up with the endless fiddling about with app_config, and dumped it when I saw I could set the number of cores in the LHC settings.

©2020 CERN