Message boards :
Theory Application :
Tasks run 4 days and finish with error
Joined: 18 Nov 17 · Posts: 128 · Credit: 54,933,976 · RAC: 18,233
> I don't think that's going to help; it looks like it will take 6000+ days for these tasks to finish.

I saw that. But I also saw successfully running tasks interrupted because of the 100-hour limit. That's why I want to increase it.
Joined: 17 Oct 06 · Posts: 79 · Credit: 56,249,522 · RAC: 16,511
Yeah, that would be good. Is there any way to determine the actual maximum time a task could run for? That way we could set the limit to that, and any tasks that would take 6000 days would just time out.
Joined: 2 May 07 · Posts: 2181 · Credit: 172,554,084 · RAC: 49,909
This is from the Server Status page: https://lhcathome.cern.ch/lhcathome/server_status.php

Application | Unsent | In progress | Runtimes of the last 100 tasks in h: average (min - max) | Users in the last 24 hours
Theory Simulation | 1583 | 11183 | 3 (0.01 - 101.07) | 825
Joined: 18 Dec 15 · Posts: 1735 · Credit: 114,124,490 · RAC: 85,354
My experience is: whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed. You can abort it immediately; anything else is a waste of time.
Joined: 17 Oct 06 · Posts: 79 · Credit: 56,249,522 · RAC: 16,511
> whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed.

Is there any way we could set up a watchdog process that would kill tasks once it starts seeing this output?
Joined: 18 Dec 15 · Posts: 1735 · Credit: 114,124,490 · RAC: 85,354
> Is there any way we could set up a watchdog process that would kill tasks once it starts seeing this output?

Good question. I, too, would be interested in such a tool :-)
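No such tool turns up later in the thread, but purely as a sketch of the watchdog idea (not a tested recipe), a periodically run script could scan the BOINC slot logs for the message and report the offending slots. The BOINC data-directory path and the abort procedure via boinccmd are assumptions that would need adapting to the local setup.

```python
#!/usr/bin/env python3
"""Rough sketch of the watchdog idea discussed above: scan BOINC slot logs
for the "Poincare(): inaccurate rotation" message and flag the affected
slots. The data directory and the abort procedure are assumptions."""
import glob
import os

ERROR_MARKER = "Poincare(): inaccurate rotation"

def log_is_doomed(log_text: str) -> bool:
    """True if the log contains the error string this thread discusses."""
    return ERROR_MARKER in log_text

def find_doomed_slots(boinc_dir="/var/lib/boinc-client"):
    """Yield slot directories whose stderr.txt contains the marker."""
    for log_path in glob.glob(os.path.join(boinc_dir, "slots", "*", "stderr.txt")):
        try:
            with open(log_path, errors="replace") as fh:
                if log_is_doomed(fh.read()):
                    yield os.path.dirname(log_path)
        except OSError:
            pass  # a slot can disappear while we scan

if __name__ == "__main__":
    for slot in find_doomed_slots():
        # Actually aborting the task would be done via boinccmd, e.g.
        #   boinccmd --task <project_url> <task_name> abort
        # after mapping the slot back to its task name.
        print("error marker found in", slot)
```

Whether auto-aborting on this string is safe is exactly what the next posts dispute, so a reporting-only first pass seems prudent.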
Joined: 13 Jul 05 · Posts: 169 · Credit: 14,982,010 · RAC: 209
> whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed.

Well, my experience is that the majority of tasks for which I've seen a warning like this in the rivet.log go on to finish successfully. The obvious issue in the screenshots is surely the "out of memory" error!
Joined: 17 Oct 06 · Posts: 79 · Credit: 56,249,522 · RAC: 16,511
> The obvious issue in the screenshots is surely the "out of memory" error!

I agree. Could this be due to the fact it's running in VirtualBox vs natively? I have 64 GB of RAM, so I don't mind giving the VMs some more memory if needed.
Joined: 13 Jul 05 · Posts: 169 · Credit: 14,982,010 · RAC: 209
> Well, my experience is that the majority of tasks for which I've seen a warning like this in the rivet.log go on to finish successfully.

For example, task 271157267 threw some of these but appears to have finished correctly. Yay!

> I agree. Could this be due to the fact it's running in VirtualBox vs natively? I have 64 GB of RAM, so I don't mind giving the VMs some more memory if needed.

Possibly, as the Theory VM only has 730 MB, but then the definition of the tasks should take this into account. Also, I don't recall any posts about this, which suggests it's a pretty rare occurrence (though personally I run the native version, so they may not have registered).

It depends on what causes the memory usage. Since what happens within the VM should be identical each time, it would be interesting to know what happened to the resubmissions to other users. If high memory use is a consequence of what is being modelled, then the project should make the VM bigger, fixing the problem for everyone. If it's some freak task-specific combination causing, say, runaway recursion, then providing more memory within the VM just means it runs away for longer before failing anyway. It's not easy to figure out from the volunteer perspective. One of the "charms" of this activity is the length of the chain from those who understand the innards and needs of the code to those of us who host the running of it. :(
Joined: 18 Nov 17 · Posts: 128 · Credit: 54,933,976 · RAC: 18,233
I have a successful task that has been running for 6 d 6 h 16 m 20 s (CPU time: 4 d 5 h 55 m 27 s): https://lhcathome.cern.ch/lhcathome/result.php?resultid=271085743
Joined: 14 Jan 10 · Posts: 1352 · Credit: 9,086,236 · RAC: 2,738
That's good, NOGOOD. Finally it was successful, but it restarted from scratch after almost 2 days of run time.
Joined: 18 Nov 17 · Posts: 128 · Credit: 54,933,976 · RAC: 18,233
> That's good, NOGOOD.

Does it mean that there is no longer a limit of 100 hours of runtime from the project?
Joined: 14 Jan 10 · Posts: 1352 · Credit: 9,086,236 · RAC: 2,738
> Does it mean that there is no longer a limit of 100 hours of runtime from the project?

Without manipulation the limit is 100 hours, but because of the restart from scratch those 100 hours began anew. See your result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=271085743

2020-04-05 17:07:18 (11100): Status Report: Job Duration: '360000.000000'
2020-04-05 17:07:18 (11100): Status Report: Elapsed Time: '6000.000000'

and 2.5 days later:

2020-04-08 08:12:08 (1276): Status Report: Job Duration: '360000.000000'
2020-04-08 08:12:08 (1276): Status Report: Elapsed Time: '6000.065889'
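For readers decoding those status reports: the times are in seconds, so the job duration matches the 100-hour limit, and the near-identical elapsed times 2.5 days apart are the telltale sign of the restart:

```python
# The BOINC status reports above express times in seconds.
job_duration_s = 360_000        # "Job Duration" from the log
elapsed_day1_s = 6000.000000    # "Elapsed Time" on 2020-04-05
elapsed_day4_s = 6000.065889    # "Elapsed Time" 2.5 days later, on 2020-04-08

print(job_duration_s / 3600)    # 100.0 -> the 100-hour limit
# Elapsed time barely moved over 2.5 wall-clock days, i.e. the counter
# was reset when the task restarted from scratch in between.
print(elapsed_day4_s - elapsed_day1_s)
```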
Joined: 17 Oct 06 · Posts: 79 · Credit: 56,249,522 · RAC: 16,511
Is there any way we can ask not to be sent the long Sherpa tasks? I've been keeping an eye on my system, and for every good Sherpa task I get five bad ones that have to be aborted.
Joined: 15 Jun 08 · Posts: 2473 · Credit: 246,516,005 · RAC: 95,944
You can only check/uncheck Theory Simulation as a whole. If it is checked, you get jobs from the currently active mcplots list: http://mcplots-dev.cern.ch/production.php?view=control The recent revision 2378 lists 70957 job definitions, with only 3% being Sherpas. Longrunners can't be avoided, since each job definition usually creates fewer than 30 attempts and runtimes can't be estimated before a job starts. It only appears that there are many more Sherpas than other jobs because the other jobs often have very short runtimes, so hosts run many of them before they get a longrunner.
Joined: 28 Sep 04 · Posts: 707 · Credit: 47,056,341 · RAC: 31,467
> Is there any way we can ask not to be sent the long Sherpa tasks? I've been keeping an eye on my system, and for every good Sherpa task I get five bad ones that have to be aborted.

Some time ago a script was posted here that would abort all Sherpa tasks if you happened to download one. I just can't find that post now.
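Since the original script couldn't be found, here is a sketch of the idea only: list running tasks with `boinccmd --get_tasks` and abort those whose names contain "sherpa". The project URL and the assumption that Theory task names embed the generator name are mine, not confirmed in this thread; review before trusting it.

```python
#!/usr/bin/env python3
"""Illustration only of the "abort all Sherpas" idea: list tasks with
boinccmd and abort those whose names contain "sherpa". The project URL
and the task-naming convention are assumptions."""
import re
import shutil
import subprocess

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"  # assumed project URL

def sherpa_task_names(boinccmd_output: str):
    """Pick task names containing 'sherpa' out of `boinccmd --get_tasks` output."""
    names = re.findall(r"^\s*name:\s*(\S+)", boinccmd_output, flags=re.MULTILINE)
    return [n for n in names if "sherpa" in n.lower()]

if __name__ == "__main__":
    if shutil.which("boinccmd") is None:
        print("boinccmd not found; nothing to do")
    else:
        out = subprocess.run(["boinccmd", "--get_tasks"],
                             capture_output=True, text=True, check=True).stdout
        for name in sherpa_task_names(out):
            print("aborting", name)
            subprocess.run(["boinccmd", "--task", PROJECT_URL, name, "abort"],
                           check=True)
```

Note the trade-off the next reply points out: Sherpas are only ~3% of job definitions, and most complete fine, so blanket aborting throws away mostly good work.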
Joined: 17 Oct 06 · Posts: 79 · Credit: 56,249,522 · RAC: 16,511
> You can only check/uncheck Theory Simulation as a whole.

I've only been getting Theory tasks recently. Would this mean I would be doing nothing until other types of tasks come back into the queue?
Joined: 13 Jul 05 · Posts: 169 · Credit: 14,982,010 · RAC: 209
> Is there any way we can ask not to be sent the long Sherpa tasks?

I did once suggest that Sherpa be hived off into a dedicated sub-project, but there doesn't seem to have been any follow-up to the project's request for comments.

> I've been keeping an eye on my system, and for every good Sherpa task I get five bad ones that have to be aborted.

I won't pretend I don't see problems, but the failures are definitely a minority of my Sherpas. (Edit: ah, maybe others aren't allowed to see that link. :( )
Joined: 17 Oct 06 · Posts: 79 · Credit: 56,249,522 · RAC: 16,511
> I won't pretend I don't see problems, but the failures are definitely a minority of my Sherpas.

Yeah, I think I worded that poorly. I probably have a bit of a false-positive bias, because the only tasks I see fail are Sherpas, since they get stuck. I do like the idea of pushing all the Sherpas off into their own little sub-project.
Joined: 27 Sep 08 · Posts: 817 · Credit: 682,051,046 · RAC: 139,425
I had one that ran out of memory. Since all the other projects use plenty of memory, I'd be fine with an increase if WUs need it.
©2024 CERN