Thread 'Tasks run 4 days and finish with error'

Author	Message
NOGOOD Joined: 18 Nov 17 Posts: 134 Credit: 59,038,497 RAC: 3,828	Message 42023 - Posted: 1 Apr 2020, 20:41:29 UTC - in response to Message 42017. I don't think that's going to help; it looks like it will take 6000+ days for these tasks to finish. I saw it. But I also saw successfully running tasks interrupted because of the 100-hour limit. That's why I want to increase it. ID: 42023

CloverField Joined: 17 Oct 06 Posts: 96 Credit: 64,757,866 RAC: 22,200	Message 42028 - Posted: 2 Apr 2020, 21:31:20 UTC - in response to Message 42023. Yeah, that would be good. Is there any way to determine the actual maximum time a task could run for, so that we can set the limit to that and any tasks that would take something like 6000 days would just time out? ID: 42028

maeax Joined: 2 May 07 Posts: 2285 Credit: 178,823,324 RAC: 520	Message 42029 - Posted: 2 Apr 2020, 22:45:13 UTC - in response to Message 42028. This is from the Server-Status page: https://lhcathome.cern.ch/lhcathome/server_status.php Application: Theory Simulation; Unsent: 1583; In progress: 11183; Runtime of the last 100 tasks in hours, average (min - max): 3 (0.01 - 101.07); Users in the last 24 hours: 825 ID: 42029

Erich56 Joined: 18 Dec 15 Posts: 1959 Credit: 158,900,524 RAC: 47,180	Message 42031 - Posted: 3 Apr 2020, 4:52:03 UTC My experience is: whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed. You can abort it immediately; anything else is a waste. ID: 42031

CloverField Joined: 17 Oct 06 Posts: 96 Credit: 64,757,866 RAC: 22,200	Message 42095 - Posted: 8 Apr 2020, 11:56:48 UTC - in response to Message 42031. My experience is: whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed. You can abort it immediately; anything else is a waste. Is there any way we could set up something like a watchdog process that would kill tasks once it starts seeing this output? ID: 42095

Erich56 Joined: 18 Dec 15 Posts: 1959 Credit: 158,900,524 RAC: 47,180	Message 42099 - Posted: 8 Apr 2020, 18:15:34 UTC - in response to Message 42095. Is there any way we could set up something like a watchdog process that would kill tasks once it starts seeing this output? Good question. I, too, would be interested in such a tool :-) ID: 42099
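
For anyone who wants to experiment with the watchdog idea, here is a minimal sketch, not a tested tool. It only reports log lines that match a set of patterns and leaves the decision to abort to you, since a warning alone may not mean the task is doomed. The pattern list and the helper names `find_suspect_lines` and `scan_log` are assumptions made up for illustration; the log file path is whatever your own setup uses.

```python
import re
from typing import Iterable, List, Sequence

# Assumption: these are just the two messages mentioned in this thread.
# Replace them with whatever output you actually want to watch for.
DEFAULT_PATTERNS = (
    r"Poincare\(\): inaccurate rotation",
    r"out of memory",
)


def find_suspect_lines(
    lines: Iterable[str], patterns: Sequence[str] = DEFAULT_PATTERNS
) -> List[str]:
    """Return every line that matches at least one pattern (case-insensitive)."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [
        line.rstrip("\n")
        for line in lines
        if any(c.search(line) for c in compiled)
    ]


def scan_log(path: str, patterns: Sequence[str] = DEFAULT_PATTERNS) -> List[str]:
    """Scan one log file (hypothetical path supplied by the caller)."""
    with open(path, errors="replace") as fh:
        return find_suspect_lines(fh, patterns)


# Example with in-memory lines instead of a real log file:
sample = [
    "Rivet.Analysis: ok\n",
    "Poincare(): inaccurate rotation\n",
    "Error: Out of memory\n",
]
hits = find_suspect_lines(sample)
for hit in hits:
    print("suspect:", hit)
```

Run periodically against a task's log, this would give you a list of candidates to look at by hand. Wiring it up to abort tasks automatically is deliberately left out.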

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 42100 - Posted: 8 Apr 2020, 18:53:31 UTC - in response to Message 42031. My experience is: whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed. Well, my experience is that the majority of tasks for which I've seen a warning like this in the rivet.log go on to finish successfully. The obvious issue in the screenshots is surely the "out of memory" error! ID: 42100

CloverField Joined: 17 Oct 06 Posts: 96 Credit: 64,757,866 RAC: 22,200	Message 42102 - Posted: 8 Apr 2020, 21:49:14 UTC - in response to Message 42100. My experience is: whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed. Well, my experience is that the majority of tasks for which I've seen a warning like this in the rivet.log go on to finish successfully. The obvious issue in the screenshots is surely the "out of memory" error! I agree. Could this be due to the fact that it's running in VirtualBox vs. natively? I have 64 GB of RAM, so I don't mind giving the VMs some more memory if needed. ID: 42102

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 42113 - Posted: 9 Apr 2020, 15:18:09 UTC - in response to Message 42102. Well, my experience is that the majority of tasks for which I've seen a warning like this in the rivet.log go on to finish successfully. For example, task 271157267 threw some of these but appears to have finished correctly. Yay! The obvious issue in the screenshots is surely the "out of memory" error! I agree. Could this be due to the fact that it's running in VirtualBox vs. natively? Possibly, as the Theory VM only has 730 MB, but then the definition of the tasks should take this into account. Also, I don't recall any posts about this, suggesting that it's a pretty rare occurrence (though personally I run the native version, so they may not have registered). I have 64 GB of RAM, so I don't mind giving the VMs some more memory if needed. It depends on what causes the memory usage. Since what happens within the VM should be identical each time, it would be interesting to know what happened to the resubmissions to other users. If high memory use is a consequence of what is being modelled, then the project should make the VM bigger, fixing the problem for everyone. If it's some freak task-specific combination causing, say, a runaway recursion, then providing more memory within the VM just means it runs away for longer before failing anyway. It's not easy to figure out from the volunteer perspective. One of the "charms" of this activity is the length of the chain from those who understand the innards and needs of the code to those of us who host the running of it. :( ID: 42113

NOGOOD Joined: 18 Nov 17 Posts: 134 Credit: 59,038,497 RAC: 3,828	Message 42155 - Posted: 13 Apr 2020, 7:37:03 UTC I have a successful task that ran for 6 d 6 h 16 m 20 s (CPU time: 4 d 5 h 55 m 27 s): https://lhcathome.cern.ch/lhcathome/result.php?resultid=271085743 ID: 42155

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1530 Credit: 10,031,459 RAC: 1,323	Message 42158 - Posted: 13 Apr 2020, 8:57:18 UTC That's good, NOGOOD. It was finally successful, but it restarted from scratch after almost 2 days of run time. ID: 42158

NOGOOD Joined: 18 Nov 17 Posts: 134 Credit: 59,038,497 RAC: 3,828	Message 42162 - Posted: 13 Apr 2020, 18:24:34 UTC - in response to Message 42158. That's good, NOGOOD. It was finally successful, but it restarted from scratch after almost 2 days of run time. Does that mean the project no longer has a 100-hour runtime limit? ID: 42162

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1530 Credit: 10,031,459 RAC: 1,323	Message 42168 - Posted: 14 Apr 2020, 8:01:45 UTC - in response to Message 42162. Does that mean the project no longer has a 100-hour runtime limit? Without manipulation the limit is 100 hours, but because of the restart from scratch the 100 hours started over. See your result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=271085743 2020-04-05 17:07:18 (11100): Status Report: Job Duration: '360000.000000' 2020-04-05 17:07:18 (11100): Status Report: Elapsed Time: '6000.000000' and 2.5 days later 2020-04-08 08:12:08 (1276): Status Report: Job Duration: '360000.000000' 2020-04-08 08:12:08 (1276): Status Report: Elapsed Time: '6000.065889' ID: 42168
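
To make the numbers in those status lines explicit, here is a small sketch that parses the four quoted lines, converts the job duration to hours, and checks whether the elapsed-time counter kept pace with the wall clock. The regular expression is written only against the lines quoted above, and `counter_reset_suspected` with its `tolerance` parameter is a hypothetical heuristic, not anything the project provides. A task that was merely suspended would trip it too, so treat a positive result as a hint rather than proof of a restart.

```python
import re
from datetime import datetime

# Matches the two "Status Report" line types quoted in the post above.
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): Status Report: "
    r"(?P<key>Job Duration|Elapsed Time): '(?P<value>[0-9.]+)'$"
)


def parse_status(lines):
    """Return (timestamp, key, seconds) tuples for every recognised line."""
    records = []
    for line in lines:
        m = STATUS_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            records.append((ts, m.group("key"), float(m.group("value"))))
    return records


def counter_reset_suspected(records, tolerance=0.5):
    """Hypothetical heuristic: True if 'Elapsed Time' advanced by less than
    `tolerance` times the wall-clock gap between its first and last report."""
    elapsed = [(ts, v) for ts, key, v in records if key == "Elapsed Time"]
    if len(elapsed) < 2:
        return False
    wall_gap = (elapsed[-1][0] - elapsed[0][0]).total_seconds()
    counter_gap = elapsed[-1][1] - elapsed[0][1]
    return counter_gap < tolerance * wall_gap


log = [
    "2020-04-05 17:07:18 (11100): Status Report: Job Duration: '360000.000000'",
    "2020-04-05 17:07:18 (11100): Status Report: Elapsed Time: '6000.000000'",
    "2020-04-08 08:12:08 (1276): Status Report: Job Duration: '360000.000000'",
    "2020-04-08 08:12:08 (1276): Status Report: Elapsed Time: '6000.065889'",
]

records = parse_status(log)
limit_hours = records[0][2] / 3600  # 360000 s / 3600 = 100 h
print(limit_hours, counter_reset_suspected(records))
```

The two reports are about 63 hours apart on the wall clock, yet the elapsed counter moved by well under a second, which is what you would expect if the counter started again from the same point.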

CloverField Joined: 17 Oct 06 Posts: 96 Credit: 64,757,866 RAC: 22,200	Message 42196 - Posted: 16 Apr 2020, 12:06:27 UTC - in response to Message 42168. Is there any way we can ask not to be sent the long Sherpa tasks? I've been keeping an eye on my system, and for every good Sherpa task I get 5 bad ones that have to be aborted. ID: 42196

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,413,222 RAC: 31,160	Message 42197 - Posted: 16 Apr 2020, 12:29:41 UTC - in response to Message 42196. You can only check/uncheck Theory Simulation as a whole. If it is checked, you get jobs from the currently active mcplots list: http://mcplots-dev.cern.ch/production.php?view=control The recent revision 2378 lists 70957 job definitions, with only 3% being Sherpas. Longrunners can't be avoided, since each job definition usually creates fewer than 30 attempts and runtimes can't be estimated before a job starts. It appears that there are many more Sherpas than other jobs because other jobs often have very short runtimes, so hosts run many of them before they get a longrunner. ID: 42197
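
The effect described above can be put into rough numbers. The sketch below weights the 3% job share by run time to estimate how much of a host's time ends up in longrunners. The 3% comes from the post; the two runtimes (100 hours for a longrunner, 1 hour for everything else) are purely hypothetical round figures chosen for illustration, as is the helper name `time_share`.

```python
def time_share(job_fraction: float, long_hours: float, short_hours: float) -> float:
    """Fraction of total run time spent on the long jobs, given the fraction
    of jobs that are long and an average runtime for each group."""
    long_time = job_fraction * long_hours
    short_time = (1.0 - job_fraction) * short_hours
    return long_time / (long_time + short_time)


# 3% of job definitions are Sherpas (from the post above).
# Assumption: a longrunner takes 100 h, every other job takes 1 h.
share = time_share(job_fraction=0.03, long_hours=100.0, short_hours=1.0)
print(f"{share:.0%} of run time is spent on the long jobs")
```

With these made-up runtimes, roughly three quarters of the host's time goes to 3% of the jobs, so at any given moment a host is far more likely to be seen running a longrunner than the job counts alone suggest.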

Harri Liljeroos Joined: 28 Sep 04 Posts: 797 Credit: 64,616,967 RAC: 30,139	Message 42198 - Posted: 16 Apr 2020, 13:27:32 UTC - in response to Message 42196. Is there any way we can ask not to be sent the long Sherpa tasks? I've been keeping an eye on my system, and for every good Sherpa task I get 5 bad ones that have to be aborted. Some time ago a script was posted here that would abort all Sherpa tasks if you happened to download one. I just can't find that post now. ID: 42198

CloverField Joined: 17 Oct 06 Posts: 96 Credit: 64,757,866 RAC: 22,200	Message 42199 - Posted: 16 Apr 2020, 13:32:41 UTC - in response to Message 42197. You can only check/uncheck Theory Simulation as a whole. If it is checked, you get jobs from the currently active mcplots list: http://mcplots-dev.cern.ch/production.php?view=control The recent revision 2378 lists 70957 job definitions, with only 3% being Sherpas. Longrunners can't be avoided, since each job definition usually creates fewer than 30 attempts and runtimes can't be estimated before a job starts. It appears that there are many more Sherpas than other jobs because other jobs often have very short runtimes, so hosts run many of them before they get a longrunner. I've only been getting Theory tasks recently. Would this mean I would be doing nothing until other types of tasks come back into the queue? ID: 42199

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 42202 - Posted: 16 Apr 2020, 16:02:19 UTC - in response to Message 42196. Last modified: 16 Apr 2020, 16:05:57 UTC Is there any way we can ask not to be sent the long Sherpa tasks? I did once suggest that Sherpa be hived off into a dedicated sub-project, but there doesn't seem to have been any follow-up to the project's request for comments. I've been keeping an eye on my system, and for every good Sherpa task I get 5 bad ones that have to be aborted. I won't pretend I don't see problems, but the failures are definitely a minority of my Sherpas. (Edit: ah, maybe others aren't allowed to see that link. :( ) ID: 42202

CloverField Joined: 17 Oct 06 Posts: 96 Credit: 64,757,866 RAC: 22,200	Message 42205 - Posted: 16 Apr 2020, 21:11:16 UTC - in response to Message 42202. I won't pretend I don't see problems, but the failures are definitely a minority of my Sherpas. Yeah, I think I worded that poorly. I probably have a bit of a false-positive bias, because the only tasks that I see fail are Sherpas, since they get stuck. I do like the idea of pushing all the Sherpas off into their own little sub-project. ID: 42205

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 911 Credit: 777,817,779 RAC: 165,793	Message 42230 - Posted: 18 Apr 2020, 15:25:31 UTC I had one that ran out of memory. Since all the other projects use plenty of memory, I'd be fine with an increase if WUs need it. ID: 42230