Message boards : Theory Application : Tasks run 4 days and finish with error
NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,837,543
RAC: 22,732
Message 42023 - Posted: 1 Apr 2020, 20:41:29 UTC - in response to Message 42017.  

I don't think that's going to help; it looks like it will take 6000+ days for these tasks to finish.


I saw that. But I have also seen tasks that were running fine get interrupted by the 100-hour limit. That's why I want to increase it.
ID: 42023
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 52,159,290
RAC: 31,224
Message 42028 - Posted: 2 Apr 2020, 21:31:20 UTC - in response to Message 42023.  

Yeah, that would be good. Is there any way to determine the actual maximum time a task could run for? That way we could set the limit to that, and any tasks that would take something like 6000 days would just time out.
ID: 42028
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,856,517
RAC: 126,388
Message 42029 - Posted: 2 Apr 2020, 22:45:13 UTC - in response to Message 42028.  

This is from the Server-Status page:
https://lhcathome.cern.ch/lhcathome/server_status.php
Application: Theory Simulation
Unsent: 1583
In progress: 11183
Runtime of the last 100 tasks in hours, average (min - max): 3 (0.01 - 101.07)
Users in the last 24 hours: 825
ID: 42029
Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 103,091,780
RAC: 127,129
Message 42031 - Posted: 3 Apr 2020, 4:52:03 UTC

My experience is:
whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed. You can abort it immediately; anything else is a waste.
ID: 42031
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 52,159,290
RAC: 31,224
Message 42095 - Posted: 8 Apr 2020, 11:56:48 UTC - in response to Message 42031.  

My experience is:
whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed. You can abort it immediately; anything else is a waste.


Is there any way we could set up something like a watchdog process that would kill tasks once it starts seeing this output?
ID: 42095
Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 103,091,780
RAC: 127,129
Message 42099 - Posted: 8 Apr 2020, 18:15:34 UTC - in response to Message 42095.  

Is there any way we could set up something like a watchdog process that would kill tasks once it starts seeing this output?
Good question. I, too, would be interested in such a tool :-)
ID: 42099
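A minimal sketch of the detection half of such a watchdog, in Python. It only scans log lines for suspect messages and builds (without running) an abort command. The pattern list is an assumption based on the messages quoted in this thread, and the function names are hypothetical. The command assumes the BOINC command-line tool boinccmd is installed and uses its --task operation. Whether a given message really justifies an abort is for the operator to decide.

```python
import re
from typing import Iterable, List, Optional

# Patterns taken from messages quoted in this thread (an assumption, not a
# project-endorsed list). Edit this list to match what you consider fatal.
SUSPECT_PATTERNS = [
    re.compile(r"out of memory", re.IGNORECASE),
    re.compile(r"Poincare\(\): inaccurate rotation"),
]


def first_suspect_line(lines: Iterable[str], patterns=SUSPECT_PATTERNS) -> Optional[str]:
    """Return the first line that matches any pattern, or None if none does."""
    for line in lines:
        if any(p.search(line) for p in patterns):
            return line.rstrip("\n")
    return None


def abort_command(project_url: str, task_name: str) -> List[str]:
    """Build, but do not execute, a boinccmd call that aborts one task."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


# Example with made-up log lines and a hypothetical task name.
sample_log = ["event 100 done", "Poincare(): inaccurate rotation", "event 200 done"]
hit = first_suspect_line(sample_log)
if hit is not None:
    print("suspect line:", hit)
    print("would run:", " ".join(abort_command("https://lhcathome.cern.ch/lhcathome/", "hypothetical_task_name")))
```

Keeping detection separate from the abort makes it easy to run the check in report-only mode first and see how often the patterns fire on tasks that later finish fine.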
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,938,551
RAC: 191
Message 42100 - Posted: 8 Apr 2020, 18:53:31 UTC - in response to Message 42031.  

My experience is:
whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed.


Well, my experience is that the majority of tasks for which I've seen a warning like this in the rivet.log go on to finish successfully.

The obvious issue in the screenshots is surely the "out of memory" error!
ID: 42100
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 52,159,290
RAC: 31,224
Message 42102 - Posted: 8 Apr 2020, 21:49:14 UTC - in response to Message 42100.  

My experience is:
whenever you read "Poincare(): inaccurate rotation" in the console, the task won't succeed.


Well, my experience is that the majority of tasks for which I've seen a warning like this in the rivet.log go on to finish successfully.

The obvious issue in the screenshots is surely the "out of memory" error!


I agree. Could this be due to the fact that it's running in VirtualBox rather than natively?
I have 64 GB of RAM, so I don't mind giving the VMs some more memory if needed.
ID: 42102
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,938,551
RAC: 191
Message 42113 - Posted: 9 Apr 2020, 15:18:09 UTC - in response to Message 42102.  

Well, my experience is that the majority of tasks for which I've seen a warning like this in the rivet.log go on to finish successfully.
For example, task 271157267 threw some of these but appears to have finished correctly. Yay!

The obvious issue in the screenshots is surely the "out of memory" error!
I agree. Could this be due to the fact that it's running in VirtualBox rather than natively?
Possibly, as the Theory VM only has 730 MB, but then the definition of the tasks should take this into account. Also, I don't recall any posts about this, suggesting that it's a pretty rare occurrence (though personally I run the native version, so they may not have registered).

I have 64 GB of RAM, so I don't mind giving the VMs some more memory if needed.
It depends on what causes the memory usage. Since what happens within the VM should be identical each time, it would be interesting to know what happened to the resubmissions to other users.
If high memory use is a consequence of what is being modelled then the project should make the VM bigger, fixing the problem for everyone. If it's some freak task-specific combination causing say some runaway recursion thing, then providing more memory within the VM just means it runs away for longer before failing anyway. It's not easy to figure out from the volunteer perspective.
One of the "charms" of this activity is the length of the chain from those that understand the innards and needs of the code to those of us that host the running of it. :(
ID: 42113
NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,837,543
RAC: 22,732
Message 42155 - Posted: 13 Apr 2020, 7:37:03 UTC

I have a successful task that ran for 6 d 6 h 16 m 20 s
CPU time - 4 d 5 h 55 m 27 s
https://lhcathome.cern.ch/lhcathome/result.php?resultid=271085743
ID: 42155
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,147
RAC: 2,155
Message 42158 - Posted: 13 Apr 2020, 8:57:18 UTC

That's good NOGOOD,

It was successful in the end, but it restarted from scratch after almost 2 days of run time.
ID: 42158
NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,837,543
RAC: 22,732
Message 42162 - Posted: 13 Apr 2020, 18:24:34 UTC - in response to Message 42158.  

That's good NOGOOD,

It was successful in the end, but it restarted from scratch after almost 2 days of run time.


Does that mean the project no longer has a 100-hour runtime limit?
ID: 42162
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1274
Credit: 8,480,147
RAC: 2,155
Message 42168 - Posted: 14 Apr 2020, 8:01:45 UTC - in response to Message 42162.  

Does that mean the project no longer has a 100-hour runtime limit?
Without manipulation the limit is 100 hours, but because the task restarted from scratch, the 100 hours started over.
See your result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=271085743
2020-04-05 17:07:18 (11100): Status Report: Job Duration: '360000.000000'
2020-04-05 17:07:18 (11100): Status Report: Elapsed Time: '6000.000000'
and 2.5 days later
2020-04-08 08:12:08 (1276): Status Report: Job Duration: '360000.000000'
2020-04-08 08:12:08 (1276): Status Report: Elapsed Time: '6000.065889'
ID: 42168
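To make the arithmetic explicit: the "Job Duration" value in the status report is in seconds, so 360000 s corresponds to the 100-hour limit. The sketch below, in Python, parses the two status lines quoted above and converts them to hours. The regular expression and the function name are assumptions written for this illustration, not part of any BOINC or vboxwrapper tool.

```python
import re

# Matches the two status lines quoted in the post above.
STATUS_RE = re.compile(r"Status Report: (Job Duration|Elapsed Time): '([0-9.]+)'")


def parse_status(lines):
    """Return the latest 'Job Duration' and 'Elapsed Time' values, in seconds."""
    values = {}
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values


log = [
    "2020-04-08 08:12:08 (1276): Status Report: Job Duration: '360000.000000'",
    "2020-04-08 08:12:08 (1276): Status Report: Elapsed Time: '6000.065889'",
]
status = parse_status(log)
limit_hours = status["Job Duration"] / 3600
remaining_hours = (status["Job Duration"] - status["Elapsed Time"]) / 3600
print(f"limit: {limit_hours:.0f} h, remaining: {remaining_hours:.2f} h")
```

Because the elapsed time was back at about 6000 s on 8 April, the task had almost the full 100 hours available again after the restart.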
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 52,159,290
RAC: 31,224
Message 42196 - Posted: 16 Apr 2020, 12:06:27 UTC - in response to Message 42168.  

Is there any way we can ask not to be sent the long sherpa tasks? I've been keeping an eye on my system, and for every good sherpa task I get 5 bad ones that have to be aborted.
ID: 42196
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2401
Credit: 225,505,811
RAC: 125,141
Message 42197 - Posted: 16 Apr 2020, 12:29:41 UTC - in response to Message 42196.  

You can only check/uncheck Theory Simulation as a whole.
If it is checked you get jobs from the currently active mcplots list:
http://mcplots-dev.cern.ch/production.php?view=control

The recent revision 2378 lists 70957 job definitions, of which only 3% are sherpas.
Longrunners can't be avoided, since each job definition usually creates fewer than 30 attempts and runtimes can't be estimated before a job starts.

It appears that there are many more sherpas than other jobs because other jobs often have very short runtimes, so hosts run many of them before they get a longrunner.
ID: 42197
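A small worked example of that last point, in Python. Only the 3% share comes from the post above. The mean runtimes are purely illustrative assumptions, and the sketch also assumes that the jobs a host receives follow the same 3% share as the job definitions.

```python
def sherpa_time_share(job_share: float, sherpa_hours: float, other_hours: float) -> float:
    """Fraction of total run time that a host spends on sherpa jobs."""
    sherpa_time = job_share * sherpa_hours
    other_time = (1.0 - job_share) * other_hours
    return sherpa_time / (sherpa_time + other_time)


# 3% is the figure quoted above; 30 h and 1 h are made-up mean runtimes.
share = sherpa_time_share(job_share=0.03, sherpa_hours=30.0, other_hours=1.0)
print(f"sherpas: 3% of jobs, {share:.0%} of run time")
```

With these illustrative numbers a host would spend about 48% of its run time on sherpas, so a glance at the task list at a random moment shows a sherpa far more often than the 3% job share suggests.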
Harri Liljeroos

Joined: 28 Sep 04
Posts: 675
Credit: 43,539,635
RAC: 15,635
Message 42198 - Posted: 16 Apr 2020, 13:27:32 UTC - in response to Message 42196.  

Is there any way we can ask not to be sent the long sherpa tasks? I've been keeping an eye on my system, and for every good sherpa task I get 5 bad ones that have to be aborted.

Some time ago a script was posted here that would abort all sherpa tasks if you happened to download one. I just can't find that post now.
ID: 42198
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 52,159,290
RAC: 31,224
Message 42199 - Posted: 16 Apr 2020, 13:32:41 UTC - in response to Message 42197.  

You can only check/uncheck Theory Simulation as a whole.
If it is checked you get jobs from the currently active mcplots list:
http://mcplots-dev.cern.ch/production.php?view=control

The recent revision 2378 lists 70957 job definitions, of which only 3% are sherpas.
Longrunners can't be avoided, since each job definition usually creates fewer than 30 attempts and runtimes can't be estimated before a job starts.

It appears that there are many more sherpas than other jobs because other jobs often have very short runtimes, so hosts run many of them before they get a longrunner.


I've only been getting Theory tasks recently.
Would this mean I would be doing nothing until other types of tasks come back into the queue?
ID: 42199
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,938,551
RAC: 191
Message 42202 - Posted: 16 Apr 2020, 16:02:19 UTC - in response to Message 42196.  
Last modified: 16 Apr 2020, 16:05:57 UTC

Is there any way we can ask not to be sent the long sherpa tasks?
I did once suggest that Sherpa be hived off into a dedicated sub-project, but there doesn't seem to have been any follow-up to the project's request for comments.

I've been keeping an eye on my system, and for every good sherpa task I get 5 bad ones that have to be aborted.
I won't pretend I don't see problems, but the failures are definitely a minority of my Sherpas. (Edit: ah, maybe others aren't allowed to see that link. :( )
ID: 42202
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 52,159,290
RAC: 31,224
Message 42205 - Posted: 16 Apr 2020, 21:11:16 UTC - in response to Message 42202.  

I won't pretend I don't see problems, but the failures are definitely a minority of my Sherpas.


Yeah, I think I worded that poorly. I probably have a bit of false-positive bias, because the only tasks that I see fail are sherpas, since they get stuck.

I do like the idea of pushing all the sherpas off into their own little sub-project.
ID: 42205
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 804
Credit: 650,095,539
RAC: 240,792
Message 42230 - Posted: 18 Apr 2020, 15:25:31 UTC

I had one that ran out of memory. Since all the other projects use plenty of memory, I'd be fine with an increase if WUs need it.
ID: 42230
©2024 CERN