Sherpa job hit 18 hrs limit


Message boards : Theory Application : Sherpa job hit 18 hrs limit

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,494,852
RAC: 1,536
Message 30080 - Posted: 27 Apr 2017, 11:00:41 UTC

This Sherpa job ran for many hours and was killed when the WU hit the 18-hour limit.
How can things like that be avoided?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=136727247

2017-04-25 23:20:32 (21421): Guest Log: [INFO] New Job Starting in slot2
2017-04-25 23:20:32 (21421): Guest Log: [INFO] Condor JobID: 2715566.0 in slot2
2017-04-25 23:20:37 (21421): Guest Log: [INFO] MCPlots JobID: 36326515 in slot2

2017-04-26 12:59:04 (21421): Powering off VM.
2017-04-26 12:59:05 (21421): Successfully stopped VM.

maeax
Joined: 2 May 07
Posts: 229
Credit: 11,965,588
RAC: 14,359
Message 30081 - Posted: 27 Apr 2017, 11:48:22 UTC

If you search for sherpa over all messages:

1. Sherpa sometimes runs in a loop; see the messages from Crystal Pellet.

2. Sherpa jobs need a lot of runtime, and that runtime is not easy to estimate in advance.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,989,535
RAC: 1,578
Message 30082 - Posted: 27 Apr 2017, 12:07:27 UTC - in response to Message 30080.

How can things like that be avoided?

It can be avoided, but it is not useful to micromanage such tasks.

When a Sherpa job, or a job using another type of generator, runs for a very long time and does not report back a result,
a new job is usually created with fewer events. This gives the subsequent job a better chance of finishing on time.

It means that your long running job is not totally worthless.

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,494,852
RAC: 1,536
Message 30083 - Posted: 27 Apr 2017, 12:18:08 UTC - in response to Message 30082.

Thank you.


It means that your long running job is not totally worthless.

I'm glad to read this.
:-)

maeax
Joined: 2 May 07
Posts: 229
Credit: 11,965,588
RAC: 14,359
Message 30615 - Posted: 3 Jun 2017, 8:23:29 UTC

This Sherpa job sat idle for about 15 hours:

2017-06-02 07:35:33 (10084): Guest Log: [INFO] New Job Starting in slot1

2017-06-02 07:35:33 (10084): Guest Log: [INFO] Condor JobID: 3421419.0 in slot1

2017-06-02 07:35:38 (10084): Guest Log: [INFO] MCPlots JobID: 37145519 in slot1

2017-06-02 07:54:54 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 07:54:54 (10084): Status Report: Elapsed Time: '6000.121424'
2017-06-02 07:54:54 (10084): Status Report: CPU Time: '4839.166620'
2017-06-02 09:35:35 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 09:35:35 (10084): Status Report: Elapsed Time: '12000.121424'
2017-06-02 09:35:35 (10084): Status Report: CPU Time: '5371.442032'
2017-06-02 11:16:06 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 11:16:06 (10084): Status Report: Elapsed Time: '18000.121424'
2017-06-02 11:16:06 (10084): Status Report: CPU Time: '5909.739083'
2017-06-02 12:56:45 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 12:56:45 (10084): Status Report: Elapsed Time: '24000.121424'
2017-06-02 12:56:45 (10084): Status Report: CPU Time: '6451.670957'
2017-06-02 14:37:35 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 14:37:35 (10084): Status Report: Elapsed Time: '30000.745533'
2017-06-02 14:37:35 (10084): Status Report: CPU Time: '6959.469812'
2017-06-02 16:18:11 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 16:18:11 (10084): Status Report: Elapsed Time: '36000.745533'
2017-06-02 16:18:11 (10084): Status Report: CPU Time: '7484.943580'
2017-06-02 17:59:02 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 17:59:02 (10084): Status Report: Elapsed Time: '42000.745533'
2017-06-02 17:59:02 (10084): Status Report: CPU Time: '8009.153740'
2017-06-02 19:39:28 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 19:39:28 (10084): Status Report: Elapsed Time: '48000.745533'
2017-06-02 19:39:28 (10084): Status Report: CPU Time: '8555.687644'
2017-06-02 21:20:06 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 21:20:06 (10084): Status Report: Elapsed Time: '54000.745533'
2017-06-02 21:20:06 (10084): Status Report: CPU Time: '9090.396671'
2017-06-02 23:00:44 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 23:00:44 (10084): Status Report: Elapsed Time: '60000.745533'
2017-06-02 23:00:44 (10084): Status Report: CPU Time: '9625.994905'
2017-06-03 00:21:43 (10084): Powering off VM.
2017-06-03 00:26:46 (10084): VM did not power off when requested.
2017-06-03 00:26:46 (10084): VM was successfully terminated.
2017-06-03 00:26:46 (10084): Deregistering VM. (boinc_494018919c91d7cc, slot#0)
2017-06-03 00:26:47 (10084): Removing network bandwidth throttle group from VM.
2017-06-03 00:26:47 (10084): Removing storage controller(s) from VM.
2017-06-03 00:26:47 (10084): Removing VM from VirtualBox.
2017-06-03 00:26:47 (10084): Removing virtual disk drive from VirtualBox.
00:26:53 (10084): called boinc_finish(0)
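
The idle time in a log like this can be estimated by comparing the CPU Time and Elapsed Time values of each Status Report. The following is a minimal sketch, assuming the log lines keep the format shown above; the function name `cpu_efficiency` and the regular expression are illustrative and not part of BOINC or vboxwrapper. For the first and last reports in this log it gives roughly 81% and 16% CPU use.

```python
import re

# Matches lines such as:
# 2017-06-02 07:54:54 (10084): Status Report: Elapsed Time: '6000.121424'
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def cpu_efficiency(log_lines):
    """Return a list of (elapsed_s, cpu_s, cpu_s / elapsed_s) per status report.

    Assumes each 'Elapsed Time' line is followed by its 'CPU Time' line,
    as in the log quoted above.
    """
    results = []
    elapsed = None
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue
        key, value = match.group(1), float(match.group(2))
        if key == "Elapsed Time":
            elapsed = value
        elif elapsed is not None:
            results.append((elapsed, value, value / elapsed))
            elapsed = None
    return results


# First and last status reports copied from the log above.
sample = [
    "2017-06-02 07:54:54 (10084): Status Report: Elapsed Time: '6000.121424'",
    "2017-06-02 07:54:54 (10084): Status Report: CPU Time: '4839.166620'",
    "2017-06-02 23:00:44 (10084): Status Report: Elapsed Time: '60000.745533'",
    "2017-06-02 23:00:44 (10084): Status Report: CPU Time: '9625.994905'",
]

for elapsed, cpu, ratio in cpu_efficiency(sample):
    print(f"elapsed {elapsed:8.0f} s  cpu {cpu:8.0f} s  efficiency {ratio:.0%}")
```

A ratio that keeps falling from one report to the next suggests the job is mostly waiting rather than computing.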


computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,494,852
RAC: 1,536
Message 30726 - Posted: 10 Jun 2017, 15:44:27 UTC

I got a Theory WU this morning that ran 3 jobs successfully.
The 4th job - a Sherpa - has now been running for >840 min at 100% CPU usage.
The job output is:

Updating display...
Display update finished (0 histograms, 0 events).

I guess this will continue until the WU hits the 18h limit.
What's wrong with the Sherpa job?

Task-ID: https://lhcathome.cern.ch/lhcathome/result.php?resultid=145000933
Condor JobID: 3342613.0
MCPlots JobID: 37067675
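
A job stuck in this state could be flagged automatically by counting consecutive display updates that report zero events. The sketch below assumes the job output keeps the wording quoted above; the function name `looks_stalled` and the threshold `min_updates` are hypothetical choices, not part of the Theory application. A job that is still initialising may legitimately show 0 events for a while, so the threshold would need tuning.

```python
import re

# Matches the progress line quoted above and captures both counters.
DISPLAY_RE = re.compile(
    r"Display update finished \((\d+) histograms, (\d+) events\)"
)


def looks_stalled(output_lines, min_updates=10):
    """Return True if the last `min_updates` display updates all report 0 events."""
    event_counts = []
    for line in output_lines:
        match = DISPLAY_RE.search(line)
        if match is not None:
            event_counts.append(int(match.group(2)))
    recent = event_counts[-min_updates:]
    # Require a full window of updates before calling the job stalled.
    return len(recent) >= min_updates and all(count == 0 for count in recent)


stuck_output = ["Updating display...",
                "Display update finished (0 histograms, 0 events)."] * 10
print(looks_stalled(stuck_output))
```

Such a check only collects evidence for a report on the message board; whether to abort or reset remains the volunteer's decision.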

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,494,852
RAC: 1,536
Message 30783 - Posted: 14 Jun 2017, 20:48:02 UTC - in response to Message 30726.

I got a Theory WU this morning that ran 3 jobs successfully.
The 4th job - a Sherpa - has now been running for >840 min at 100% CPU usage.
The job output is:
Updating display...
Display update finished (0 histograms, 0 events).

I guess this will continue until the WU hits the 18h limit.
What's wrong with the Sherpa job?

Task-ID: https://lhcathome.cern.ch/lhcathome/result.php?resultid=145000933
Condor JobID: 3342613.0
MCPlots JobID: 37067675

Got the next Sherpa with the same output.
Is it an output error or a job error?

2017-06-14 21:15:09 (28725): Guest Log: [INFO] Condor JobID: 3574607.0 in slot1
2017-06-14 21:15:14 (28725): Guest Log: [INFO] MCPlots JobID: 36820662 in slot1

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,494,852
RAC: 1,536
Message 30784 - Posted: 15 Jun 2017, 8:08:41 UTC - in response to Message 30783.

That Sherpa is still running (continuously since last evening) and the output is still
"... Display update finished (0 histograms, 0 events). ..." (repeatedly).

Will abort the WU now.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=145728371

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,989,535
RAC: 1,578
Message 30791 - Posted: 15 Jun 2017, 13:16:24 UTC - in response to Message 30784.

That Sherpa is still running (continuously since last evening) and the output is still
"... Display update finished (0 histograms, 0 events). ..." (repeatedly).

That was one for this thread: Theory's endless looping

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,494,852
RAC: 1,536
Message 30792 - Posted: 15 Jun 2017, 13:53:51 UTC - in response to Message 30791.

Thank you.

What is recommended?

Abort WUs like that? [yes|no]
Leave a post in the MB? [yes|no]

Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 83
Credit: 2,579
RAC: 0
Message 30793 - Posted: 15 Jun 2017, 15:42:42 UTC - in response to Message 30792.
Last modified: 15 Jun 2017, 15:43:48 UTC

Thank you.

What is recommended?

Abort WUs like that? [yes|no]

Leave a post in the MB? [yes|no]

Yes and Yes!

This is a problem with the Sherpa application and the Sherpa scientists do look at the Message Boards from time to time.

Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 146
Credit: 5,068,673
RAC: 3,664
Message 30794 - Posted: 15 Jun 2017, 17:01:48 UTC - in response to Message 30792.
Last modified: 15 Jun 2017, 17:24:07 UTC

Or, rather than Aborting the task, you could collect the faulty job details then Reset the VM in Vbox so it will go and fetch a, hopefully, better job.

Or the more fiddly "End Task Gracefully" option.

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,494,852
RAC: 1,536
Message 30796 - Posted: 15 Jun 2017, 18:07:25 UTC - in response to Message 30794.

Or, rather than Aborting the task, you could collect the faulty job details then Reset the VM in Vbox so it will go and fetch a, hopefully, better job.

Or the more fiddly "End Task Gracefully" option.

Continued here.

maeax
Joined: 2 May 07
Posts: 229
Credit: 11,965,588
RAC: 14,359
Message 30798 - Posted: 16 Jun 2017, 8:05:02 UTC - in response to Message 30793.

This is a problem with the Sherpa application and the Sherpa scientists do look at the Message Boards from time to time.


Are there statistics showing how long a successful Sherpa job typically runs?

Rasputin42
Joined: 26 Dec 09
Posts: 10
Credit: 615,653
RAC: 34
Message 30860 - Posted: 19 Jun 2017, 12:32:22 UTC
Last modified: 19 Jun 2017, 12:36:20 UTC

I have extended the 18h to a 24h Limit in the Theoryxxx.xml file (84600sec).

So far, I have had a number of jobs finish after the 18h limit but before the new 24h limit.

BTW, there are no Theory jobs currently, either.
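
As a quick check of the numbers involved, the sketch below converts between hours and seconds; the helper names are illustrative only. The default 18-hour limit corresponds to the 64800-second Job Duration that appears in the status reports earlier in this thread, and a full 24 hours would be 86400 seconds. The 84600 seconds quoted above works out to 23.5 hours, which may be a transposed digit or may be the value actually entered; the post does not say.

```python
SECONDS_PER_HOUR = 3600


def hours_to_seconds(hours):
    """Convert a duration in hours to whole seconds."""
    return int(hours * SECONDS_PER_HOUR)


def seconds_to_hours(seconds):
    """Convert a duration in seconds to (possibly fractional) hours."""
    return seconds / SECONDS_PER_HOUR


print(hours_to_seconds(18))      # default limit seen in the logs
print(hours_to_seconds(24))      # a full 24-hour limit
print(seconds_to_hours(84600))   # the value quoted in the post
```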

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,989,535
RAC: 1,578
Message 30880 - Posted: 19 Jun 2017, 16:14:23 UTC - in response to Message 30860.

I have extended the 18h to a 24h Limit in the Theoryxxx.xml file (84600sec).

I did the same some time ago for my slow 32-bit ATOM tablet: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10416365

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,989,535
RAC: 1,578
Message 31352 - Posted: 11 Jul 2017, 10:11:23 UTC

Already running 9 hours and another 27 hours to go:

===> [runRivet] Mon Jul 10 18:53:10 CEST 2017 [boinc pp uemb-hard 900 - - sherpa 1.3.1 default 100000 972]
.
.
.
Event 21900 ( 7h 34m 19s elapsed / 1d 3h 11s left ) -> ETA: Wed Jul 12 14:57
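
The "left" figure in that progress line is consistent with a simple linear extrapolation from the events processed so far. The sketch below reproduces it under the assumption that the 100000 in the runRivet line is the total number of events requested; the function names are illustrative and not taken from Sherpa. With 21900 events done in 7h 34m 19s, the estimate comes to about 97211 seconds, that is 1d 3h 0m 11s, far beyond the default 18-hour limit.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a d/h/m/s duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def remaining_seconds(elapsed_s, events_done, events_total):
    """Linear extrapolation: each remaining event costs the average so far."""
    return elapsed_s * (events_total - events_done) / events_done


elapsed = to_seconds(hours=7, minutes=34, seconds=19)
left = remaining_seconds(elapsed, events_done=21900, events_total=100000)
limit = to_seconds(hours=18)  # default job duration limit discussed in this thread

print(f"estimated time left: {left:.0f} s")
print(f"exceeds 18 h limit: {elapsed + left > limit}")
```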

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,989,535
RAC: 1,578
Message 31361 - Posted: 12 Jul 2017, 6:48:49 UTC - in response to Message 31352.

Already running 9 hours and another 27 hours to go:

===> [runRivet] Mon Jul 10 18:53:10 CEST 2017 [boinc pp uemb-hard 900 - - sherpa 1.3.1 default 100000 972]
.
.
.
Event 21900 ( 7h 34m 19s elapsed / 1d 3h 11s left ) -> ETA: Wed Jul 12 14:57

This task was not killed by BOINC's 18-hour duration limit (I had extended it), but by the automatic update and reboot of Win10 (bloody M$).

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,494,852
RAC: 1,536
Message 31363 - Posted: 12 Jul 2017, 7:17:24 UTC - in response to Message 31361.

This task was not killed by BOINC's 18hrs duration limit (extended it), but by the automatic update and reboot of Win10 ... .

I guess it was 5 minutes before the ultimate answer to life, universe and everything was determined?
;-)
