Message boards : Theory Application : Sherpa job hit 18 hrs limit

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,386
RAC: 138,179
Message 30080 - Posted: 27 Apr 2017, 11:00:41 UTC

This sherpa job ran for many hours and was killed as the WU hit the 18 hrs limit.
How can things like that be avoided?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=136727247
2017-04-25 23:20:32 (21421): Guest Log: [INFO] New Job Starting in slot2
2017-04-25 23:20:32 (21421): Guest Log: [INFO] Condor JobID: 2715566.0 in slot2
2017-04-25 23:20:37 (21421): Guest Log: [INFO] MCPlots JobID: 36326515 in slot2

2017-04-26 12:59:04 (21421): Powering off VM.
2017-04-26 12:59:05 (21421): Successfully stopped VM.
ID: 30080
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 30081 - Posted: 27 Apr 2017, 11:48:22 UTC

If you search all messages for "sherpa", you will find two things:

1. Sherpa sometimes runs in an endless loop; see the messages from Crystal Pellet.

2. Sherpa jobs need a lot of runtime, and how much is not easy to estimate in advance.
ID: 30081
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30082 - Posted: 27 Apr 2017, 12:07:27 UTC - in response to Message 30080.  

How can things like that be avoided?

It can be avoided, but it is not useful to micromanage such tasks.

When a Sherpa job, or a job using another type of generator, runs very long and does not report back a result,
a new job is usually created with a smaller number of events. This way a subsequent job has a better chance of finishing on time.

It means that your long running job is not totally worthless.
ID: 30082
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,386
RAC: 138,179
Message 30083 - Posted: 27 Apr 2017, 12:18:08 UTC - in response to Message 30082.  

Thank you.


It means that your long running job is not totally worthless.

I'm glad to read this.
:-)
ID: 30083
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 30615 - Posted: 3 Jun 2017, 8:23:29 UTC

This Sherpa job sat mostly idle for about 15 hours:

2017-06-02 07:35:33 (10084): Guest Log: [INFO] New Job Starting in slot1
2017-06-02 07:35:33 (10084): Guest Log: [INFO] Condor JobID: 3421419.0 in slot1
2017-06-02 07:35:38 (10084): Guest Log: [INFO] MCPlots JobID: 37145519 in slot1

2017-06-02 07:54:54 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 07:54:54 (10084): Status Report: Elapsed Time: '6000.121424'
2017-06-02 07:54:54 (10084): Status Report: CPU Time: '4839.166620'
2017-06-02 09:35:35 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 09:35:35 (10084): Status Report: Elapsed Time: '12000.121424'
2017-06-02 09:35:35 (10084): Status Report: CPU Time: '5371.442032'
2017-06-02 11:16:06 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 11:16:06 (10084): Status Report: Elapsed Time: '18000.121424'
2017-06-02 11:16:06 (10084): Status Report: CPU Time: '5909.739083'
2017-06-02 12:56:45 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 12:56:45 (10084): Status Report: Elapsed Time: '24000.121424'
2017-06-02 12:56:45 (10084): Status Report: CPU Time: '6451.670957'
2017-06-02 14:37:35 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 14:37:35 (10084): Status Report: Elapsed Time: '30000.745533'
2017-06-02 14:37:35 (10084): Status Report: CPU Time: '6959.469812'
2017-06-02 16:18:11 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 16:18:11 (10084): Status Report: Elapsed Time: '36000.745533'
2017-06-02 16:18:11 (10084): Status Report: CPU Time: '7484.943580'
2017-06-02 17:59:02 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 17:59:02 (10084): Status Report: Elapsed Time: '42000.745533'
2017-06-02 17:59:02 (10084): Status Report: CPU Time: '8009.153740'
2017-06-02 19:39:28 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 19:39:28 (10084): Status Report: Elapsed Time: '48000.745533'
2017-06-02 19:39:28 (10084): Status Report: CPU Time: '8555.687644'
2017-06-02 21:20:06 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 21:20:06 (10084): Status Report: Elapsed Time: '54000.745533'
2017-06-02 21:20:06 (10084): Status Report: CPU Time: '9090.396671'
2017-06-02 23:00:44 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 23:00:44 (10084): Status Report: Elapsed Time: '60000.745533'
2017-06-02 23:00:44 (10084): Status Report: CPU Time: '9625.994905'
2017-06-03 00:21:43 (10084): Powering off VM.
2017-06-03 00:26:46 (10084): VM did not power off when requested.
2017-06-03 00:26:46 (10084): VM was successfully terminated.
2017-06-03 00:26:46 (10084): Deregistering VM. (boinc_494018919c91d7cc, slot#0)
2017-06-03 00:26:47 (10084): Removing network bandwidth throttle group from VM.
2017-06-03 00:26:47 (10084): Removing storage controller(s) from VM.
2017-06-03 00:26:47 (10084): Removing VM from VirtualBox.
2017-06-03 00:26:47 (10084): Removing virtual disk drive from VirtualBox.
00:26:53 (10084): called boinc_finish(0)

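The Status Report lines above can be turned into a rough CPU-usage figure per reporting interval. This is only a sketch: it assumes the Elapsed Time and CPU Time values are cumulative seconds, and the function names are made up for illustration. Only the first two reports and the last one are copied in to keep it short.

```python
import re

# Subset of the Status Report lines from the log above
# (timestamps and PID trimmed; first, second and last report only).
LOG = """\
Status Report: Elapsed Time: '6000.121424'
Status Report: CPU Time: '4839.166620'
Status Report: Elapsed Time: '12000.121424'
Status Report: CPU Time: '5371.442032'
Status Report: Elapsed Time: '60000.745533'
Status Report: CPU Time: '9625.994905'
"""


def parse_reports(text):
    """Return a list of (elapsed_seconds, cpu_seconds) pairs, in log order."""
    elapsed = [float(v) for v in re.findall(r"Elapsed Time: '([\d.]+)'", text)]
    cpu = [float(v) for v in re.findall(r"CPU Time: '([\d.]+)'", text)]
    return list(zip(elapsed, cpu))


def interval_usage(reports):
    """CPU usage (0..1) for each interval between consecutive reports.

    The first interval is measured from zero.
    Assumption: both values in a report are cumulative.
    """
    usage = []
    prev_elapsed, prev_cpu = 0.0, 0.0
    for elapsed, cpu in reports:
        usage.append((cpu - prev_cpu) / (elapsed - prev_elapsed))
        prev_elapsed, prev_cpu = elapsed, cpu
    return usage


reports = parse_reports(LOG)
usage = interval_usage(reports)
overall = reports[-1][1] / reports[-1][0]

for (elapsed, _), u in zip(reports, usage):
    print(f"up to {elapsed:8.0f} s: {u:5.1%} CPU")
print(f"overall: {overall:.1%}")
```

On these numbers the first 100 minutes come out at roughly 81% CPU, everything after that at roughly 9%, and the whole run at about 16%.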
ID: 30615
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,386
RAC: 138,179
Message 30726 - Posted: 10 Jun 2017, 15:44:27 UTC

I got a Theory WU this morning that ran 3 jobs successfully.
The 4th job - a Sherpa - has now been running for more than 840 min at 100% CPU usage.
The job output is:
Updating display...
Display update finished (0 histograms, 0 events).

I guess this will continue until the WU hits the 18h limit.
What's wrong with the Sherpa job?

Task-ID: https://lhcathome.cern.ch/lhcathome/result.php?resultid=145000933
Condor JobID: 3342613.0
MCPlots JobID: 37067675
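A job stuck like this could in principle be spotted from its output. The sketch below is only an illustration: it assumes the progress line always has the form quoted above, and the function name and the `min_repeats` threshold are made up. A healthy job may also report 0 events for a while during its start-up phase, so a real check would need a generous threshold.

```python
import re

# Pattern for the progress line quoted in this post.
UPDATE_RE = re.compile(
    r"Display update finished \((\d+) histograms, (\d+) events\)"
)


def looks_stalled(lines, min_repeats=3):
    """Return True if the last `min_repeats` progress lines all report 0 events.

    `min_repeats` is a hypothetical threshold chosen for the example.
    """
    events = []
    for line in lines:
        match = UPDATE_RE.search(line)
        if match:
            events.append(int(match.group(2)))
    recent = events[-min_repeats:]
    return len(recent) == min_repeats and all(n == 0 for n in recent)


# Three display updates in a row without a single event.
sample = [
    "Updating display...",
    "Display update finished (0 histograms, 0 events).",
] * 3

print(looks_stalled(sample))
```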
ID: 30726
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,386
RAC: 138,179
Message 30783 - Posted: 14 Jun 2017, 20:48:02 UTC - in response to Message 30726.  

I got a Theory WU this morning that ran 3 jobs successfully.
The 4th job - a Sherpa - has now been running for more than 840 min at 100% CPU usage.
The job output is:
Updating display...
Display update finished (0 histograms, 0 events).

I guess this will continue until the WU hits the 18h limit.
What's wrong with the Sherpa job?

Task-ID: https://lhcathome.cern.ch/lhcathome/result.php?resultid=145000933
Condor JobID: 3342613.0
MCPlots JobID: 37067675

Got the next Sherpa with the same output.
Is it an output error or a job error?

2017-06-14 21:15:09 (28725): Guest Log: [INFO] Condor JobID: 3574607.0 in slot1
2017-06-14 21:15:14 (28725): Guest Log: [INFO] MCPlots JobID: 36820662 in slot1
ID: 30783
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,386
RAC: 138,179
Message 30784 - Posted: 15 Jun 2017, 8:08:41 UTC - in response to Message 30783.  

That Sherpa is still running (continuously since last evening) and the output is still
"... Display update finished (0 histograms, 0 events). ..." (repeatedly).

Will abort the WU now.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=145728371
ID: 30784
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30791 - Posted: 15 Jun 2017, 13:16:24 UTC - in response to Message 30784.  

That Sherpa is still running (continuously since last evening) and the output is still
"... Display update finished (0 histograms, 0 events). ..." (repeatedly).

That was one for this thread: Theory's endless looping
ID: 30791
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,386
RAC: 138,179
Message 30792 - Posted: 15 Jun 2017, 13:53:51 UTC - in response to Message 30791.  

Thank you.

What is recommended?

Abort WUs like that? [yes|no]
Leave a post in the MB? [yes|no]
ID: 30792
Ben Segal
Volunteer moderator
Project administrator

Joined: 1 Sep 04
Posts: 139
Credit: 2,579
RAC: 0
Message 30793 - Posted: 15 Jun 2017, 15:42:42 UTC - in response to Message 30792.  
Last modified: 15 Jun 2017, 15:43:48 UTC

Thank you.

What is recommended?

Abort WUs like that? [yes|no]

Leave a post in the MB? [yes|no]

Yes and Yes!

This is a problem with the Sherpa application and the Sherpa scientists do look at the Message Boards from time to time.
ID: 30793
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 1
Message 30794 - Posted: 15 Jun 2017, 17:01:48 UTC - in response to Message 30792.  
Last modified: 15 Jun 2017, 17:24:07 UTC

Or, rather than aborting the task, you could collect the faulty job details and then reset the VM in VirtualBox so that it fetches a (hopefully) better job.

Or the more fiddly "End Task Gracefully" option.
ID: 30794
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,386
RAC: 138,179
Message 30796 - Posted: 15 Jun 2017, 18:07:25 UTC - in response to Message 30794.  

Or, rather than aborting the task, you could collect the faulty job details and then reset the VM in VirtualBox so that it fetches a (hopefully) better job.

Or the more fiddly "End Task Gracefully" option.

Continued here.
ID: 30796
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 30798 - Posted: 16 Jun 2017, 8:05:02 UTC - in response to Message 30793.  

This is a problem with the Sherpa application and the Sherpa scientists do look at the Message Boards from time to time.


Are there any statistics showing how long a successful Sherpa job typically runs?
ID: 30798
Rasputin42

Joined: 26 Dec 09
Posts: 10
Credit: 1,192,862
RAC: 0
Message 30860 - Posted: 19 Jun 2017, 12:32:22 UTC
Last modified: 19 Jun 2017, 12:36:20 UTC

I have extended the 18 h limit to 24 h in the Theoryxxx.xml file (84600 sec).

So far, I have had a number of jobs finish after the 18 h limit but before the new 24 h limit.

BTW, there are currently no Theory jobs available, either.
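For anyone checking the numbers, here is a small conversion sketch (the function names are illustrative only). The 18 h default matches the Job Duration of 64800 seconds shown in the status reports earlier in this thread; a full 24 h would be 86400 seconds, so the 84600 seconds quoted above works out to 23.5 h.

```python
def hours_to_seconds(hours):
    """Convert a limit given in hours to whole seconds."""
    return int(hours * 3600)


def seconds_to_hours(seconds):
    """Convert a limit given in seconds to hours."""
    return seconds / 3600


print(hours_to_seconds(18))     # the default limit discussed in this thread
print(hours_to_seconds(24))     # a full 24 h
print(seconds_to_hours(84600))  # the value quoted in this post
```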
ID: 30860
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30880 - Posted: 19 Jun 2017, 16:14:23 UTC - in response to Message 30860.  

I have extended the 18 h limit to 24 h in the Theoryxxx.xml file (84600 sec).

I did the same some time ago for my slow 32-bit ATOM tablet: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10416365
ID: 30880
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 31352 - Posted: 11 Jul 2017, 10:11:23 UTC

Already running for 9 hours, with another 27 hours to go:

===> [runRivet] Mon Jul 10 18:53:10 CEST 2017 [boinc pp uemb-hard 900 - - sherpa 1.3.1 default 100000 972]
.
.
.
Event 21900 ( 7h 34m 19s elapsed / 1d 3h 11s left ) -> ETA: Wed Jul 12 14:57
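The "left" figure in the Event line above is consistent with a plain linear extrapolation, as the sketch below shows. It assumes that the 100000 in the runRivet line is the requested number of events and that every event takes the same time on average; the function name is made up for the example.

```python
from datetime import timedelta


def remaining_time(events_done, events_total, elapsed):
    """Linear extrapolation of the time still needed.

    Assumption: the average time per event stays constant.
    """
    return elapsed * (events_total - events_done) / events_done


elapsed = timedelta(hours=7, minutes=34, seconds=19)

# Assumption: 100000 in the runRivet line is the total number of events.
left = remaining_time(21900, 100000, elapsed)

print(left)
```

This gives about 1 day, 3 hours and 11 seconds, the same as the estimate printed by the job.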
ID: 31352
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 31361 - Posted: 12 Jul 2017, 6:48:49 UTC - in response to Message 31352.  

Already running for 9 hours, with another 27 hours to go:

===> [runRivet] Mon Jul 10 18:53:10 CEST 2017 [boinc pp uemb-hard 900 - - sherpa 1.3.1 default 100000 972]
.
.
.
Event 21900 ( 7h 34m 19s elapsed / 1d 3h 11s left ) -> ETA: Wed Jul 12 14:57

This task was not killed by BOINC's 18-hour duration limit (I had extended it), but by an automatic update and reboot of Win10 (bloody M$).
ID: 31361
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,892,386
RAC: 138,179
Message 31363 - Posted: 12 Jul 2017, 7:17:24 UTC - in response to Message 31361.  

This task was not killed by BOINC's 18hrs duration limit (extended it), but by the automatic update and reboot of Win10 ... .

I guess it was 5 minutes before the ultimate answer to life, the universe and everything was determined?
;-)
ID: 31363
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 39966 - Posted: 19 Sep 2019, 6:35:40 UTC

Thanks to the extended 18-hour limit, this Sherpa job was able to finish:

2019-09-17 19:20:20 (6464): Guest Log: [INFO] ===> [runRivet] Tue Sep 17 19:09:32 CEST 2019 [boinc pp ttbar 7000 - - sherpa 2.2.1 default 65000 108]
.
.
.
2019-09-19 03:22:25 (5452): Guest Log: [INFO] Job finished in slot1 with 0.
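For reference, the wall-clock span between the two log lines above can be worked out as follows. This is a rough sketch: it assumes both timestamps are in the same time zone, and the span may include time during which the task was suspended or restarted (the two lines carry different process IDs).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the two log lines in this post.
start = datetime.strptime("2019-09-17 19:20:20", FMT)
end = datetime.strptime("2019-09-19 03:22:25", FMT)

runtime = end - start
hours = runtime.total_seconds() / 3600

print(f"{hours:.1f} h between first and last log line")
print("longer than the 18 h default:", hours > 18)
```

That is roughly 32 hours, well beyond the default 18-hour limit.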
ID: 39966
