Thread 'Sherpa job hit 18 hrs limit'

Author	Message
computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 304,103,038 RAC: 113,237	Message 30080 - Posted: 27 Apr 2017, 11:00:41 UTC This sherpa job ran for many hours and was killed as the WU hit the 18 hrs limit. How can things like that be avoided? https://lhcathome.cern.ch/lhcathome/result.php?resultid=136727247 2017-04-25 23:20:32 (21421): Guest Log: [INFO] New Job Starting in slot2 2017-04-25 23:20:32 (21421): Guest Log: [INFO] Condor JobID: 2715566.0 in slot2 2017-04-25 23:20:37 (21421): Guest Log: [INFO] MCPlots JobID: 36326515 in slot2 2017-04-26 12:59:04 (21421): Powering off VM. 2017-04-26 12:59:05 (21421): Successfully stopped VM. ID: 30080 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 30081 - Posted: 27 Apr 2017, 11:48:22 UTC If you search all messages for sherpa: 1. Sherpa sometimes runs in a loop; see the messages from Crystal Pellet. 2. Sherpa jobs need a lot of runtime, which is not easy to calculate. ID: 30081 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 30082 - Posted: 27 Apr 2017, 12:07:27 UTC - in response to Message 30080. How can things like that be avoided? It can be avoided, but it is not useful to micromanage such tasks. When a Sherpa job, or a job using another type of generator, runs very long and does not report back a result, a new job is usually created with fewer events. That way a subsequent job has a better chance of finishing on time. It means that your long running job is not totally worthless. ID: 30082 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 304,103,038 RAC: 113,237	Message 30083 - Posted: 27 Apr 2017, 12:18:08 UTC - in response to Message 30082. Thank you. It means that your long running job is not totally worthless. I'm glad to read this. :-) ID: 30083 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 30615 - Posted: 3 Jun 2017, 8:23:29 UTC This Sherpa job sat idle for 15 hours:
2017-06-02 07:35:33 (10084): Guest Log: [INFO] New Job Starting in slot1
2017-06-02 07:35:33 (10084): Guest Log: [INFO] Condor JobID: 3421419.0 in slot1
2017-06-02 07:35:38 (10084): Guest Log: [INFO] MCPlots JobID: 37145519 in slot1
2017-06-02 07:54:54 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 07:54:54 (10084): Status Report: Elapsed Time: '6000.121424'
2017-06-02 07:54:54 (10084): Status Report: CPU Time: '4839.166620'
2017-06-02 09:35:35 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 09:35:35 (10084): Status Report: Elapsed Time: '12000.121424'
2017-06-02 09:35:35 (10084): Status Report: CPU Time: '5371.442032'
2017-06-02 11:16:06 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 11:16:06 (10084): Status Report: Elapsed Time: '18000.121424'
2017-06-02 11:16:06 (10084): Status Report: CPU Time: '5909.739083'
2017-06-02 12:56:45 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 12:56:45 (10084): Status Report: Elapsed Time: '24000.121424'
2017-06-02 12:56:45 (10084): Status Report: CPU Time: '6451.670957'
2017-06-02 14:37:35 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 14:37:35 (10084): Status Report: Elapsed Time: '30000.745533'
2017-06-02 14:37:35 (10084): Status Report: CPU Time: '6959.469812'
2017-06-02 16:18:11 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 16:18:11 (10084): Status Report: Elapsed Time: '36000.745533'
2017-06-02 16:18:11 (10084): Status Report: CPU Time: '7484.943580'
2017-06-02 17:59:02 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 17:59:02 (10084): Status Report: Elapsed Time: '42000.745533'
2017-06-02 17:59:02 (10084): Status Report: CPU Time: '8009.153740'
2017-06-02 19:39:28 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 19:39:28 (10084): Status Report: Elapsed Time: '48000.745533'
2017-06-02 19:39:28 (10084): Status Report: CPU Time: '8555.687644'
2017-06-02 21:20:06 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 21:20:06 (10084): Status Report: Elapsed Time: '54000.745533'
2017-06-02 21:20:06 (10084): Status Report: CPU Time: '9090.396671'
2017-06-02 23:00:44 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 23:00:44 (10084): Status Report: Elapsed Time: '60000.745533'
2017-06-02 23:00:44 (10084): Status Report: CPU Time: '9625.994905'
2017-06-03 00:21:43 (10084): Powering off VM.
2017-06-03 00:26:46 (10084): VM did not power off when requested.
2017-06-03 00:26:46 (10084): VM was successfully terminated.
2017-06-03 00:26:46 (10084): Deregistering VM. (boinc_494018919c91d7cc, slot#0)
2017-06-03 00:26:47 (10084): Removing network bandwidth throttle group from VM.
2017-06-03 00:26:47 (10084): Removing storage controller(s) from VM.
2017-06-03 00:26:47 (10084): Removing VM from VirtualBox.
2017-06-03 00:26:47 (10084): Removing virtual disk drive from VirtualBox.
00:26:53 (10084): called boinc_finish(0)
ID: 30615 · Reply Quote
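
The status reports in the post above allow the idle time to be put into numbers: in the last report, CPU time is only about 16% of elapsed time. The following is a minimal Python sketch of that calculation, assuming the log lines keep the `Status Report: <name>: '<seconds>'` shape shown in the post. The function name `cpu_utilization` is made up for illustration and is not part of BOINC or vboxwrapper.

```python
import re

# Matches the two status lines of interest, as they appear in the log above.
STATUS = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")

def cpu_utilization(log_text):
    """Return (elapsed_s, cpu_s, cpu/elapsed) taken from the last status report."""
    values = {}
    for name, number in STATUS.findall(log_text):
        values[name] = float(number)  # later reports overwrite earlier ones
    elapsed = values["Elapsed Time"]
    cpu = values["CPU Time"]
    return elapsed, cpu, cpu / elapsed

# Last status report from the log in the post above.
sample = """
2017-06-02 23:00:44 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 23:00:44 (10084): Status Report: Elapsed Time: '60000.745533'
2017-06-02 23:00:44 (10084): Status Report: CPU Time: '9625.994905'
"""

elapsed, cpu, ratio = cpu_utilization(sample)
print(f"elapsed {elapsed:.0f} s, cpu {cpu:.0f} s, utilization {ratio:.1%}")
```

A ratio this low suggests the job was mostly waiting rather than computing, which is a different failure pattern from the 100% CPU loop reported later in the thread.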

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 304,103,038 RAC: 113,237	Message 30726 - Posted: 10 Jun 2017, 15:44:27 UTC I got a Theory WU this morning that ran 3 jobs successfully. The 4th job - a Sherpa - is now running for >840 min at 100% CPU usage. The job output is: Updating display... Display update finished (0 histograms, 0 events). I guess this will continue until the WU hits the 18h limit. What's wrong with the Sherpa job? Task-ID: https://lhcathome.cern.ch/lhcathome/result.php?resultid=145000933 Condor JobID: 3342613.0 MCPlots JobID: 37067675 ID: 30726 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 304,103,038 RAC: 113,237	Message 30783 - Posted: 14 Jun 2017, 20:48:02 UTC - in response to Message 30726. I got a Theory WU this morning that ran 3 jobs successfully. The 4th job - a Sherpa - is now running for >840 min at 100% CPU usage. The job output is: Updating display... Display update finished (0 histograms, 0 events). I guess this will continue until the WU hits the 18h limit. What's wrong with the Sherpa job? Task-ID: https://lhcathome.cern.ch/lhcathome/result.php?resultid=145000933 Condor JobID: 3342613.0 MCPlots JobID: 37067675 Got the next Sherpa with the same output. Is it an output error or a job error? 2017-06-14 21:15:09 (28725): Guest Log: [INFO] Condor JobID: 3574607.0 in slot1 2017-06-14 21:15:14 (28725): Guest Log: [INFO] MCPlots JobID: 36820662 in slot1 ID: 30783 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 304,103,038 RAC: 113,237	Message 30784 - Posted: 15 Jun 2017, 8:08:41 UTC - in response to Message 30783. That Sherpa is still running (continuously since last evening) and the output is still "... Display update finished (0 histograms, 0 events). ..." (repeatedly). Will abort the WU now. https://lhcathome.cern.ch/lhcathome/result.php?resultid=145728371 ID: 30784 · Reply Quote
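
The symptom described above (the same "0 histograms, 0 events" line repeating for hours) could in principle be spotted automatically. Below is a rough Python sketch, assuming the job output is available as a list of text lines. The function name `looks_stuck` and the threshold of three repeats are arbitrary choices for illustration, not anything the project itself uses.

```python
def looks_stuck(lines,
                marker="Display update finished (0 histograms, 0 events).",
                threshold=3):
    """Return True if `threshold` consecutive display updates report zero events."""
    run = 0
    for line in lines:
        if "Display update finished" not in line:
            continue  # ignore unrelated output such as "Updating display..."
        if marker in line:
            run += 1
            if run >= threshold:
                return True
        else:
            run = 0  # a display update with progress resets the counter
    return False

stuck_output = [
    "Updating display...",
    "Display update finished (0 histograms, 0 events).",
] * 3
print(looks_stuck(stuck_output))
```

A job that is merely slow to start would also print a few zero-event updates, so any real threshold would need to be far higher than three.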

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 30791 - Posted: 15 Jun 2017, 13:16:24 UTC - in response to Message 30784. That Sherpa is still running (continuously since last evening) and the output is still "... Display update finished (0 histograms, 0 events). ..." (repeatedly). That was one for this thread: Theory's endless looping ID: 30791 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 304,103,038 RAC: 113,237	Message 30792 - Posted: 15 Jun 2017, 13:53:51 UTC - in response to Message 30791. Thank you. What is recommended? Abort WUs like that? [yes|no] Leave a post in the MB? [yes|no] ID: 30792 · Reply Quote

Ben Segal Volunteer moderator Project administrator Send message Joined: 1 Sep 04 Posts: 143 Credit: 2,579 RAC: 0	Message 30793 - Posted: 15 Jun 2017, 15:42:42 UTC - in response to Message 30792. Last modified: 15 Jun 2017, 15:43:48 UTC Thank you. What is recommended? Abort WUs like that? [yes|no] Leave a post in the MB? [yes|no] Yes and Yes! This is a problem with the Sherpa application and the Sherpa scientists do look at the Message Boards from time to time. ID: 30793 · Reply Quote

Ray Murray Volunteer moderator Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 30794 - Posted: 15 Jun 2017, 17:01:48 UTC - in response to Message 30792. Last modified: 15 Jun 2017, 17:24:07 UTC Or, rather than Aborting the task, you could collect the faulty job details then Reset the VM in Vbox so it will go and fetch a, hopefully, better job. Or the more fiddly "End Task Gracefully" option. ID: 30794 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 304,103,038 RAC: 113,237	Message 30796 - Posted: 15 Jun 2017, 18:07:25 UTC - in response to Message 30794. Or, rather than Aborting the task, you could collect the faulty job details then Reset the VM in Vbox so it will go and fetch a, hopefully, better job. Or the more fiddly "End Task Gracefully" option. Continued here. ID: 30796 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 30798 - Posted: 16 Jun 2017, 8:05:02 UTC - in response to Message 30793. This is a problem with the Sherpa application and the Sherpa scientists do look at the Message Boards from time to time. Is there a statistic showing how long a successful Sherpa job usually runs? ID: 30798 · Reply Quote

Rasputin42 Send message Joined: 26 Dec 09 Posts: 10 Credit: 1,202,105 RAC: 175	Message 30860 - Posted: 19 Jun 2017, 12:32:22 UTC Last modified: 19 Jun 2017, 12:36:20 UTC I have extended the 18h to a 24h Limit in the Theoryxxx.xml file (84600sec). So far, I have had a number of jobs finish after the 18h limit but before the new 24h one. BTW, there are no Theory jobs currently, either. ID: 30860 · Reply Quote
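
For reference, the 18 h limit corresponds to the `Job Duration: '64800.000000'` value seen in the logs earlier in the thread, and a full 24 h would be 86400 s. The 84600 s quoted above works out to 23.5 h. The Python sketch below shows the conversion and how such a value might be rewritten in an XML fragment. The element names `vbox_job` and `job_duration` are assumptions made for illustration; the thread names the file only as Theoryxxx.xml and does not show its contents.

```python
import xml.etree.ElementTree as ET

def hours_to_seconds(hours):
    """Convert a limit given in hours to whole seconds."""
    return int(hours * 3600)

def set_job_duration(xml_text, hours):
    """Return xml_text with the (assumed) <job_duration> element set to `hours`."""
    root = ET.fromstring(xml_text)
    node = root.find("job_duration")  # hypothetical element name
    if node is None:
        raise ValueError("no <job_duration> element found")
    node.text = str(hours_to_seconds(hours))
    return ET.tostring(root, encoding="unicode")

# Hypothetical fragment; the real file layout is not shown in the thread.
example = "<vbox_job><job_duration>64800</job_duration></vbox_job>"
updated = set_job_duration(example, 24)
print(updated)
print(84600 / 3600)  # hours represented by the value quoted in the post
```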

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 30880 - Posted: 19 Jun 2017, 16:14:23 UTC - in response to Message 30860. I have extended the 18h to a 24h Limit in the Theoryxxx.xml file (84600sec). I did the same some time ago for my slow 32-bit ATOM tablet: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10416365 ID: 30880 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 31352 - Posted: 11 Jul 2017, 10:11:23 UTC Already running 9 hours and another 27 hours to go: ===> [runRivet] Mon Jul 10 18:53:10 CEST 2017 [boinc pp uemb-hard 900 - - sherpa 1.3.1 default 100000 972] . . . Event 21900 ( 7h 34m 19s elapsed / 1d 3h 11s left ) -> ETA: Wed Jul 12 14:57 ID: 31352 · Reply Quote
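
The "time left" figure in the Sherpa output above is consistent with a simple linear extrapolation from the events processed so far. Here is a small Python sketch of that arithmetic, assuming the `100000` in the runRivet line is the total number of events requested; the function name `remaining_seconds` is made up for illustration.

```python
def remaining_seconds(events_done, events_total, elapsed_s):
    """Linear estimate: seconds per event so far times the events still to do."""
    rate = elapsed_s / events_done
    return rate * (events_total - events_done)

# Values from the post: event 21900 reached after 7h 34m 19s.
elapsed = 7 * 3600 + 34 * 60 + 19
left = remaining_seconds(21900, 100000, elapsed)

days, rem = divmod(int(left), 86400)
hours, rem = divmod(rem, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{days}d {hours}h {minutes}m {seconds}s left")
```

This reproduces the "1d 3h 11s left" shown in the job output, so a job at that pace needs roughly 35 hours in total, well beyond the default 18-hour limit.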

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 31361 - Posted: 12 Jul 2017, 6:48:49 UTC - in response to Message 31352. Already running 9 hours and another 27 hours to go: ===> [runRivet] Mon Jul 10 18:53:10 CEST 2017 [boinc pp uemb-hard 900 - - sherpa 1.3.1 default 100000 972] . . . Event 21900 ( 7h 34m 19s elapsed / 1d 3h 11s left ) -> ETA: Wed Jul 12 14:57 This task was not killed by BOINC's 18hrs duration limit (extended it), but by the automatic update and reboot of Win10 (bloody M$). ID: 31361 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2753 Credit: 304,103,038 RAC: 113,237	Message 31363 - Posted: 12 Jul 2017, 7:17:24 UTC - in response to Message 31361. This task was not killed by BOINC's 18hrs duration limit (extended it), but by the automatic update and reboot of Win10 ... . I guess it was 5 minutes before the ultimate answer to life, universe and everything was determined? ;-) ID: 31363 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 39966 - Posted: 19 Sep 2019, 6:35:40 UTC Thanks to the extension of the 18-hour limit, this Sherpa job could finish: 2019-09-17 19:20:20 (6464): Guest Log: [INFO] ===> [runRivet] Tue Sep 17 19:09:32 CEST 2019 [boinc pp ttbar 7000 - - sherpa 2.2.1 default 65000 108] . . . 2019-09-19 03:22:25 (5452): Guest Log: [INFO] Job finished in slot1 with 0. ID: 39966 · Reply Quote