Sherpa job hit 18 hrs limit
Author | Message |
---|---|
Joined: 15 Jun 08 Posts: 2549 Credit: 255,472,638 RAC: 67,871 |
This Sherpa job ran for many hours and was killed when the WU hit the 18 hrs limit. How can things like that be avoided? https://lhcathome.cern.ch/lhcathome/result.php?resultid=136727247 2017-04-25 23:20:32 (21421): Guest Log: [INFO] New Job Starting in slot2 |
Joined: 2 May 07 Posts: 2245 Credit: 174,025,522 RAC: 9,726 |
A search for "sherpa" across all messages turns up two things: 1. Sherpa sometimes runs in a loop; see the messages from Crystal Pellet. 2. Sherpa jobs need a lot of runtime, and that runtime is not easy to estimate in advance. |
Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
"How can things like that be avoided?" It can be avoided, but it is not useful to micromanage such tasks. When a Sherpa job, or a job using another type of generator, runs very long and does not report back a result, a new job is created, most of the time with fewer events. That way a subsequent job has a better chance of finishing on time. It means that your long-running job is not totally worthless. |
Joined: 15 Jun 08 Posts: 2549 Credit: 255,472,638 RAC: 67,871 |
Thank you. "It means that your long running job is not totally worthless." I'm glad to read this. :-) |
Joined: 2 May 07 Posts: 2245 Credit: 174,025,522 RAC: 9,726 |
This Sherpa job sat idle for about 15 hours:
2017-06-02 07:35:33 (10084): Guest Log: [INFO] New Job Starting in slot1
2017-06-02 07:35:33 (10084): Guest Log: [INFO] Condor JobID: 3421419.0 in slot1
2017-06-02 07:35:38 (10084): Guest Log: [INFO] MCPlots JobID: 37145519 in slot1
2017-06-02 07:54:54 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 07:54:54 (10084): Status Report: Elapsed Time: '6000.121424'
2017-06-02 07:54:54 (10084): Status Report: CPU Time: '4839.166620'
2017-06-02 09:35:35 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 09:35:35 (10084): Status Report: Elapsed Time: '12000.121424'
2017-06-02 09:35:35 (10084): Status Report: CPU Time: '5371.442032'
2017-06-02 11:16:06 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 11:16:06 (10084): Status Report: Elapsed Time: '18000.121424'
2017-06-02 11:16:06 (10084): Status Report: CPU Time: '5909.739083'
2017-06-02 12:56:45 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 12:56:45 (10084): Status Report: Elapsed Time: '24000.121424'
2017-06-02 12:56:45 (10084): Status Report: CPU Time: '6451.670957'
2017-06-02 14:37:35 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 14:37:35 (10084): Status Report: Elapsed Time: '30000.745533'
2017-06-02 14:37:35 (10084): Status Report: CPU Time: '6959.469812'
2017-06-02 16:18:11 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 16:18:11 (10084): Status Report: Elapsed Time: '36000.745533'
2017-06-02 16:18:11 (10084): Status Report: CPU Time: '7484.943580'
2017-06-02 17:59:02 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 17:59:02 (10084): Status Report: Elapsed Time: '42000.745533'
2017-06-02 17:59:02 (10084): Status Report: CPU Time: '8009.153740'
2017-06-02 19:39:28 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 19:39:28 (10084): Status Report: Elapsed Time: '48000.745533'
2017-06-02 19:39:28 (10084): Status Report: CPU Time: '8555.687644'
2017-06-02 21:20:06 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 21:20:06 (10084): Status Report: Elapsed Time: '54000.745533'
2017-06-02 21:20:06 (10084): Status Report: CPU Time: '9090.396671'
2017-06-02 23:00:44 (10084): Status Report: Job Duration: '64800.000000'
2017-06-02 23:00:44 (10084): Status Report: Elapsed Time: '60000.745533'
2017-06-02 23:00:44 (10084): Status Report: CPU Time: '9625.994905'
2017-06-03 00:21:43 (10084): Powering off VM.
2017-06-03 00:26:46 (10084): VM did not power off when requested.
2017-06-03 00:26:46 (10084): VM was successfully terminated.
2017-06-03 00:26:46 (10084): Deregistering VM. (boinc_494018919c91d7cc, slot#0)
2017-06-03 00:26:47 (10084): Removing network bandwidth throttle group from VM.
2017-06-03 00:26:47 (10084): Removing storage controller(s) from VM.
2017-06-03 00:26:47 (10084): Removing VM from VirtualBox.
2017-06-03 00:26:47 (10084): Removing virtual disk drive from VirtualBox.
00:26:53 (10084): called boinc_finish(0) |
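The Status Report lines in the log above give both wall-clock and CPU time, so the idle period can be read off by comparing how much CPU time accrues between consecutive reports. The following is only a sketch: the function name `interval_utilisation` is hypothetical, and the regular expression assumes the report lines look exactly like those quoted in this post.

```python
import re

# Assumes the "Status Report" format quoted in the post above.
FIELD = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")

def interval_utilisation(log_text):
    """Return (elapsed_at_end, cpu_fraction) for each pair of consecutive reports."""
    elapsed, cpu = [], []
    for name, value in FIELD.findall(log_text):
        (elapsed if name == "Elapsed Time" else cpu).append(float(value))
    reports = list(zip(elapsed, cpu))
    result = []
    for (e0, c0), (e1, c1) in zip(reports, reports[1:]):
        if e1 > e0:  # skip duplicated or out-of-order reports
            result.append((e1, (c1 - c0) / (e1 - e0)))
    return result

# First three reports from the log, copied verbatim.
sample = """\
2017-06-02 07:54:54 (10084): Status Report: Elapsed Time: '6000.121424'
2017-06-02 07:54:54 (10084): Status Report: CPU Time: '4839.166620'
2017-06-02 09:35:35 (10084): Status Report: Elapsed Time: '12000.121424'
2017-06-02 09:35:35 (10084): Status Report: CPU Time: '5371.442032'
2017-06-02 11:16:06 (10084): Status Report: Elapsed Time: '18000.121424'
2017-06-02 11:16:06 (10084): Status Report: CPU Time: '5909.739083'
"""

intervals = interval_utilisation(sample)
for end, fraction in intervals:
    print(f"up to {end:.0f} s elapsed: CPU busy {fraction:.1%} of the interval")
```

A fraction that stays far below 100% across many intervals suggests the job is waiting rather than computing, which matches the near-idle behaviour described here.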
Joined: 15 Jun 08 Posts: 2549 Credit: 255,472,638 RAC: 67,871 |
I got a Theory WU this morning that ran 3 jobs successfully. The 4th job, a Sherpa, has now been running for more than 840 min at 100% CPU usage. The job output is: Updating display... I guess this will continue until the WU hits the 18h limit. What's wrong with the Sherpa job? Task-ID: https://lhcathome.cern.ch/lhcathome/result.php?resultid=145000933 Condor JobID: 3342613.0 MCPlots JobID: 37067675 |
Joined: 15 Jun 08 Posts: 2549 Credit: 255,472,638 RAC: 67,871 |
"I got a Theory WU this morning that ran 3 jobs successfully." Got the next Sherpa with the same output. Is it an output error or a job error? 2017-06-14 21:15:09 (28725): Guest Log: [INFO] Condor JobID: 3574607.0 in slot1 |
Joined: 15 Jun 08 Posts: 2549 Credit: 255,472,638 RAC: 67,871 |
That Sherpa is still running (continuously since last evening) and the output is still "... Display update finished (0 histograms, 0 events). ..." (repeatedly). Will abort the WU now. https://lhcathome.cern.ch/lhcathome/result.php?resultid=145728371 |
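A looping job of this kind could in principle be spotted automatically from the job output. The sketch below assumes the output contains lines of the form quoted above ("Display update finished (0 histograms, 0 events)"); the function name `looks_stuck` and the threshold of 20 reports are hypothetical choices, and a healthy job may legitimately report 0 events for a while at start-up, so the threshold would need tuning.

```python
import re

# Assumes the display-update wording quoted in the post above.
REPORT = re.compile(r"Display update finished \((\d+) histograms, (\d+) events\)")

def looks_stuck(lines, min_reports=20):
    """True if the most recent `min_reports` display updates all reported 0 events."""
    events = []
    for line in lines:
        match = REPORT.search(line)
        if match:
            events.append(int(match.group(2)))
    recent = events[-min_reports:]
    return len(recent) >= min_reports and all(count == 0 for count in recent)

# Synthetic output for illustration only, not taken from a real job.
stuck_output = ["Updating display...",
                "Display update finished (0 histograms, 0 events)."] * 20
healthy_output = stuck_output + ["Display update finished (12 histograms, 500 events)."]

print(looks_stuck(stuck_output))    # True
print(looks_stuck(healthy_output))  # False
```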
Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
"That Sherpa is still running (continuously since last evening) and the output is still" That one belongs in this thread: Theory's endless looping |
Joined: 15 Jun 08 Posts: 2549 Credit: 255,472,638 RAC: 67,871 |
Thank you. What is recommended? Abort WUs like that? [yes|no] Leave a post in the MB? [yes|no] |
Joined: 1 Sep 04 Posts: 140 Credit: 2,579 RAC: 0 |
Thank you. Yes and Yes! This is a problem with the Sherpa application and the Sherpa scientists do look at the Message Boards from time to time. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0 |
Or, rather than aborting the task, you could collect the faulty job details and then reset the VM in VirtualBox so that it fetches a, hopefully, better job. Or use the more fiddly "End Task Gracefully" option. |
Joined: 15 Jun 08 Posts: 2549 Credit: 255,472,638 RAC: 67,871 |
"Or, rather than Aborting the task, you could collect the faulty job details then Reset the VM in Vbox so it will go and fetch a, hopefully, better job." Continued here. |
Joined: 2 May 07 Posts: 2245 Credit: 174,025,522 RAC: 9,726 |
"This is a problem with the Sherpa application and the Sherpa scientists do look at the Message Boards from time to time." Are there any statistics showing how long a successful Sherpa job typically runs? |
Joined: 26 Dec 09 Posts: 10 Credit: 1,192,862 RAC: 0 |
I have extended the limit from 18 h to 24 h in the Theoryxxx.xml file (84600 sec). So far, I have had a number of jobs finish after the 18 h limit but before the new 24 h one. BTW, there are no Theory jobs available currently, either. |
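As a quick check of the numbers in this post: the logs earlier in the thread report a Job Duration of 64800 seconds, which is exactly 18 hours. A full 24 hours is 86400 seconds, while the quoted 84600 seconds comes to 23.5 hours. Whether 84600 is what was actually written into the file or just a transposed digit in the post cannot be told from the thread. The thread also does not name the XML element that holds the limit, so this sketch covers only the arithmetic.

```python
def hours_to_seconds(hours):
    """Convert a duration in hours to whole seconds."""
    return int(hours * 3600)

default_limit = hours_to_seconds(18)  # matches "Job Duration: '64800.000000'" in the logs
full_day = hours_to_seconds(24)
quoted_value = 84600                  # value as written in the post

print(default_limit, full_day, quoted_value / 3600)  # 64800 86400 23.5
```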
Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
"I have extended the 18h to a 24h Limit in the Theoryxxx.xml file (84600sec)." I did the same some time ago for my slow 32-bit ATOM tablet: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10416365 |
Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
Already running 9 hours and another 27 hours to go: ===> [runRivet] Mon Jul 10 18:53:10 CEST 2017 [boinc pp uemb-hard 900 - - sherpa 1.3.1 default 100000 972] . . . Event 21900 ( 7h 34m 19s elapsed / 1d 3h 11s left ) -> ETA: Wed Jul 12 14:57 |
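The "left" figure in this output is consistent with a simple linear extrapolation from the events processed so far. The sketch below assumes that the 100000 in the runRivet line is the total number of events requested (the post does not say so explicitly); the function name `remaining_time` is hypothetical.

```python
from datetime import timedelta

def remaining_time(events_done, events_total, elapsed):
    """Linearly extrapolate the time still needed from the progress so far."""
    if events_done <= 0:
        raise ValueError("need at least one processed event to extrapolate")
    seconds_left = elapsed.total_seconds() * (events_total - events_done) / events_done
    return timedelta(seconds=int(seconds_left))

# Figures from the post: event 21900 after 7h 34m 19s; 100000 events assumed in total.
elapsed = timedelta(hours=7, minutes=34, seconds=19)
left = remaining_time(21900, 100000, elapsed)
print(left)  # 1 day, 3:00:11
```

The result, 1 day 3 h 0 min 11 s, agrees with the "1d 3h 11s left" shown in the job output. Such an estimate only holds as long as the per-event time stays roughly constant.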
Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
"Already running 9 hours and another 27 hours to go:" This task was not killed by BOINC's 18 hrs duration limit (I had extended it), but by the automatic update and reboot of Win10 (bloody M$). |
Joined: 15 Jun 08 Posts: 2549 Credit: 255,472,638 RAC: 67,871 |
"This task was not killed by BOINC's 18hrs duration limit (extended it), but by the automatic update and reboot of Win10 ... ." I guess it was 5 minutes before the ultimate answer to life, the universe and everything was determined? ;-) |
Joined: 14 Jan 10 Posts: 1429 Credit: 9,541,076 RAC: 5,106 |
Thanks to the extension of the 18-hour limit, this Sherpa job was able to finish: 2019-09-17 19:20:20 (6464): Guest Log: [INFO] ===> [runRivet] Tue Sep 17 19:09:32 CEST 2019 [boinc pp ttbar 7000 - - sherpa 2.2.1 default 65000 108] . . . 2019-09-19 03:22:25 (5452): Guest Log: [INFO] Job finished in slot1 with 0. |
©2025 CERN