Estimated Remaining Time Well Past Scheduled Due Date
Author | Message |
---|---|
Send message Joined: 9 May 10 Posts: 14 Credit: 3,104,485 RAC: 3 |
|
Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,040,123 RAC: 126,676 |
Whenever tasks from a new series get started, it takes a while until the times shown in BOINC settle into the right balance. Don't worry about it; just let all the tasks run. They will finish a lot earlier than (improperly) indicated. |
Send message Joined: 14 Jan 10 Posts: 1273 Credit: 8,480,147 RAC: 2,155 |
"I have a bunch of new theory apps running whose estimated completion date is well past the due date. Should I continue with these or cancel them?" Yes, keep these tasks running. The deadline and the job duration in Theory_2019_10_01.xml are both the same: 10 days (causing a high-priority run), but the actual job duration is always unknown. In the previous version the VM ran several jobs with a 30-day deadline and was killed after 18 hours; now the VM runs only one job, which could take anywhere from a few minutes to several days. Laurence, could you reduce the job duration to a more realistic value like 360000 (100 hours) and keep the deadline like it is? |
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0 |
"Laurence, could you reduce the job duration to a more realistic value like 360000 (100 hours) and keep the deadline like it is?" I have released a new version, 300.01, which should have this new setting. |
Send message Joined: 14 Jan 10 Posts: 1273 Credit: 8,480,147 RAC: 2,155 |
"Laurence, could you reduce the job duration to a more realistic value like 360000 (100 hours) and keep the deadline like it is?" This is true for the 32-bit version in Theory32_2019_11_13.xml, but not for the 64-bit version in Theory_2019_11_13.xml. The job duration there is still 864000 seconds. |
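The two job-duration values discussed above are given in seconds, which makes them hard to compare at a glance. A minimal sketch of the conversion follows; the helper name `describe` is purely illustrative and is not part of BOINC or the Theory application.

```python
from datetime import timedelta


def describe(seconds: int) -> str:
    """Format a duration given in seconds as hours and days."""
    hours = timedelta(seconds=seconds).total_seconds() / 3600
    return f"{seconds} s = {hours:g} h = {hours / 24:g} d"


# The proposed job duration and the value still found in the 64-bit file.
print(describe(360_000))
print(describe(864_000))
```

As stated in the posts, 360000 seconds corresponds to 100 hours, while 864000 seconds is 10 days, the same as the deadline.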
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0 |
"This is true for the 32-bit version in Theory32_2019_11_13.xml, but not for the 64-bit version in Theory_2019_11_13.xml." Sorry, the wrong file was changed. Please try with v300.02. |
Send message Joined: 9 May 10 Posts: 14 Credit: 3,104,485 RAC: 3 |
I see. Well then, I won't worry about it. I can't remember the last time I saw this, and I have noticed that the "estimated time remaining" indicators have not always been super accurate. The process is clearer to me now that it has been described. Cheers, Colin |
Send message Joined: 12 Jun 18 Posts: 126 Credit: 53,853,596 RAC: 111,798 |
I have a dozen or so nT 1.01 WUs that have been running for over 2 days. The CPU usage is jumping around in the 40-60% range. Will these ever converge on a solution or should I Abort them??? |
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0 |
"I have a dozen or so nT 1.01 WUs that have been running for over 2 days. The CPU usage is jumping around in the 40-60% range. Will these ever converge on a solution or should I Abort them???" You can take a look at the runRivet.log in the slot directory to see what the job is doing. |
Send message Joined: 12 Jun 18 Posts: 126 Credit: 53,853,596 RAC: 111,798 |
"I have a dozen or so nT 1.01 WUs that have been running for over 2 days. The CPU usage is jumping around in the 40-60% range. Will these ever converge on a solution or should I Abort them???" "You can take a look at the runRivet.log in the slot directory to see what the job is doing." What am I looking for??? The first rig had 75 folders (most empty) in /slots/ and the first runRivet.log I found has 6,022 lines in it. I scrolled through them and see nothing that tells me anything. |
Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,444,584 RAC: 123,615 |
First line of a typical runRivet.log:
===> [runRivet] Thu Nov 21 10:27:33 UTC 2019 [boinc pp ue 900 - - pythia8 8.235 default-DL 100000 187]
The value 100000 tells you that the task will simulate 100000 events.
Last lines of the same logfile:
45000 events processed
dumping histograms...
45100 events processed
Typical Sherpa longrunners look different. Empty slots are nothing to worry about as long as your BOINC client doesn't complain about too many (>99) slots. |
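The check described above (planned events from the first line, processed events from the last lines) can be sketched in a few lines of Python. This assumes the event count is always the second-to-last field inside the final square brackets, which is only inferred from the example line quoted in this thread; the function names are hypothetical.

```python
import re
from typing import Optional


def planned_events(first_line: str) -> int:
    """Read the planned event count from the first line of runRivet.log.

    Assumption: the count is the second-to-last field inside the final
    square brackets, as in the example quoted in the thread.
    """
    fields = first_line.rsplit("[", 1)[1].rstrip().rstrip("]").split()
    return int(fields[-2])


def processed_events(log_text: str) -> Optional[int]:
    """Return the most recent 'N events processed' count, or None."""
    counts = re.findall(r"(\d+) events processed", log_text)
    return int(counts[-1]) if counts else None


first = ("===> [runRivet] Thu Nov 21 10:27:33 UTC 2019 "
         "[boinc pp ue 900 - - pythia8 8.235 default-DL 100000 187]")
tail = "45000 events processed\ndumping histograms...\n45100 events processed\n"

total = planned_events(first)
done = processed_events(tail)
if done is not None:
    print(f"{done} of {total} events ({done / total:.1%})")
```

For the example log this gives 45100 of 100000 events. Sherpa logs report progress differently, so `processed_events` may return `None` for them.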
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0 |
"The first rig had 75 folders (most empty) in /slots/ and the first runRivet.log I found has 6,022 lines in it. I scrolled through them and see nothing that tells me anything." Run tail -f on that file and check that new lines are being written, that they differ from one another, and that the program appears to be moving forward. The first line of that file will say what job it is. Post the first line and the last 10 lines here. |
Send message Joined: 12 Jun 18 Posts: 126 Credit: 53,853,596 RAC: 111,798 |
First line:
===> [runRivet] Wed Nov 20 03:09:11 UTC 2019 [boinc pp jets 13000 180,-,3560 - sherpa 2.2.1 default 10000 148]
Last lines:
Display update finished (0 histograms, 0 events).
3.33645e-10 pb +- ( 1.15035e-11 pb = 3.44781 % ) 6740000 ( 99680219 -> 6.9 % ) integration time: ( 1d 6h 55m 21s elapsed / 2d 10h 36m 32s left ) [10:57:28]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
I assume this means Abort. |
Send message Joined: 12 Jun 18 Posts: 126 Credit: 53,853,596 RAC: 111,798 |
aurum@Rig-04:/var/lib/boinc-client/slots/5/cernvm/shared$ tail -f runRivet.log
3.3369e-10 pb +- ( 1.14707e-11 pb = 3.43753 % ) 6760000 ( 99966130 -> 6.9 % ) integration time: ( 1d 7h 11s elapsed / 2d 10h 12m 30s left ) [11:02:19]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
22 nT WUs await their fate... |
Send message Joined: 20 Jun 14 Posts: 373 Credit: 238,712 RAC: 0 |
Sherpa jobs have a reputation for being long runners, but it looks from the log that it might be finished in 2 days. I have one too at the moment, which is a bit annoying as I am testing things, so I might have to abort it. I will leave it to others who have more experience with watching them to comment. |
Send message Joined: 2 May 07 Posts: 2090 Credit: 158,816,631 RAC: 127,244 |
We have a thread on Sherpa and -native: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4979 |
Send message Joined: 14 Jan 10 Posts: 1273 Credit: 8,480,147 RAC: 2,155 |
"First line: ===> [runRivet] Wed Nov 20 03:09:11 UTC 2019 [boinc pp jets 13000 180,-,3560 - sherpa 2.2.1 default 10000 148]" This 148th attempt could be successful. From old figures with this job description: out of 54 attempts, 18 were successful, 6 failed and 30 were lost. |
Send message Joined: 20 Nov 19 Posts: 21 Credit: 1,074,330 RAC: 0 |
Right now I have 3 Theory Native 1.01 tasks that have been running more than a day. They are in slots 1, 2, and 8.

Slot 1 properties from BOINC Manager:
Application: Theory Native 1.01 (native_theory)
Name: TheoryN_2279-770870-156
State: Running
Received: Thu 21 Nov 2019 12:25:33 PM CST
Report deadline: Sun 01 Dec 2019 12:25:31 PM CST
Estimated computation size: 3,600 GFLOPs
CPU time: 1d 06:52:19
CPU time since checkpoint: 1d 06:52:19
Elapsed time: 1d 05:28:37
Estimated time remaining: 00:00:00
Fraction done: 99.999%
Virtual memory size: 528.60 MB
Working set size: 50.60 MB
Directory: slots/1
Process ID: 3199
Progress rate: 3.240% per hour
Executable: wrapper_2019_03_02_x86_64-linux

Tail from runRivet.log:
$ sudo tail -f /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).

Slot 2 properties from BOINC Manager:
Application: Theory Native 1.01 (native_theory)
Name: TheoryN_2279-750936-155
State: Running
Received: Thu 21 Nov 2019 12:25:33 PM CST
Report deadline: Sun 01 Dec 2019 12:25:31 PM CST
Estimated computation size: 3,600 GFLOPs
CPU time: 1d 06:27:03
CPU time since checkpoint: 1d 06:27:03
Elapsed time: 1d 05:31:57
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 600.85 MB
Working set size: 63.33 MB
Directory: slots/2
Process ID: 3201
Progress rate: 3.240% per hour
Executable: wrapper_2019_03_02_x86_64-linux

Tail from runRivet.log:
$ sudo tail -f /var/lib/boinc-client/slots/2/cernvm/shared/runRivet.log
3.8625e+14 pb +- ( 1.00728e+14 pb = 26.0785 % ) 772720000 ( 772720138 -> 99.9 % ) integration time: ( 1d 6h 4m 1s elapsed / 851d 1h 41m 25s left ) [14:51:05]
3.8624e+14 pb +- ( 1.00725e+14 pb = 26.0785 % ) 772740000 ( 772740138 -> 99.9 % ) integration time: ( 1d 6h 4m 5s elapsed / 851d 2h 21m 49s left ) [14:51:09]
3.8623e+14 pb +- ( 1.00723e+14 pb = 26.0785 % ) 772760000 ( 772760138 -> 99.9 % ) integration time: ( 1d 6h 4m 7s elapsed / 851d 2h 50m 55s left ) [14:51:11]
3.8622e+14 pb +- ( 1.0072e+14 pb = 26.0785 % ) 772780000 ( 772780138 -> 99.9 % ) integration time: ( 1d 6h 4m 11s elapsed / 851d 3h 32m 34s left ) [14:51:15]
3.8621e+14 pb +- ( 1.00718e+14 pb = 26.0785 % ) 772800000 ( 772800138 -> 99.9 % ) integration time: ( 1d 6h 4m 15s elapsed / 851d 4h 14m left ) [14:51:19]
3.862e+14 pb +- ( 1.00715e+14 pb = 26.0785 % ) 772820000 ( 772820138 -> 99.9 % ) integration time: ( 1d 6h 4m 18s elapsed / 851d 4h 55m 53s left ) [14:51:22]

Slot 8 properties from BOINC Manager:
Application: Theory Native 1.01 (native_theory)
Name: TheoryN_2279-750240-149
State: Running
Received: Thu 21 Nov 2019 12:25:33 PM CST
Report deadline: Sun 01 Dec 2019 12:25:32 PM CST
Estimated computation size: 3,600 GFLOPs
CPU time: 1d 06:29:32
CPU time since checkpoint: 1d 06:29:32
Elapsed time: 1d 05:34:16
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 600.85 MB
Working set size: 63.72 MB
Directory: slots/8
Process ID: 3200
Progress rate: 3.240% per hour
Executable: wrapper_2019_03_02_x86_64-linux

Tail from runRivet.log:
$ sudo tail -f /var/lib/boinc-client/slots/8/cernvm/shared/runRivet.log
4.63146e+15 pb +- ( 1.72895e+15 pb = 37.3305 % ) 773880000 ( 773880400 -> 99.9 % ) integration time: ( 1d 6h 6m 16s elapsed / 1747d 9h 55m 55s left ) [14:53:11]
4.63134e+15 pb +- ( 1.7289e+15 pb = 37.3305 % ) 773900000 ( 773900400 -> 99.9 % ) integration time: ( 1d 6h 6m 20s elapsed / 1747d 11h 23m 54s left ) [14:53:15]
4.63122e+15 pb +- ( 1.72886e+15 pb = 37.3305 % ) 773920000 ( 773920400 -> 99.9 % ) integration time: ( 1d 6h 6m 24s elapsed / 1747d 12h 45m 50s left ) [14:53:18]
4.63111e+15 pb +- ( 1.72881e+15 pb = 37.3305 % ) 773940000 ( 773940400 -> 99.9 % ) integration time: ( 1d 6h 6m 27s elapsed / 1747d 14h 11m 2s left ) [14:53:22]
4.63099e+15 pb +- ( 1.72877e+15 pb = 37.3305 % ) 773960000 ( 773960400 -> 99.9 % ) integration time: ( 1d 6h 6m 31s elapsed / 1747d 15h 35m 17s left ) [14:53:26]
Updating display...
Display update finished (0 histograms, 0 events).
4.63087e+15 pb +- ( 1.72872e+15 pb = 37.3305 % ) 773980000 ( 773980400 -> 99.9 % ) integration time: ( 1d 6h 6m 34s elapsed / 1747d 17h 1s left ) [14:53:30]
4.63075e+15 pb +- ( 1.72868e+15 pb = 37.3305 % ) 774000000 ( 774000400 -> 99.9 % ) integration time: ( 1d 6h 6m 37s elapsed / 1747d 18h 7s left ) [14:53:32]
4.63063e+15 pb +- ( 1.72864e+15 pb = 37.3305 % ) 774020000 ( 774020400 -> 99.9 % ) integration time: ( 1d 6h 6m 39s elapsed / 1747d 18h 52m 6s left ) [14:53:35]

It seems that slot 1 has stalled but slots 2 and 8 want to run for a few more years. Is it OK to abort? |
Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,444,584 RAC: 123,615 |
Slot 1:
"Report deadline Sun 01 Dec 2019 12:25:31 PM CST"
Lots of time to finish the task and get rewarded.
"CPU time 1d 06:52:19" "Elapsed time 1d 05:28:37"
Roughly equal. This means the task is doing well.
"Estimated time remaining 00:00:00" "Fraction done 99.999%" "Progress rate 3.240% per hour"
Unreliable fake! The BOINC client has no interface to look into the scientific app. Check the original logfiles instead.
Your logfile tail: "Display update finished (127 histograms, 72000 events)."
Line 1 is missing. It would tell you how many events are planned to be calculated. Most tasks plan 100000 events. In this case 72000 are already finished within 1 day and there's lots of time until the deadline is reached. Suggested decision: Let the task run.

Slot 2:
"integration time: ( 1d 6h 4m 15s elapsed / 851d 4h 14m left ) [14:51:19]"
More than 851d left and increasing. Most likely a longrunner that will not finish before the deadline.

Slot 8:
"integration time: ( 1d 6h 6m 37s elapsed / 1747d 18h 7s left ) [14:53:32]"
More than 1747d left and increasing. Most likely a longrunner that will not finish before the deadline. |
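The reasoning for slots 2 and 8 (a "left" estimate that is far beyond the deadline and still growing) can be checked mechanically. The sketch below parses the "integration time" lines quoted in this thread; the regular expressions are inferred from those quoted lines only, and the helper names are hypothetical, not part of Sherpa or BOINC.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text: str) -> int:
    """Convert a duration such as '851d 4h 14m' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u]
               for n, u in re.findall(r"(\d+)([dhms])", text))


def time_left(line: str):
    """Return the 'left' estimate of a log line in seconds, or None."""
    match = re.search(r"elapsed\s*/\s*(.+?)\s*left", line)
    return to_seconds(match.group(1)) if match else None


def is_growing(estimates) -> bool:
    """True if every estimate is larger than the one before it."""
    return len(estimates) > 1 and all(
        later > earlier for earlier, later in zip(estimates, estimates[1:]))


# Two consecutive lines from the slot 2 log quoted above.
lines = [
    "integration time: ( 1d 6h 4m 1s elapsed / 851d 1h 41m 25s left ) [14:51:05]",
    "integration time: ( 1d 6h 4m 5s elapsed / 851d 2h 21m 49s left ) [14:51:09]",
]
estimates = [t for t in map(time_left, lines) if t is not None]
print(estimates, "growing" if is_growing(estimates) else "not growing")
```

Comparing the latest estimate with the number of seconds remaining until the report deadline would then give the same "will not finish in time" verdict as above. A shrinking estimate, as in the earlier log with about 2 days left, points the other way.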
Send message Joined: 13 Jul 05 Posts: 167 Credit: 14,938,551 RAC: 191 |
From task TheoryN_2279-750936-185_2: ===> [runRivet] Mon Nov 25 17:09:09 UTC 2019 [boinc ee zhad 200 - - sherpa 2.2.5 default 2000 185... 13.9764 pb +- ( 0.16319 pb = 1.16761 % ) 740000 ( 740000 -> 100 % )... it then gently creeps up (mostly), until... 3.3477e+14 pb +- ( 9.67854e+13 pb = 28.911 % ) 367040000 ( 367040060 -> 99.9 % ) If the time remaining is increasing then that's always going to be a bad sign, but I've had long-runners where the predicted time left does gradually reduce at a realistic rate for extended periods of time. I'm wondering if there's also some diagnostic value in the sudden jumps (by factors of, rather than fractions of) the predicted time remaining? |
©2024 CERN