Message boards : Theory Application : Estimated Remaining Time Well Past Scheduled Due Date
TribbleRED

Joined: 9 May 10
Posts: 2
Credit: 1,640,694
RAC: 469
Message 40414 - Posted: 13 Nov 2019, 5:13:34 UTC

I have a bunch of new Theory apps running whose estimated completion date is well past the due date. Should I continue with these or cancel them?

Erich56

Joined: 18 Dec 15
Posts: 1306
Credit: 23,558,346
RAC: 8,376
Message 40415 - Posted: 13 Nov 2019, 5:28:43 UTC - in response to Message 40414.  

Whenever tasks from a new series get started, it takes a while until the times shown in BOINC settle into the right balance.
Don't worry about it; just let all the tasks run. They will finish a lot earlier than (improperly) indicated.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 978
Credit: 6,378,693
RAC: 323
Message 40417 - Posted: 13 Nov 2019, 6:48:39 UTC - in response to Message 40414.  
Last modified: 13 Nov 2019, 8:55:57 UTC

I have a bunch of new Theory apps running whose estimated completion date is well past the due date. Should I continue with these or cancel them?
Yes, keep these tasks running. The deadline and the job duration in Theory_2019_10_01.xml are both the same: 10 days (which causes the tasks to run at high priority), but the actual job duration is always unknown.
In the previous version the VM ran several jobs with a 30-day deadline and was killed after 18 hours; now the VM runs only one job, which could take anywhere from a few minutes to several days.

Laurence, could you reduce the job duration to a more realistic value like 360000 (100 hours) and keep the deadline like it is?
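For reference, the job-duration figures mentioned above convert as shown in this quick Python sketch. It only does the arithmetic; the constant and function names are illustrative and are not taken from the project's XML files.

```python
# Values quoted in the thread, in seconds.
CURRENT_JOB_DURATION_S = 864000   # equal to the 10-day deadline
PROPOSED_JOB_DURATION_S = 360000  # value suggested in the post above

def describe(seconds):
    """Return (days, hours) equivalents of a duration given in seconds."""
    return seconds / 86400, seconds / 3600

for label, secs in (("current", CURRENT_JOB_DURATION_S),
                    ("proposed", PROPOSED_JOB_DURATION_S)):
    days, hours = describe(secs)
    print(f"{label}: {secs} s = {hours:.0f} h = {days:.2f} d")
```

With the proposed value, the job duration would be well under half of the 10-day deadline, which is the gap the post asks for.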
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 40420 - Posted: 13 Nov 2019, 9:01:23 UTC - in response to Message 40417.  

Laurence, could you reduce the job duration to a more realistic value like 360000 (100 hours) and keep the deadline like it is?

I have released a new version 300.01 which should have this new setting.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 978
Credit: 6,378,693
RAC: 323
Message 40426 - Posted: 13 Nov 2019, 11:02:32 UTC - in response to Message 40420.  

Laurence, could you reduce the job duration to a more realistic value like 360000 (100 hours) and keep the deadline like it is?

I have released a new version 300.01 which should have this new setting.

This is true for the 32-bit version in Theory32_2019_11_13.xml, but not for the 64-bit version in Theory_2019_11_13.xml.
The job duration there is still 864000 seconds.
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 40428 - Posted: 13 Nov 2019, 11:36:10 UTC - in response to Message 40426.  

This is true for the 32-bit version in Theory32_2019_11_13.xml, but not for the 64-bit version in Theory_2019_11_13.xml.
The job duration there is still 864000 seconds.


Sorry, the wrong file was changed. Please try with v300.02.
TribbleRED

Joined: 9 May 10
Posts: 2
Credit: 1,640,694
RAC: 469
Message 40436 - Posted: 13 Nov 2019, 18:27:04 UTC - in response to Message 40417.  

I see. Well then, I won't worry about it. I can't remember the last time I saw this, and I have noticed that the "estimated time remaining" indicators have not always been super accurate. From the descriptions so far, I now understand the process more clearly.

Cheers,
Colin
Aurum
Joined: 12 Jun 18
Posts: 92
Credit: 37,970,693
RAC: 55
Message 40559 - Posted: 20 Nov 2019, 16:50:09 UTC

I have a dozen or so nT 1.01 WUs that have been running for over 2 days. The CPU usage is jumping around in the 40-60% range. Will these ever converge on a solution, or should I abort them?
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 40563 - Posted: 21 Nov 2019, 8:45:01 UTC - in response to Message 40559.  

I have a dozen or so nT 1.01 WUs that have been running for over 2 days. The CPU usage is jumping around in the 40-60% range. Will these ever converge on a solution, or should I abort them?


You can take a look at the runRivet.log in the slot directory to see what the job is doing.
Aurum
Joined: 12 Jun 18
Posts: 92
Credit: 37,970,693
RAC: 55
Message 40567 - Posted: 21 Nov 2019, 10:18:25 UTC - in response to Message 40563.  
Last modified: 21 Nov 2019, 10:25:41 UTC

I have a dozen or so nT 1.01 WUs that have been running for over 2 days. The CPU usage is jumping around in the 40-60% range. Will these ever converge on a solution, or should I abort them?
You can take a look at the runRivet.log in the slot directory to see what the job is doing.
What am I looking for???
The first rig had 75 folders (most empty) in /slots/ and the first runRivet.log I found has 6,022 lines in it. I scrolled through them and saw nothing that tells me anything.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1520
Credit: 85,578,127
RAC: 74,678
Message 40569 - Posted: 21 Nov 2019, 10:49:09 UTC - in response to Message 40567.  

1st line of a typical runRivet.log:
===> [runRivet] Thu Nov 21 10:27:33 UTC 2019 [boinc pp ue 900 - - pythia8 8.235 default-DL 100000 187]

The value 100000 tells you that the task will simulate 100000 events.

Last lines of the same logfile
45000 events processed
dumping histograms...
45100 events processed


Typical Sherpa longrunners look different.
Empty slots are nothing to worry about as long as your BOINC client doesn't complain about too many (>99) slots.
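A minimal Python sketch of the check described above. It assumes that the job description is the last bracketed group on line 1, that the planned event count is its second-to-last field, and that the generator name sits five fields from the end. These positions are inferred from the header lines quoted in this thread, not from any runRivet documentation, and the function names are made up for illustration.

```python
import re

# Matches progress lines such as "45100 events processed".
EVENTS_RE = re.compile(r"^\s*(\d+) events processed")

def parse_header(line):
    """Pull generator and planned event count from line 1 of runRivet.log.

    Field positions are an assumption based on the headers shown in the thread.
    """
    fields = line.rsplit("[", 1)[1].rstrip().rstrip("]").split()
    return {"generator": fields[-5], "planned_events": int(fields[-2])}

def events_done(lines):
    """Return the highest 'N events processed' count seen, or None."""
    counts = [int(m.group(1)) for m in map(EVENTS_RE.match, lines) if m]
    return max(counts) if counts else None

header = ("===> [runRivet] Thu Nov 21 10:27:33 UTC 2019 "
          "[boinc pp ue 900 - - pythia8 8.235 default-DL 100000 187]")
tail = ["45000 events processed", "dumping histograms...", "45100 events processed"]

info = parse_header(header)
done = events_done(tail)
print(f"{info['generator']}: {done} of {info['planned_events']} events "
      f"({100 * done / info['planned_events']:.1f} %)")
```

Sherpa tasks that are still in their integration phase write no "events processed" lines, so `events_done` returns `None` for them and the percentage cannot be computed this way.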
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 40571 - Posted: 21 Nov 2019, 10:57:31 UTC - in response to Message 40567.  
Last modified: 21 Nov 2019, 10:57:58 UTC

The first rig had 75 folders (most empty) in /slots/ and the first runRivet.log I found has 6,022 lines in it. I scrolled through them and saw nothing that tells me anything.

Run tail -f on that file and check that new lines are being written, that they differ from one another, and that the program appears to be moving forward. The first line of the file says which job it is. Post the first line and the last 10 lines here.
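For anyone without tail at hand, the "first line plus last 10 lines" excerpt requested above can be collected with a short Python sketch like the following. The function name is hypothetical, and the log path has to be supplied by the user (the thread shows paths under the BOINC slots directory, which differ per machine).

```python
import sys
from collections import deque

def head_and_tail(path, tail_lines=10):
    """Return (first_line, last_lines) of a text file without loading it all."""
    with open(path, "r", errors="replace") as fh:
        first = fh.readline().rstrip("\n")
        # A bounded deque keeps only the most recent `tail_lines` lines.
        last = deque(fh, maxlen=tail_lines)
    return first, [line.rstrip("\n") for line in last]

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python excerpt.py /path/to/runRivet.log
    first, last = head_and_tail(sys.argv[1])
    print("First line:")
    print(first)
    print("Last lines:")
    print("\n".join(last))
```

Unlike tail -f, this takes a single snapshot; running it twice a few minutes apart shows whether new lines are still being written.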
Aurum
Joined: 12 Jun 18
Posts: 92
Credit: 37,970,693
RAC: 55
Message 40572 - Posted: 21 Nov 2019, 11:03:46 UTC

First line:
===> [runRivet] Wed Nov 20 03:09:11 UTC 2019 [boinc pp jets 13000 180,-,3560 - sherpa 2.2.1 default 10000 148]

Last lines:
Display update finished (0 histograms, 0 events).
3.33645e-10 pb +- ( 1.15035e-11 pb = 3.44781 % ) 6740000 ( 99680219 -> 6.9 % )
integration time:  ( 1d 6h 55m 21s elapsed / 2d 10h 36m 32s left ) [10:57:28]   
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
I assume this means Abort.
Aurum
Joined: 12 Jun 18
Posts: 92
Credit: 37,970,693
RAC: 55
Message 40573 - Posted: 21 Nov 2019, 11:06:56 UTC
Last modified: 21 Nov 2019, 11:40:51 UTC

aurum@Rig-04:/var/lib/boinc-client/slots/5/cernvm/shared$ tail -f runRivet.log
3.3369e-10 pb +- ( 1.14707e-11 pb = 3.43753 % ) 6760000 ( 99966130 -> 6.9 % )
integration time:  ( 1d 7h 11s elapsed / 2d 10h 12m 30s left ) [11:02:19]   
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
22 nT WUs await their fate...
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 337
Credit: 237,918
RAC: 0
Message 40574 - Posted: 21 Nov 2019, 12:32:02 UTC - in response to Message 40572.  

Sherpa jobs have a reputation for being long runners, but from the log it looks as if this one might finish in 2 days. I have one too at the moment, which is a bit annoying as I am testing things, so I might have to abort it. I will leave it to others who have more experience with watching them to comment.
maeax

Joined: 2 May 07
Posts: 1005
Credit: 34,848,150
RAC: 11,475
Message 40575 - Posted: 21 Nov 2019, 13:15:21 UTC

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 978
Credit: 6,378,693
RAC: 323
Message 40576 - Posted: 21 Nov 2019, 13:16:26 UTC - in response to Message 40572.  

First line:
===> [runRivet] Wed Nov 20 03:09:11 UTC 2019 [boinc pp jets 13000 180,-,3560 - sherpa 2.2.1 default 10000 148]
This 148th attempt could be successful.
From old figures with this job description: out of 54 attempts, 18 were successful, 6 failed and 30 were lost.
lazlo_vii
Joined: 20 Nov 19
Posts: 21
Credit: 1,074,330
RAC: 0
Message 40630 - Posted: 24 Nov 2019, 14:56:51 UTC

Right now I have 3 Theory Native 1.01 tasks that have been running for more than a day. They are in slots 1, 2, and 8.

Slot 1 properties from Boinc Manager:

Application: Theory Native 1.01 (native_theory)
Name: TheoryN_2279-770870-156
State: Running
Received: Thu 21 Nov 2019 12:25:33 PM CST
Report deadline: Sun 01 Dec 2019 12:25:31 PM CST
Estimated computation size: 3,600 GFLOPs
CPU time: 1d 06:52:19
CPU time since checkpoint: 1d 06:52:19
Elapsed time: 1d 05:28:37
Estimated time remaining: 00:00:00
Fraction done: 99.999%
Virtual memory size: 528.60 MB
Working set size: 50.60 MB
Directory: slots/1
Process ID: 3199
Progress rate: 3.240% per hour
Executable: wrapper_2019_03_02_x86_64-linux


Tail from runRivet.log:

$ sudo tail -f /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).
Updating display...
Display update finished (127 histograms, 72000 events).


Slot 2 properties from Boinc Manager:

Application: Theory Native 1.01 (native_theory)
Name: TheoryN_2279-750936-155
State: Running
Received: Thu 21 Nov 2019 12:25:33 PM CST
Report deadline: Sun 01 Dec 2019 12:25:31 PM CST
Estimated computation size: 3,600 GFLOPs
CPU time: 1d 06:27:03
CPU time since checkpoint: 1d 06:27:03
Elapsed time: 1d 05:31:57
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 600.85 MB
Working set size: 63.33 MB
Directory: slots/2
Process ID: 3201
Progress rate: 3.240% per hour
Executable: wrapper_2019_03_02_x86_64-linux


Tail from runRivet.log:

$ sudo tail -f /var/lib/boinc-client/slots/2/cernvm/shared/runRivet.log
3.8625e+14 pb +- ( 1.00728e+14 pb = 26.0785 % ) 772720000 ( 772720138 -> 99.9 % )
integration time:  ( 1d 6h 4m 1s elapsed / 851d 1h 41m 25s left ) [14:51:05]   
3.8624e+14 pb +- ( 1.00725e+14 pb = 26.0785 % ) 772740000 ( 772740138 -> 99.9 % )
integration time:  ( 1d 6h 4m 5s elapsed / 851d 2h 21m 49s left ) [14:51:09]   
3.8623e+14 pb +- ( 1.00723e+14 pb = 26.0785 % ) 772760000 ( 772760138 -> 99.9 % )
integration time:  ( 1d 6h 4m 7s elapsed / 851d 2h 50m 55s left ) [14:51:11]   
3.8622e+14 pb +- ( 1.0072e+14 pb = 26.0785 % ) 772780000 ( 772780138 -> 99.9 % )
integration time:  ( 1d 6h 4m 11s elapsed / 851d 3h 32m 34s left ) [14:51:15]   
3.8621e+14 pb +- ( 1.00718e+14 pb = 26.0785 % ) 772800000 ( 772800138 -> 99.9 % )
integration time:  ( 1d 6h 4m 15s elapsed / 851d 4h 14m left ) [14:51:19]   
3.862e+14 pb +- ( 1.00715e+14 pb = 26.0785 % ) 772820000 ( 772820138 -> 99.9 % )
integration time:  ( 1d 6h 4m 18s elapsed / 851d 4h 55m 53s left ) [14:51:22]


Slot 8 properties from Boinc Manager:

Application: Theory Native 1.01 (native_theory)
Name: TheoryN_2279-750240-149
State: Running
Received: Thu 21 Nov 2019 12:25:33 PM CST
Report deadline: Sun 01 Dec 2019 12:25:32 PM CST
Estimated computation size: 3,600 GFLOPs
CPU time: 1d 06:29:32
CPU time since checkpoint: 1d 06:29:32
Elapsed time: 1d 05:34:16
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 600.85 MB
Working set size: 63.72 MB
Directory: slots/8
Process ID: 3200
Progress rate: 3.240% per hour
Executable: wrapper_2019_03_02_x86_64-linux


Tail from runRivet.log:

$ sudo tail -f /var/lib/boinc-client/slots/8/cernvm/shared/runRivet.log
4.63146e+15 pb +- ( 1.72895e+15 pb = 37.3305 % ) 773880000 ( 773880400 -> 99.9 % )
integration time:  ( 1d 6h 6m 16s elapsed / 1747d 9h 55m 55s left ) [14:53:11]   
4.63134e+15 pb +- ( 1.7289e+15 pb = 37.3305 % ) 773900000 ( 773900400 -> 99.9 % )
integration time:  ( 1d 6h 6m 20s elapsed / 1747d 11h 23m 54s left ) [14:53:15]   
4.63122e+15 pb +- ( 1.72886e+15 pb = 37.3305 % ) 773920000 ( 773920400 -> 99.9 % )
integration time:  ( 1d 6h 6m 24s elapsed / 1747d 12h 45m 50s left ) [14:53:18]   
4.63111e+15 pb +- ( 1.72881e+15 pb = 37.3305 % ) 773940000 ( 773940400 -> 99.9 % )
integration time:  ( 1d 6h 6m 27s elapsed / 1747d 14h 11m 2s left ) [14:53:22]   
4.63099e+15 pb +- ( 1.72877e+15 pb = 37.3305 % ) 773960000 ( 773960400 -> 99.9 % )
integration time:  ( 1d 6h 6m 31s elapsed / 1747d 15h 35m 17s left ) [14:53:26]   
Updating display...
Display update finished (0 histograms, 0 events).
4.63087e+15 pb +- ( 1.72872e+15 pb = 37.3305 % ) 773980000 ( 773980400 -> 99.9 % )
integration time:  ( 1d 6h 6m 34s elapsed / 1747d 17h 1s left ) [14:53:30]   
4.63075e+15 pb +- ( 1.72868e+15 pb = 37.3305 % ) 774000000 ( 774000400 -> 99.9 % )
integration time:  ( 1d 6h 6m 37s elapsed / 1747d 18h 7s left ) [14:53:32]   
4.63063e+15 pb +- ( 1.72864e+15 pb = 37.3305 % ) 774020000 ( 774020400 -> 99.9 % )
integration time:  ( 1d 6h 6m 39s elapsed / 1747d 18h 52m 6s left ) [14:53:35]


It seems that slot 1 has stalled but slots 2 and 8 want to run for a few more years. Is it OK to abort?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1520
Credit: 85,578,127
RAC: 74,678
Message 40631 - Posted: 24 Nov 2019, 15:29:54 UTC - in response to Message 40630.  

Slot 1:

Report deadline: Sun 01 Dec 2019 12:25:31 PM CST

Lots of time to finish the task and get rewarded.

CPU time: 1d 06:52:19
Elapsed time: 1d 05:28:37

Roughly equal.
This means the task is doing well.

Estimated time remaining: 00:00:00
Fraction done: 99.999%
Progress rate: 3.240% per hour

These values are unreliable.
The BOINC client has no interface to look into the scientific app.
Check the original logfiles instead.

Your logfile tail:
Display update finished (127 histograms, 72000 events).

The first line of the logfile is missing from your excerpt.
It would tell you how many events the task plans to calculate.
Most tasks plan 100000 events.
In that case 72000 events are already finished after a little more than 1 day, and there is plenty of time left before the deadline.
Suggested decision: Let the task run.



Slot 2:

integration time:  ( 1d 6h 4m 15s elapsed / 851d 4h 14m left ) [14:51:19]

More than 851d left and increasing.
Most likely a longrunner that will not finish before the deadline.


Slot 8:

integration time:  ( 1d 6h 6m 37s elapsed / 1747d 18h 7s left ) [14:53:32]

More than 1747d left and increasing.
Most likely a longrunner that will not finish before the deadline.
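The reasoning for slots 2 and 8 can be sketched in Python as follows: parse the "integration time" lines, convert both durations to seconds, and check whether the time-left estimate keeps growing and already exceeds the time until the deadline. The regular expression is based only on the log lines quoted in this thread, the function names are hypothetical, and the 7 days to the deadline is a rough assumption taken from the dates in the post above.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
LINE_RE = re.compile(
    r"integration time:\s*\(\s*(.*?)\s+elapsed\s*/\s*(.*?)\s+left\s*\)")

def to_seconds(text):
    """Convert a duration such as '851d 4h 14m' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u]
               for n, u in re.findall(r"(\d+)([dhms])", text))

def time_left_series(lines):
    """Return [(elapsed_s, left_s), ...] for every 'integration time' line."""
    series = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            series.append((to_seconds(m.group(1)), to_seconds(m.group(2))))
    return series

def is_increasing(series):
    """True if the time-left estimate never drops and ends above its start."""
    lefts = [left for _, left in series]
    return (len(lefts) >= 2
            and all(b >= a for a, b in zip(lefts, lefts[1:]))
            and lefts[-1] > lefts[0])

# Three lines from the slot 2 log quoted earlier in the thread.
SAMPLE = [
    "integration time:  ( 1d 6h 4m 1s elapsed / 851d 1h 41m 25s left ) [14:51:05]",
    "integration time:  ( 1d 6h 4m 5s elapsed / 851d 2h 21m 49s left ) [14:51:09]",
    "integration time:  ( 1d 6h 4m 7s elapsed / 851d 2h 50m 55s left ) [14:51:11]",
]
DAYS_TO_DEADLINE = 7  # assumption: roughly 24 Nov to 1 Dec

series = time_left_series(SAMPLE)
past_deadline = series[-1][1] > DAYS_TO_DEADLINE * 86400
print("increasing:", is_increasing(series), "| past deadline:", past_deadline)
```

A short window of a few lines is enough for the extreme cases shown here; for borderline tasks it would be safer to compare estimates taken hours apart.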
Henry Nebrensky

Joined: 13 Jul 05
Posts: 134
Credit: 13,817,830
RAC: 4,306
Message 40679 - Posted: 26 Nov 2019, 13:25:54 UTC - in response to Message 40631.  

From task TheoryN_2279-750936-185_2:
===> [runRivet] Mon Nov 25 17:09:09 UTC 2019 [boinc ee zhad 200 - - sherpa 2.2.5 default 2000 185
...
13.9764 pb +- ( 0.16319 pb = 1.16761 % ) 740000 ( 740000 -> 100 % )
integration time: ( 2m 1s elapsed / 1m 34s left ) [17:21:10]
13.963 pb +- ( 0.159351 pb = 1.14123 % ) 760000 ( 760000 -> 100 % )
integration time: ( 2m 5s elapsed / 1m 25s left ) [17:21:13]
1.10104e+13 pb +- ( 1.10104e+13 pb = 100 % ) 780000 ( 780000 -> 100 % )
integration time: ( 2m 8s elapsed / 20d 10h 22m 35s left ) [17:21:17]
1.07352e+13 pb +- ( 1.07352e+13 pb = 100 % ) 800000 ( 800000 -> 100 % )
integration time: ( 2m 11s elapsed / 20d 18h 19m 21s left ) [17:21:20]
... it then gently creeps up (mostly), until...
3.3477e+14 pb +- ( 9.67854e+13 pb = 28.911 % ) 367040000 ( 367040060 -> 99.9 % )
integration time: ( 17h 10m 46s elapsed / 597d 22h 52m 23s left ) [11:59:53]
7.4538e+14 pb +- ( 4.21878e+14 pb = 56.5991 % ) 367060000 ( 367060060 -> 99.9 % )
integration time: ( 17h 10m 50s elapsed / 2293d 20h 47m 8s left ) [11:59:57]

If the time remaining is increasing then that's always going to be a bad sign, but I've had long-runners where the predicted time left does gradually reduce at a realistic rate for extended periods of time. I'm wondering if there's also some diagnostic value in the sudden jumps (by factors of, rather than fractions of) the predicted time remaining?
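One way to look for the sudden jumps mentioned above is to compare each time-left estimate with the one before it and flag ratios above some threshold. The Python sketch below does this for the values quoted in the post; the threshold of 3 is an arbitrary assumption, the function names are hypothetical, and whether such jumps actually have diagnostic value remains the open question.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

def to_seconds(text):
    """Convert a duration such as '20d 10h 22m 35s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u]
               for n, u in re.findall(r"(\d+)([dhms])", text))

def find_jumps(left_seconds, factor=3.0):
    """Return (index, ratio) where an estimate is >= factor times the previous one."""
    jumps = []
    for i in range(1, len(left_seconds)):
        prev, cur = left_seconds[i - 1], left_seconds[i]
        if prev > 0 and cur / prev >= factor:
            jumps.append((i, cur / prev))
    return jumps

# Time-left values from the two consecutive runs of lines quoted in the post.
early = [to_seconds(t) for t in
         ("1m 34s", "1m 25s", "20d 10h 22m 35s", "20d 18h 19m 21s")]
late = [to_seconds(t) for t in ("597d 22h 52m 23s", "2293d 20h 47m 8s")]

for name, seq in (("early", early), ("late", late)):
    for idx, ratio in find_jumps(seq):
        print(f"{name}: estimate {idx} is {ratio:.1f}x the previous one")
```

The two lists are kept separate because the post elides the lines between them, so the last early value and the first late value are not adjacent in the log.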


©2020 CERN