Message boards : Theory Application : Truly long long task: Theory_2743-2822627-370_0

Lem Novantotto

Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50563 - Posted: 21 Aug 2024, 22:54:13 UTC

I'm currently running a very long task: Theory_2743-2822627-370_0. It has already been running for 43 hours, and it has processed only 6000 events out of 54000. It keeps going, but slowly.
Its runRivet.log is more than 20MB by now, full of things like:

        ZAlign::ZAlign(): p_a*p_b = 16695.4 vs. 73740, rel. diff. -0.773591
        ZAlign::ZAlign(): Q = 253.013 vs. 1772.51, rel. diff. -nan
        ZAlign::ZAlign(): p_a*p_b = 1.16649e+06 vs. 1.18111e+06, rel. diff. -0.0123759
        ZAlign::ZAlign(): p_a*p_b = 1.16649e+06 vs. 1.18111e+06, rel. diff. -0.0123759
5900 events processed
        ZAlign::ZAlign(): p_a*p_b = 140377 vs. 307629, rel. diff. -0.54368
        ZAlign::ZAlign(): Q = 558.014 vs. 521.716, rel. diff. 0.0695745
        ZAlign::ZAlign(): p_a*p_b = 140377 vs. 307629, rel. diff. -0.54368
        ZAlign::ZAlign(): Q = 558.014 vs. 521.716, rel. diff. 0.0695745
        ZAlign::ZAlign(): p_a*p_b = 171226 vs. 254459, rel. diff. -0.327095


At this pace, it won't finish before the 10-day hard limit (I've read somewhere that there is a 10-day limit for Theory tasks, AFAIR):
43 h x (54000 / 6000) = 43 h x 9 = 387 h >> 240 h (10 days).
I wouldn't blame my CPU, which is decently fast, being a Ryzen 7500f@5200MHz.


So, I have got two questions:

1) A strange thing, to me, is that the task deadline is set to 29 Aug in my BOINC monitor (exactly 10 days after I received it), and to 30 Aug on my online results page (exactly 11 days). Why this discrepancy? No big deal, though...

2) What will happen to the work done after these ten/eleven days? It would be a pity to waste time and energy: the task is going, it's just long and slow. Will the processed events be sent as a partial result? If not, it would be better to discard the task now, wouldn't it?

Thanks for your time, and bye.

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1165
Credit: 53,728,528
RAC: 51,181
Message 50564 - Posted: 21 Aug 2024, 23:01:02 UTC - in response to Message 50563.  

I have had many many of the Valid ones run longer than that
Here is one


I save many of the copies just for things like this.

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50565 - Posted: 21 Aug 2024, 23:08:31 UTC - in response to Message 50564.  

Magic Quantum Mechanic wrote:
I have had many many of the Valid ones run longer than that
Here is one


I save many of the copies just for things like this.


Hi!

Thank You, but... 7 days are fine: less than 10 days. My task is going to last about 16 days, that's why I'm a bit worried.

Bye.

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1165
Credit: 53,728,528
RAC: 51,181
Message 50566 - Posted: 22 Aug 2024, 3:20:42 UTC - in response to Message 50565.  

Going by what the remaining time says is just a guess. I am using Windows, so I don't know what your OS does, but I check mine by clicking on the *Show Graphics* tab on the left; there you can look at the running log and see how far along the running task is.

That is how I can tell whether it needs an abort or not, and I have never aborted one that is actually running in that log.

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50567 - Posted: 22 Aug 2024, 7:42:49 UTC

After a bit of searching, I've found this message by computezrmle: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=646&postid=8314#8314

Long story short: if the estimated runtime, obtained from runRivet.log by calculating (time elapsed) x (total events) / (events processed), goes far beyond the 10-day deadline, the task should be aborted... which sounds like a bit of nonsense to me. In other projects it is surely possible to return a valid result after the deadline, and it is accepted. Passing the deadline only means that the task will also be reassigned, so someone else could return a result before you (which is not our case, by the way, since I don't think there's a single core three times faster than mine out there) and get the credits for it (who cares about credits? Can credits pay anything to me? No, so IDGAF...). In other projects, passing the deadline doesn't mean that the task will auto-abort or anything like that.
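For anyone who wants to redo that estimate quickly, here is a minimal Python sketch of the extrapolation. The function name and the `RUNTIME_LIMIT_HOURS` constant are made up for illustration; the numbers are the ones from the first post (43 hours, 6000 of 54000 events), and the 10-day figure is the limit as discussed in this thread, not something read from the project configuration. It assumes the event rate stays roughly constant.

```python
def projected_total_hours(elapsed_hours: float, events_done: int, events_total: int) -> float:
    """Linear extrapolation: (time elapsed) x (total events) / (events processed)."""
    if events_done <= 0:
        raise ValueError("no events processed yet, cannot extrapolate")
    return elapsed_hours * events_total / events_done


# Assumption: the 10-day limit mentioned in this thread, expressed in hours.
RUNTIME_LIMIT_HOURS = 10 * 24

# Figures from the first post: 43 hours elapsed, 6000 of 54000 events done.
total = projected_total_hours(43, 6000, 54000)
print(f"projected total: {total:.0f} h, limit: {RUNTIME_LIMIT_HOURS} h")
print("exceeds limit" if total > RUNTIME_LIMIT_HOURS else "within limit")
```

The estimate is only as good as the constant-rate assumption, so it is worth repeating it as the task progresses.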

So I renew my main question: what happens to a Theory task when it passes the deadline?

It commits suicide
or
it keeps going while reassigned, and then it'll submit a valid result at last (let's disregard the credits stuff, I'm not interested in it)?

Thanks, bye.

Harri Liljeroos
Joined: 28 Sep 04
Posts: 719
Credit: 48,157,874
RAC: 32,226
Message 50568 - Posted: 22 Aug 2024, 8:13:42 UTC

As far as I know, there are two 10-day limits for a Theory task. One is the deadline for the task, and going over it is not a problem. The one-day-longer deadline seen on the server side is a grace period: the server allows the task to be returned up to 1 day after the deadline. The more problematic 10-day limit is the maximum run time the project has set for these tasks. BOINC will abort the task at the 10-day mark because it has run too long.

So you have two options: If you are sure it won't finish in 10 days, abort it and get a new task or let it run and see what happens. I would abort it. I have had tasks that could not be finished in 10 days and I have aborted them. A bit annoying but not a big deal.

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50569 - Posted: 22 Aug 2024, 10:34:39 UTC - in response to Message 50568.  

Harri Liljeroos wrote:
As far as I know, there are two 10-day limits for a Theory task. One is the deadline for the task, and going over it is not a problem. The one-day-longer deadline seen on the server side is a grace period: the server allows the task to be returned up to 1 day after the deadline. The more problematic 10-day limit is the maximum run time the project has set for these tasks. BOINC will abort the task at the 10-day mark because it has run too long.


Thank You: now I understand the matter of the two deadlines.

Just to talk: I do not immediately understand why a project like Theory sets such a hard limit. Yes, they're trying to get a valid result for every task (I mean: the 3 activities limit for every task). However, since we are running Monte Carlo simulations, every task that has been generated could just as well not have been generated at all, or at least not exactly in the way it was. So how can it be so crucial to be quick? Next jobs aren't based on past results.

Moreover, let's dive into this case: if I don't check, 10 core-days are wasted. Then the task is reassigned: this task has no chance of ever being completed on any CPU, so another 10 core-days can be lost. And then it is reassigned once more: 1 core-month in total, blown for no reason. And it could be avoided so easily, I suppose! Are we concerned about energy saving, pollution control and yada yada? Or are they just empty words?

Bye.

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1165
Credit: 53,728,528
RAC: 51,181
Message 50570 - Posted: 25 Aug 2024, 9:24:06 UTC

Which one is it??



Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50571 - Posted: 25 Aug 2024, 10:35:32 UTC - in response to Message 50570.  

Magic Quantum Mechanic wrote:
Which one is it??




These above are two other long tasks of mine.

The superlong is still running:

$ head -n 1 /var/lib/boinc/slots/5/cernvm/shared/runRivet.log
===> [runRivet] Tue Aug 20 03:33:01 UTC 2024 [boinc pp ttbar 13000 60 - sherpa 2.1.1 default 54000 370]


$ grep -n event /var/lib/boinc/slots/5/cernvm/shared/runRivet.log  |tail -n 10
811382:17700 events processed
816079:17800 events processed
820826:17900 events processed
825193:18000 events processed
829825:18100 events processed
834811:18200 events processed
838732:18300 events processed
843865:18400 events processed
848528:18500 events processed
853243:18600 events processed

(the first number is the line number in the runRivet.log file).

So we're at 18600 out of 54000. 5 days and 6 hours of computation by now.
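The same check can be scripted instead of grepping by hand. This is a rough Python sketch, assuming the progress lines in runRivet.log look exactly like the "N events processed" lines shown above; the helper name is invented for illustration, and the elapsed time (5 days 6 hours) and total (54000) are the figures from this post.

```python
import re

# Matches progress lines such as "18600 events processed".
EVENTS_RE = re.compile(r"^\s*(\d+) events processed\s*$")


def last_events_processed(lines):
    """Return the count from the last 'N events processed' line, or None if there is none."""
    last = None
    for line in lines:
        m = EVENTS_RE.match(line)
        if m:
            last = int(m.group(1))
    return last


# A few lines in the style of the log excerpts quoted in this thread.
sample = [
    "        ZAlign::ZAlign(): Q = 253.013 vs. 1772.51, rel. diff. -nan",
    "5900 events processed",
    "        ZAlign::ZAlign(): p_a*p_b = 140377 vs. 307629, rel. diff. -0.54368",
    "18600 events processed",
]

done = last_events_processed(sample)
elapsed_hours = 5 * 24 + 6          # 5 days and 6 hours, as stated in the post
projected_hours = elapsed_hours * 54000 / done
print(f"{done} events done, projected total {projected_hours / 24:.1f} days")
```

To run it on the real file, pass an open file object instead of `sample`, for example `open("/var/lib/boinc/slots/5/cernvm/shared/runRivet.log")`; the slot number is specific to this machine and will differ elsewhere.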

$ grep Event /var/lib/boinc/slots/5/cernvm/shared/runRivet.log 
  # Event output file:
  EVENT_OUTPUT = HepMC_GenEvent[sherpa]
-----------    Event generation run with SHERPA started .......   -----------
  Event 1 ( 24s elapsed / 15d 11h 5m 35s left ) -> ETA: Wed Sep 04 15:46  
  Event 2 ( 46s elapsed / 14d 10h 38m 13s left ) -> ETA: Tue Sep 03 15:19  
  Event 3 ( 1m 25s elapsed / 17d 21h 4m 34s left ) -> ETA: Sat Sep 07 01:46  
  Event 4 ( 1m 32s elapsed / 14d 11h 2m 12s left ) -> ETA: Tue Sep 03 15:44  
  Event 5 ( 2m elapsed / 15d 1h 20m 47s left ) -> ETA: Wed Sep 04 06:03  
  Event 6 ( 2m 48s elapsed / 17d 13h 11s left ) -> ETA: Fri Sep 06 17:43  
  Event 7 ( 2m 51s elapsed / 15d 7h 57m 59s left ) -> ETA: Wed Sep 04 12:41  
  Event 8 ( 3m 32s elapsed / 16d 13h 44m 27s left ) -> ETA: Thu Sep 05 18:28  
  Event 9 ( 3m 47s elapsed / 15d 19h 54m 12s left ) -> ETA: Thu Sep 05 00:38  
  Event 10 ( 4m 34s elapsed / 17d 4h 15m 31s left ) -> ETA: Fri Sep 06 09:00  
  Event 20 ( 9m 57s elapsed / 18d 15h 35m 56s left ) -> ETA: Sat Sep 07 20:26  
  Event 30 ( 12m 38s elapsed / 15d 19h 51s left ) -> ETA: Wed Sep 04 23:54  
  Event 40 ( 17m 33s elapsed / 16d 10h 43m 56s left ) -> ETA: Thu Sep 05 15:42  
  Event 50 ( 21m 32s elapsed / 16d 3h 17m 42s left ) -> ETA: Thu Sep 05 08:20  
  Event 60 ( 28m 22s elapsed / 17d 17h 15m 34s left ) -> ETA: Fri Sep 06 22:24  
  Event 70 ( 31m 19s elapsed / 16d 18h 15m 36s left ) -> ETA: Thu Sep 05 23:27  
  Event 80 ( 35m 38s elapsed / 16d 16h 18m 19s left ) -> ETA: Thu Sep 05 21:34  
  Event 90 ( 41m 26s elapsed / 17d 5h 47m 15s left ) -> ETA: Fri Sep 06 11:09  
  Event 100 ( 48m 49s elapsed / 18d 6h 37m 12s left ) -> ETA: Sat Sep 07 12:06  
  Event 200 ( 1h 26m 46s elapsed / 16d 5h 30s left ) -> ETA: Thu Sep 05 11:08  
  Event 300 ( 2h 11m 29s elapsed / 16d 8h 16m 12s left ) -> ETA: Thu Sep 05 15:08  
  Event 400 ( 2h 59m 56s elapsed / 16d 17h 51m 10s left ) -> ETA: Fri Sep 06 01:31  
  Event 500 ( 3h 41m 54s elapsed / 16d 11h 43m 49s left ) -> ETA: Thu Sep 05 20:06  
  Event 600 ( 4h 21m 16s elapsed / 16d 3h 33m 32s left ) -> ETA: Thu Sep 05 12:35  
  Event 700 ( 4h 57m 42s elapsed / 15d 17h 48m 9s left ) -> ETA: Thu Sep 05 03:26  
  Event 800 ( 5h 49m 44s elapsed / 16d 3h 38m 9s left ) -> ETA: Thu Sep 05 14:08  
  Event 900 ( 6h 30m 54s elapsed / 16d 24m 2s left ) -> ETA: Thu Sep 05 11:35  
  Event 1000 ( 7h 19m 29s elapsed / 16d 4h 13m 10s left ) -> ETA: Thu Sep 05 16:13  
  Event 2000 ( 14h 15m 11s elapsed / 15d 10h 35m 8s left ) -> ETA: Thu Sep 05 05:31  
  Event 3000 ( 20h 41m 50s elapsed / 14d 15h 51m 17s left ) -> ETA: Wed Sep 04 17:13  
  Event 4000 ( 1d 3h 51m 31s elapsed / 14d 12h 14m left ) -> ETA: Wed Sep 04 20:46  
  Event 5000 ( 1d 10h 41m 51s elapsed / 14d 4h 2m 13s left ) -> ETA: Wed Sep 04 19:24  
  Event 6000 ( 1d 17h 18m 55s elapsed / 13d 18h 31m 25s left ) -> ETA: Wed Sep 04 16:31  
  Event 7000 ( 1d 23h 57m 34s elapsed / 13d 10h 53s left ) -> ETA: Wed Sep 04 14:39  
  Event 8000 ( 2d 6h 40m 1s elapsed / 13d 2h 20m 8s left ) -> ETA: Wed Sep 04 13:40  
  Event 9000 ( 2d 13h 48m 26s elapsed / 12d 21h 2m 10s left ) -> ETA: Wed Sep 04 15:31  
  Event 10000 ( 2d 20h 3m 17s elapsed / 12d 11h 26m 28s left ) -> ETA: Wed Sep 04 12:10  
Matrix_Element_Handler::GenerateOneEvent(): Point for '2_7__db__d__t[W+[nu_e__e+]__b]__tb[W-[s__cb]__bb]__G' exceeds maximum by 12.1457.

I think I'll get the next estimation time at Event n. 20000.

As You can see, the task itself knows about its abnormal duration. So why not take it into account server-side?
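In case it is useful, Sherpa's own estimate can be pulled out of those "Event N ( ... elapsed / ... left )" lines with a few lines of Python. This is only a sketch under the assumption that the lines always have the shape shown above; the function names are made up here and are not part of Sherpa or BOINC.

```python
import re

# Matches e.g. "Event 10000 ( 2d 20h 3m 17s elapsed / 12d 11h 26m 28s left )".
PROGRESS_RE = re.compile(r"Event (\d+) \( (.+?) elapsed / (.+?) left \)")
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_seconds(text):
    """Convert a string such as '2d 20h 3m 17s' (any subset of units) to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", text))


def parse_progress(line):
    """Return (event number, elapsed seconds, remaining seconds), or None if the line does not match."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), duration_seconds(m.group(2)), duration_seconds(m.group(3))


line = "  Event 10000 ( 2d 20h 3m 17s elapsed / 12d 11h 26m 28s left ) -> ETA: Wed Sep 04 12:10"
event, elapsed, left = parse_progress(line)
total_days = (elapsed + left) / 86400
print(f"event {event}: estimated total runtime {total_days:.1f} days")
```

Comparing `total_days` against the 10-day figure discussed in this thread gives the same warning the log already contains, just in a form a script could act on.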

Bye.

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50572 - Posted: 25 Aug 2024, 19:15:06 UTC - in response to Message 50571.  

Lem Novantotto wrote:
I think I'll get the next estimation time at Event n. 20000.

ZAlign::ZAlign(): Q = 625.776 vs. 669.534, rel. diff. -0.0653556
  Event 20000 ( 5d 14h 29m 1s elapsed / 9d 12h 37m 19s left ) -> ETA: Wed Sep 04 07:47  
  XS = 200.523 pb +- ( 1.41822 pb = 0.7 % )  
        ZAlign::ZAlign(): p_a*p_b = 20832.8 vs. 131144, rel. diff. -0.841146

Here it is. Signs of a slight speed-up, but still not enough.

Bye.

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1165
Credit: 53,728,528
RAC: 51,181
Message 50573 - Posted: 25 Aug 2024, 19:54:17 UTC

Well, it is a Sherpa, and they are known to be a problem once in a while, but the last few I have run did finish Valid.
Since it is just one of your threads running, you can just let it run and see what happens, or abort it if you don't want to continue running and watching it... I would let it run just to find out, and let it get sent back to the server finished.

So just do what you want to do.

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50575 - Posted: 25 Aug 2024, 20:33:42 UTC - in response to Message 50573.  

Magic Quantum Mechanic wrote:
So just do what you want to do.


Well... of course. :-)

I'll let her run. She's my baby, my baby task. I saw her grow! :-D
Fingers crossed.

Bye.

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50591 - Posted: 30 Aug 2024, 13:27:42 UTC - in response to Message 50575.  

News: server deadline passed, still running (39600 events processed out of 54000 total), a new workunit created and ready to be sent.

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=224931340

Let's see, hoping for the best.

Proud of my baby task!

Bye.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 251,021,116
RAC: 122,516
Message 50592 - Posted: 30 Aug 2024, 16:28:09 UTC - in response to Message 50591.  

It makes no sense to let it run since the server already marked it "Timed out - no response".
This means BOINC as well as the backend systems (mcplots in this case) handle it as lost.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1409
Credit: 9,325,730
RAC: 9,392
Message 50593 - Posted: 30 Aug 2024, 17:25:49 UTC - in response to Message 50591.  

Lem Novantotto wrote:
News: server deadline passed, still running (39600 events processed out of 54000 total), a new workunit created and ready to be sent.

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=224931340

Let's see, hoping for the best.

Since you are running the native_theory version, it seems no local job duration limit is set, unlike for the VBox Theory version.
When running on VBox, the virtual machine gets a shutdown signal after 10 days of runtime and the task ends in a computation error, because no result file is created.

Good luck!

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50594 - Posted: 30 Aug 2024, 19:29:13 UTC - in response to Message 50592.  

computezrmle wrote:
It makes no sense to let it run since the server already marked it "Timed out - no response".
This means BOINC as well as the backend systems (mcplots in this case) handle it as lost.


As for Boinc, it doesn't seem so.
https://boinc.berkeley.edu/forum_thread.php?id=13349&postid=94590#94590
And *surely* it wasn't so, years ago.


As for the LHC project, we'll see.
My main concern, now, is about what will happen if the task is downloaded and aborted twice more, before I can complete mine.

Bye.

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50595 - Posted: 30 Aug 2024, 19:30:04 UTC - in response to Message 50593.  

Crystal Pellet wrote:
Since you are running the native_theory version, it seems no local job duration limit is set like it is for the VBox Theory version.
[...]
Good luck!


Thanks, bye!

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50610 - Posted: 4 Sep 2024, 7:02:46 UTC

https://lhcathome.cern.ch/lhcathome/result.php?result_name=Theory_2743-2822627-370_0

Of course: native task, and I reported my results beyond the deadline but before the three error limit.

However... well done, babytask. You're now a star in the tasks' heaven. Farewell! :-D

Bye

Lem Novantotto
Joined: 24 May 23
Posts: 35
Credit: 1,931,173
RAC: 28,703
Message 50611 - Posted: 4 Sep 2024, 13:21:51 UTC

Now I have a bit of time. To sum it up:

at least with Theory native tasks, in spite of some advice to the contrary, there's no point in aborting tasks that will go past their deadline.
Their results, if valid, will be accepted. Neither BOINC nor this project will complain. Going past the deadline only means that the task will be reassigned. This is pretty standard to me.
Mine was reassigned too, and it's still being crunched by someone else. And, if you are interested in credits, I'm pretty sure that even this latter cruncher will get his/her credits if he/she reports a valid result.

What would have happened if the task had been reassigned, and the servers had got a valid result from someone else before I could report mine? I think I wouldn't have got any credits (again, if you care about credits) and, above all, my CPU time would have been wasted. It's up to you to judge whether your CPU is quick enough to make this risk negligible. A very short queue of tasks, all started while their deadlines are still far away, can surely help.

What would have happened if the task had been reassigned two times, and the servers had got two errors (of any kind: computation error, aborted by the user...) from these two other users, before I could report my result? This way the task would have reached the three errors limit (my "no response" temporary error, plus two other errors). And I'm not so sure of the outcome.

Bye.

Erich56
Joined: 18 Dec 15
Posts: 1783
Credit: 116,807,151
RAC: 71,728
Message 50681 - Posted: 2 Oct 2024, 8:13:49 UTC - in response to Message 50568.  

Harri Liljeroos wrote:
As far as I know, there are two 10 day limits for a Theory task. One is the deadline for the task and going over this is not a problem. The 1 day longer deadline seen on server side is a one day grace period that the server allows task to be returned 1 day over deadline. The more problematic 10 day limit is the maximum run time project has set for these tasks. Boinc will abort the task at the 10 day mark because it has run too long.

So you have two options: If you are sure it won't finish in 10 days, abort it and get a new task or let it run and see what happens. I would abort it. I have had tasks that could not be finished in 10 days and I have aborted them. A bit annoying but not a big deal.
Unfortunately, I did not read Harri's recent comment earlier, although I vaguely remembered the 10-day limit from a posting some time ago.
So within the past few days it happened here that 2 Theory tasks were aborted after exactly 10 days, while still far from finishing within this timespan:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=414360907
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414250277

Too bad, but there is not much one can do, except watching every Theory task and trying to predict whether or not it will finish within 10 days.


©2024 CERN