Truly long long task: Theory_2743-2822627-370_0
Author | Message |
---|---|
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
I'm currently running a very long task: Theory_2743-2822627-370_0. It has already been running for 43 hours, and it has processed only 6000 events out of 54000. It keeps going, but slowly. Its runRivet.log is more than 20 MB by now, full of things like:

ZAlign::ZAlign(): p_a*p_b = 16695.4 vs. 73740, rel. diff. -0.773591
ZAlign::ZAlign(): Q = 253.013 vs. 1772.51, rel. diff. -nan
ZAlign::ZAlign(): p_a*p_b = 1.16649e+06 vs. 1.18111e+06, rel. diff. -0.0123759
ZAlign::ZAlign(): p_a*p_b = 1.16649e+06 vs. 1.18111e+06, rel. diff. -0.0123759
5900 events processed
ZAlign::ZAlign(): p_a*p_b = 140377 vs. 307629, rel. diff. -0.54368
ZAlign::ZAlign(): Q = 558.014 vs. 521.716, rel. diff. 0.0695745
ZAlign::ZAlign(): p_a*p_b = 140377 vs. 307629, rel. diff. -0.54368
ZAlign::ZAlign(): Q = 558.014 vs. 521.716, rel. diff. 0.0695745
ZAlign::ZAlign(): p_a*p_b = 171226 vs. 254459, rel. diff. -0.327095

At this pace, it won't end before the 10-day hard limit (I've read somewhere that there is a 10-day limit for Theory tasks, AFAIR): 6000 events are one ninth of 54000, so the projected total is 43 x 9 = 387 hours >> 240 hours. I wouldn't blame my CPU, which is decently fast, being a Ryzen 7500f@5200MHz.

So, I have two questions:

1) A strange thing, to me, is that the task deadline is set to 29 Aug in my Boinc monitor (exactly 10 days after I received it), and to 30 Aug on my online results page (exactly 11 days). Why this discrepancy? However, no big deal...

2) What will happen to the work done, after these ten/eleven days? It would be a pity to waste time and energy: the task is progressing, it's just long and slow. Will the processed events be sent as a partial result? If not, it would be better to discard the task now, wouldn't it?

Thanks for your time, and bye. |
Joined: 24 Oct 04 Posts: 1180 Credit: 54,887,670 RAC: 2,609 |
I have had many, many Valid ones run longer than that. Here is one; I save copies of many of them just for things like this. |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
"I have had many many of the Valid ones run longer than that"

Hi! Thank you, but... 7 days are fine: less than 10 days. My task is going to last about 16 days, that's why I'm a bit worried. Bye. |
Joined: 24 Oct 04 Posts: 1180 Credit: 54,887,670 RAC: 2,609 |
Going by what the remaining time says is just a guess. I am using Windows, so I don't know what your OS does, but I check mine by clicking on the *Show Graphics* tab on the left; there you can look at the running log and see how far along the running task is. That is how I can tell whether it needs an abort or not, and I have never aborted one that is actually making progress in that log. |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
After a bit of searching, I've found this message by computezrmle: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=646&postid=8314#8314

Long story short: if the estimated runtime, obtained by looking at runRivet.log and calculating (time elapsed) x (total events) / (events processed), goes far beyond the 10-day deadline, the task should be aborted... which is a bit of nonsense to me. In other projects it is surely possible to return a valid result after the deadline, and it is accepted. Passing the deadline only means that the task will also be reassigned, so someone else could return a result before you (which is not our case, by the way, since I don't think there's a single core three times faster than mine out there) and get the credits for it (who cares about credits? Can credits pay anything to me? No, so IDGAF...). In other projects, passing the deadline doesn't mean that the task will auto-abort or anything like that.

So I renew my main question: what happens to a Theory task when it passes the deadline? Does it abort itself, or does it keep going while reassigned and then submit a valid result in the end (let's disregard the credits stuff, I'm not interested in it)?

Thanks, bye. |
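The estimate quoted above, (time elapsed) x (total events) / (events processed), can be written as a short Python sketch. This is only a linear extrapolation, so it assumes events keep taking roughly the same time. The function name and the 240-hour constant are illustrative choices taken from the numbers in this thread, not part of BOINC or of the Theory application.

```python
# Illustrative sketch: linear extrapolation of the total runtime of a task.
# Assumption: events are processed at a roughly constant rate.

RUNTIME_LIMIT_HOURS = 10 * 24  # the 10-day limit discussed in this thread


def projected_total_hours(elapsed_hours: float, events_done: int, events_total: int) -> float:
    """Return (time elapsed) x (total events) / (events processed)."""
    if events_done <= 0:
        raise ValueError("no events processed yet, nothing to extrapolate from")
    return elapsed_hours * events_total / events_done


# Numbers from the first post: 43 hours elapsed, 6000 of 54000 events done.
total = projected_total_hours(43, 6000, 54000)
print(f"projected total: {total:.0f} h, limit: {RUNTIME_LIMIT_HOURS} h")
print("over the limit" if total > RUNTIME_LIMIT_HOURS else "within the limit")
```

With the figures from the first post this gives 387 hours, well above 240.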
Joined: 28 Sep 04 Posts: 733 Credit: 49,396,952 RAC: 12,518 |
As far as I know, there are two 10-day limits for a Theory task. One is the deadline for the task, and going over this is not a problem. The one-day-later deadline seen on the server side is a grace period: the server allows a task to be returned up to 1 day past its deadline. The more problematic 10-day limit is the maximum run time the project has set for these tasks. Boinc will abort the task at the 10-day mark because it has run too long.

So you have two options: if you are sure it won't finish in 10 days, abort it and get a new task, or let it run and see what happens. I would abort it. I have had tasks that could not be finished in 10 days and I have aborted them. A bit annoying, but not a big deal. |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
"As far as I know, there are two 10 day limits for a Theory task. One is the deadline for the task and going over this is not a problem. The 1 day longer deadline seen on server side is a one day grace period that the server allows task to be returned 1 day over deadline. The more problematic 10 day limit is the maximum run time project has set for these tasks. Boinc will abort the task at the 10 day mark because it has run too long."

Thank you: now I understand the matter of the two deadlines.

Just to talk: I do not immediately understand why a project like Theory would set such a strict limit. Yes, they're trying to get a valid result for every task (I mean: the 3 activities limit for every task). However, since we are running Monte Carlo simulations, every task that has been generated could just as well not have been generated at all, or at least not exactly in the way it was. So how can it be so crucial to be quick? Later jobs aren't based on past results.

Moreover, let's dive into this case: if I don't check, 10 core-days are wasted. Then the task is reassigned: this task has no chance of ever being completed on any CPU, so another 10 core-days can be lost. And then it is reassigned once more: a total of 1 core-month, blown for no reason. And it could be avoided so easily, I suppose! Are we concerned about energy saving, pollution control and yada yada? Or are they just empty words?

Bye. |
Joined: 24 Oct 04 Posts: 1180 Credit: 54,887,670 RAC: 2,609 |
Which one is it?? |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
"Which one is it??"

These above are two other long tasks of mine. The superlong is still running:

$ head -n 1 /var/lib/boinc/slots/5/cernvm/shared/runRivet.log
===> [runRivet] Tue Aug 20 03:33:01 UTC 2024 [boinc pp ttbar 13000 60 - sherpa 2.1.1 default 54000 370]

$ grep -n event /var/lib/boinc/slots/5/cernvm/shared/runRivet.log |tail -n 10
811382:17700 events processed
816079:17800 events processed
820826:17900 events processed
825193:18000 events processed
829825:18100 events processed
834811:18200 events processed
838732:18300 events processed
843865:18400 events processed
848528:18500 events processed
853243:18600 events processed

(the first number is the line number in the runRivet.log file). So we're at 18600 out of 54000. 5 days and 6 hours of computation by now.

$ grep Event /var/lib/boinc/slots/5/cernvm/shared/runRivet.log
# Event output file: EVENT_OUTPUT = HepMC_GenEvent[sherpa]
----------- Event generation run with SHERPA started ....... -----------
Event 1 ( 24s elapsed / 15d 11h 5m 35s left ) -> ETA: Wed Sep 04 15:46
Event 2 ( 46s elapsed / 14d 10h 38m 13s left ) -> ETA: Tue Sep 03 15:19
Event 3 ( 1m 25s elapsed / 17d 21h 4m 34s left ) -> ETA: Sat Sep 07 01:46
Event 4 ( 1m 32s elapsed / 14d 11h 2m 12s left ) -> ETA: Tue Sep 03 15:44
Event 5 ( 2m elapsed / 15d 1h 20m 47s left ) -> ETA: Wed Sep 04 06:03
Event 6 ( 2m 48s elapsed / 17d 13h 11s left ) -> ETA: Fri Sep 06 17:43
Event 7 ( 2m 51s elapsed / 15d 7h 57m 59s left ) -> ETA: Wed Sep 04 12:41
Event 8 ( 3m 32s elapsed / 16d 13h 44m 27s left ) -> ETA: Thu Sep 05 18:28
Event 9 ( 3m 47s elapsed / 15d 19h 54m 12s left ) -> ETA: Thu Sep 05 00:38
Event 10 ( 4m 34s elapsed / 17d 4h 15m 31s left ) -> ETA: Fri Sep 06 09:00
Event 20 ( 9m 57s elapsed / 18d 15h 35m 56s left ) -> ETA: Sat Sep 07 20:26
Event 30 ( 12m 38s elapsed / 15d 19h 51s left ) -> ETA: Wed Sep 04 23:54
Event 40 ( 17m 33s elapsed / 16d 10h 43m 56s left ) -> ETA: Thu Sep 05 15:42
Event 50 ( 21m 32s elapsed / 16d 3h 17m 42s left ) -> ETA: Thu Sep 05 08:20
Event 60 ( 28m 22s elapsed / 17d 17h 15m 34s left ) -> ETA: Fri Sep 06 22:24
Event 70 ( 31m 19s elapsed / 16d 18h 15m 36s left ) -> ETA: Thu Sep 05 23:27
Event 80 ( 35m 38s elapsed / 16d 16h 18m 19s left ) -> ETA: Thu Sep 05 21:34
Event 90 ( 41m 26s elapsed / 17d 5h 47m 15s left ) -> ETA: Fri Sep 06 11:09
Event 100 ( 48m 49s elapsed / 18d 6h 37m 12s left ) -> ETA: Sat Sep 07 12:06
Event 200 ( 1h 26m 46s elapsed / 16d 5h 30s left ) -> ETA: Thu Sep 05 11:08
Event 300 ( 2h 11m 29s elapsed / 16d 8h 16m 12s left ) -> ETA: Thu Sep 05 15:08
Event 400 ( 2h 59m 56s elapsed / 16d 17h 51m 10s left ) -> ETA: Fri Sep 06 01:31
Event 500 ( 3h 41m 54s elapsed / 16d 11h 43m 49s left ) -> ETA: Thu Sep 05 20:06
Event 600 ( 4h 21m 16s elapsed / 16d 3h 33m 32s left ) -> ETA: Thu Sep 05 12:35
Event 700 ( 4h 57m 42s elapsed / 15d 17h 48m 9s left ) -> ETA: Thu Sep 05 03:26
Event 800 ( 5h 49m 44s elapsed / 16d 3h 38m 9s left ) -> ETA: Thu Sep 05 14:08
Event 900 ( 6h 30m 54s elapsed / 16d 24m 2s left ) -> ETA: Thu Sep 05 11:35
Event 1000 ( 7h 19m 29s elapsed / 16d 4h 13m 10s left ) -> ETA: Thu Sep 05 16:13
Event 2000 ( 14h 15m 11s elapsed / 15d 10h 35m 8s left ) -> ETA: Thu Sep 05 05:31
Event 3000 ( 20h 41m 50s elapsed / 14d 15h 51m 17s left ) -> ETA: Wed Sep 04 17:13
Event 4000 ( 1d 3h 51m 31s elapsed / 14d 12h 14m left ) -> ETA: Wed Sep 04 20:46
Event 5000 ( 1d 10h 41m 51s elapsed / 14d 4h 2m 13s left ) -> ETA: Wed Sep 04 19:24
Event 6000 ( 1d 17h 18m 55s elapsed / 13d 18h 31m 25s left ) -> ETA: Wed Sep 04 16:31
Event 7000 ( 1d 23h 57m 34s elapsed / 13d 10h 53s left ) -> ETA: Wed Sep 04 14:39
Event 8000 ( 2d 6h 40m 1s elapsed / 13d 2h 20m 8s left ) -> ETA: Wed Sep 04 13:40
Event 9000 ( 2d 13h 48m 26s elapsed / 12d 21h 2m 10s left ) -> ETA: Wed Sep 04 15:31
Event 10000 ( 2d 20h 3m 17s elapsed / 12d 11h 26m 28s left ) -> ETA: Wed Sep 04 12:10
Matrix_Element_Handler::GenerateOneEvent(): Point for '2_7__db__d__t[W+[nu_e__e+]__b]__tb[W-[s__cb]__bb]__G' exceeds maximum by 12.1457.

I think I'll get the next estimation time at Event n. 20000. As You can see, the task itself knows about its abnormal duration. So why not take it into account server-side? Bye. |
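The SHERPA progress lines in the log above already carry an elapsed time and a remaining time, so the generator's own projection can be read directly from them. The following Python sketch parses one such line; the regular expression is an assumption based only on the lines quoted in this post, and the helper names are made up for illustration.

```python
import re

# Seconds per unit, for durations such as "2d 20h 3m 17s" (units may be missing).
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

# Assumed layout, taken from the log excerpt quoted in this thread:
# "Event 10000 ( 2d 20h 3m 17s elapsed / 12d 11h 26m 28s left ) -> ETA: ..."
_PROGRESS = re.compile(r"Event (\d+) \( (.+?) elapsed / (.+?) left \)")


def to_seconds(text: str) -> int:
    """Convert a duration like '13d 10h 53s' to seconds."""
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", text))


def parse_progress(line: str):
    """Return (event_number, elapsed_seconds, left_seconds), or None if no match."""
    match = _PROGRESS.search(line)
    if match is None:
        return None
    return int(match.group(1)), to_seconds(match.group(2)), to_seconds(match.group(3))


line = "Event 10000 ( 2d 20h 3m 17s elapsed / 12d 11h 26m 28s left ) -> ETA: Wed Sep 04 12:10"
event, elapsed, left = parse_progress(line)
print(f"event {event}: projected total {(elapsed + left) / 86400:.1f} days")
```

For the Event 10000 line this gives a projected total of about 15.3 days, in line with the 16-day estimate mentioned earlier in the thread.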
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
"I think I'll get the next estimation time at Event n. 20000."

ZAlign::ZAlign(): Q = 625.776 vs. 669.534, rel. diff. -0.0653556
Event 20000 ( 5d 14h 29m 1s elapsed / 9d 12h 37m 19s left ) -> ETA: Wed Sep 04 07:47
XS = 200.523 pb +- ( 1.41822 pb = 0.7 % )
ZAlign::ZAlign(): p_a*p_b = 20832.8 vs. 131144, rel. diff. -0.841146

Here it is. Signs of a slight speed-up, but not much. Still not enough. Bye. |
Joined: 24 Oct 04 Posts: 1180 Credit: 54,887,670 RAC: 2,609 |
Well, it is a Sherpa, and they are known to be a problem once in a while, but the last few I have run did finish Valid. Since it is just one of your threads running, you can just let it run and see what happens, or abort it if you don't want to continue running and watching it. I would just let it run to find out, and let it get sent back to the server finished. So just do what you want to do. |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
"So just do what you want to do."

Well... of course. :-) I'll let her run. She's my baby, my baby task. I saw her grow! :-D Fingers crossed. Bye. |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
News: server deadline passed, still running (39600 events processed out of 54000 total), a new workunit created and ready to be sent. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=224931340 Let's see, hoping for the best. Proud of my baby task! Bye. |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 15,673 |
It makes no sense to let it run since the server already marked it "Timed out - no response". This means BOINC as well as the backend systems (mcplots in this case) handle it as lost. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 573 |
"News: server deadline passed, still running (39600 events processed out of 54000 total), a new workunit created and ready to be sent."

Since you are running the native_theory version, it seems no local job duration limit is set like it is for the VBox Theory version. When running on VBox, the virtual machine gets a shutdown signal after 10 days of runtime and the task ends in a computation error, because no result file is created. Good luck! |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
"It makes no sense to let it run since the server already marked it "Timed out - no response"."

As for Boinc, it doesn't seem so. https://boinc.berkeley.edu/forum_thread.php?id=13349&postid=94590#94590 And *surely* it wasn't so, years ago. As for the LHC project, we'll see. My main concern, now, is about what will happen if the task is downloaded and aborted twice more, before I can complete mine. Bye. |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
"Since you are running the native_theory version, it seems no local job duration limit is set like it is for the VBox Theory version."

Thanks, bye! |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
https://lhcathome.cern.ch/lhcathome/result.php?result_name=Theory_2743-2822627-370_0 Of course: native task, and I reported my results beyond the deadline but before the three error limit. However... well done, babytask. You're now a star in the tasks' heaven. Farewell! :-D Bye |
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 3,726 |
Now I have a bit of time. To sum it up: at least with Theory native tasks, in spite of some advice to the contrary, there's no point in aborting tasks that will go past their deadline. Their results, if valid, will be accepted. Neither Boinc nor this project will complain. Going past the deadline means only that the task will be reassigned. This is pretty standard to me. Mine was reassigned too, and it's still being crunched by someone else. And, if you are interested in credits, I'm pretty sure that even this latter cruncher will get his or her credits on reporting a valid result.

What would have happened if the task had been reassigned, and the servers had got a valid result from someone else before I could report mine? I think I wouldn't have got any credits (again, if you care about credits), and above all my CPU time would have been wasted. It's up to you to know whether your CPU is quick enough to make this risk negligible. A very short queue of tasks, all started while their deadlines are still far away, can surely help.

What would have happened if the task had been reassigned twice, and the servers had got two errors (of any kind: computation error, aborted by the user...) from these two other users, before I could report my result? In that case the task would have reached the three-error limit (my temporary "no response" error, plus two other errors). And I'm not so sure of the outcome.

Bye. |
Joined: 18 Dec 15 Posts: 1823 Credit: 119,024,452 RAC: 16,876 |
Harri Liljeroos wrote: "As far as I know, there are two 10 day limits for a Theory task. One is the deadline for the task and going over this is not a problem. The 1 day longer deadline seen on server side is a one day grace period that the server allows task to be returned 1 day over deadline. The more problematic 10 day limit is the maximum run time project has set for these tasks. Boinc will abort the task at the 10 day mark because it has run too long."

Unfortunately, I did not read Harri's recent comment earlier, although I vaguely remembered the 10-day limit from a posting some time ago. So within the past few days it happened here that 2 Theory tasks were aborted after exactly 10 days, although they were far from finishing within this timespan:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=414360907
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414250277

Too bad, but there is not much one can do, except watching every Theory task and trying to predict whether or not it will finish within 10 days. |
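Predicting whether a task will finish within 10 days can be partly automated by reading runRivet.log. The Python sketch below is only an illustration under two assumptions drawn from the log excerpts earlier in this thread: that the total event count is the second-to-last field inside the last pair of brackets on the first line, and that progress appears as lines of the form "N events processed". The function names are hypothetical, and the elapsed time has to be supplied by the user (for example from the BOINC Manager).

```python
import re


def read_progress(log_text: str) -> tuple[int, int]:
    """Return (events_done, events_total) from the text of a runRivet.log.

    Assumption: the first line ends with a bracketed job description whose
    second-to-last field is the total event count, e.g.
    "[boinc pp ttbar 13000 60 - sherpa 2.1.1 default 54000 370]".
    """
    lines = log_text.splitlines()
    if not lines:
        raise ValueError("empty log")
    header = re.search(r"\[([^\[\]]*)\]\s*$", lines[0])
    if header is None:
        raise ValueError("no bracketed job description on the first line")
    total = int(header.group(1).split()[-2])
    counts = re.findall(r"^(\d+) events processed", log_text, flags=re.MULTILINE)
    done = int(counts[-1]) if counts else 0
    return done, total


def will_finish(log_text: str, elapsed_hours: float, limit_hours: float = 240.0):
    """Linear extrapolation; returns None when no progress line exists yet."""
    done, total = read_progress(log_text)
    if done == 0:
        return None
    return elapsed_hours * total / done <= limit_hours


sample = (
    "===> [runRivet] Tue Aug 20 03:33:01 UTC 2024 "
    "[boinc pp ttbar 13000 60 - sherpa 2.1.1 default 54000 370]\n"
    "18500 events processed\n"
    "18600 events processed\n"
)
print(read_progress(sample))      # progress after 5 days and 6 hours
print(will_finish(sample, 126))   # 126 h = 5 days and 6 hours
```

On a real machine the text could be read from a path such as /var/lib/boinc/slots/5/cernvm/shared/runRivet.log, as quoted earlier in the thread; the slot number differs per task.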