Message boards : Theory Application : Tasks run 4 days and finish with error

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1143
Credit: 7,004,374
RAC: 1,106
Message 42431 - Posted: 12 May 2020, 7:07:37 UTC - in response to Message 42421.  

I've just set up a new machine (Windows 10), and its theories all have a 10 day limit instead of 4 (I have adjusted nothing in the config for it). Have they changed something again?
Together with a new vdi file, the previously used Theory_2019_11_13a.xml was replaced by Theory_2019_10_01.xml, which has a job duration of 864000 in it.
You may change that to suit your needs, or remove that line altogether, as I mentioned in this post.
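
As a rough illustration of what that value means, the sketch below reads a job duration from a small XML excerpt and converts it to days. The layout used here (a `vbox_job` root with a `job_duration` element holding seconds) is an assumption made for the example, so check the actual Theory_2019_10_01.xml on your machine before editing anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt for illustration; the real file contains more settings.
SAMPLE_XML = """<vbox_job>
    <job_duration>864000</job_duration>
</vbox_job>"""

SECONDS_PER_DAY = 24 * 60 * 60


def job_duration_days(xml_text):
    """Return the job duration in days, or None if the element is absent."""
    root = ET.fromstring(xml_text)
    node = root.find("job_duration")
    if node is None or node.text is None:
        # The line was removed, so this file sets no duration.
        return None
    return float(node.text) / SECONDS_PER_DAY


print(job_duration_days(SAMPLE_XML))
```

If the value is indeed in seconds, 864000 works out to exactly 10 days, which matches the 10 day limit reported above.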
ID: 42431
Peter Hucker
Joined: 12 Aug 06
Posts: 294
Credit: 2,100,733
RAC: 844
Message 42439 - Posted: 12 May 2020, 20:18:32 UTC - in response to Message 42431.  

I've just set up a new machine (Windows 10), and its theories all have a 10 day limit instead of 4 (I have adjusted nothing in the config for it). Have they changed something again?
Together with a new vdi file, the previously used Theory_2019_11_13a.xml was replaced by Theory_2019_10_01.xml, which has a job duration of 864000 in it.
You may change that to suit your needs, or remove that line altogether, as I mentioned in this post.


I'll just let them run to whatever you scientists think is best.

Got my two 24-core Xeon machines running :-)

Oops, RAM shortage. ATLAS can't fit. Memory in the post....
ID: 42439
Henry Nebrensky
Joined: 13 Jul 05
Posts: 162
Credit: 14,768,010
RAC: 2
Message 42534 - Posted: 18 May 2020, 13:21:12 UTC - in response to Message 42384.  
Last modified: 18 May 2020, 13:22:01 UTC

273368209 failed the same way - "Starting the calculation" after the Comix banner, and then it hogged a CPU with no further output. This time I gave it nearly 5 days to sort itself out, but no such luck.

Meanwhile, 272500168 has been sitting gobbling a CPU since it announced that
Initialized the Shower_Handler.
ME_Generator_Base::SetPSMasses(): Massive PS flavours for Internal: (c,cb,b,bb,e-,e+,mu-,mu+,tau-,tau+)
ME_Generator_Base::SetPSMasses(): Massive PS flavours for Comix: (c,cb,b,bb,e-,e+,mu-,mu+,tau-,tau+)
+----------------------------------+
|                                  |
|      CCC  OOO  M   M I X   X     |
|     C    O   O MM MM I  X X      |
|     C    O   O M M M I   X       |
|     C    O   O M   M I  X X      |
|      CCC  OOO  M   M I X   X     |
|                                  |
+==================================+
|  Color dressed  Matrix Elements  |
|     http://comix.freacafe.de     |
|   please cite  JHEP12(2008)039   |
+----------------------------------+
Matrix_Element_Handler::BuildProcesses(): Looking for processes .................................................................................................................................................................................... done ( 36 MB, 23s / 21s ).
Matrix_Element_Handler::InitializeProcesses(): Performing tests .................................................................................................................................................................................... done ( 36 MB, 0s / 0s ).
Initialized the Matrix_Element_Handler for the hard processes.
Initialized the Beam_Remnant_Handler.
Hadron_Decay_Map::Read:   Initializing HadronDecays.dat. This may take some time.
Initialized the Hadron_Decay_Handler, Decay model = Hadrons
Initialized the Soft_Photon_Handler.
Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__e-__veb' (Comix)
Starting the calculation at 21:19:15. Lean back and enjoy ... .
(yes, that's 21:19 yesterday since it bothered with a progress report) - so I've leant back and enjoyed killing it.

ID: 42534
NOGOOD
Joined: 18 Nov 17
Posts: 118
Credit: 43,255,784
RAC: 19,941
Message 42558 - Posted: 21 May 2020, 19:15:31 UTC - in response to Message 42423.  

I hope I’ll get the successful one beyond the limit of 10 days soon :-)

I've got it !!!
After a runtime of about 15-20 days. It seems I was not babysitting it enough; I don't remember exactly.
But I can't find out whether it was successful or not. I can't find it in the list of my results on the website, I think because it was sent too long ago.
Can anyone find out whether it was successful?
ID: 42558
NOGOOD
Joined: 18 Nov 17
Posts: 118
Credit: 43,255,784
RAC: 19,941
Message 42559 - Posted: 21 May 2020, 19:25:09 UTC - in response to Message 42558.  

I hope I’ll get the successful one beyond the limit of 10 days soon :-)

I've got it !!!
After a runtime of about 15-20 days. It seems I was not babysitting it enough; I don't remember exactly.
But I can't find out whether it was successful or not. I can't find it in the list of my results on the website, I think because it was sent too long ago.
Can anyone find out whether it was successful?

And it seems I'll get one more soon. This one:
https://yadi.sk/i/cUMsy_242kw_kg
I already can't find it in the list of my results on the website, again, I think, because it was sent too long ago.
ID: 42559
NOGOOD
Joined: 18 Nov 17
Posts: 118
Credit: 43,255,784
RAC: 19,941
Message 42562 - Posted: 21 May 2020, 19:34:48 UTC

And it looks like we now have dead PYTHIA long-runners instead of SHERPA ones.
I've already got and killed several like this:
https://yadi.sk/i/vLd3aVlzoWa9AA
ID: 42562
Peter Hucker
Joined: 12 Aug 06
Posts: 294
Credit: 2,100,733
RAC: 844
Message 42563 - Posted: 21 May 2020, 20:06:18 UTC - in response to Message 42562.  
Last modified: 21 May 2020, 20:07:32 UTC

And it looks like we now have dead PYTHIA long-runners instead of SHERPA ones.
I've already got and killed several like this:
https://yadi.sk/i/vLd3aVlzoWa9AA


I'm just letting mine run until the limiter stops them. Almost all Theory tasks are finishing correctly, usually between 30 minutes and 12 hours. Very, very few hit the limiter of 4 or 10 days. Some have 10 day limits, most have 4 day limits, so I guess the scientists have set a few of them differently.

Either the new 2390 program version helped, and/or it's because I told it not to suspend them (by setting "switch between apps" to a very large number (100000)). Somebody mentioned VirtualBox apps hate being suspended.
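
A minimal sketch of how that setting could be expressed as a local override, assuming the BOINC client reads a `global_prefs_override.xml` file from its data directory and that the switch interval corresponds to the `cpu_scheduling_period_minutes` element. Both points should be verified against the BOINC documentation for your client version. The snippet only builds the XML text; saving it and telling the client to re-read its preferences is left out.

```python
import xml.etree.ElementTree as ET


def build_override(switch_minutes):
    """Build override XML text that sets the task-switch interval in minutes."""
    root = ET.Element("global_preferences")
    # Assumed element name for the "switch between" interval.
    period = ET.SubElement(root, "cpu_scheduling_period_minutes")
    period.text = str(switch_minutes)
    return ET.tostring(root, encoding="unicode")


print(build_override(100000))
```

The same value can normally be set through the computing preferences in the BOINC Manager or on the project website, which is probably the safer route.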
ID: 42563
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 1023
Credit: 47,982,405
RAC: 8,345
Message 42564 - Posted: 21 May 2020, 23:37:20 UTC

NG, you might want to update your VB version, since I think you are still running a 2019 version (VirtualBox 5.2.34, released October 15 2019) and Oracle tends to do lots of updates to fix the usual problems.


The one I use here, VirtualBox 6.1.6, isn't the newest on the list, but it works.
I have also tested lots of them with VirtualBox 6.1.8 and had no problems running Theory tasks.

https://www.virtualbox.org/wiki/Download_Old_Builds
https://www.virtualbox.org/wiki/Downloads
I have even had good luck with Sherpa and the many other event generators.
ID: 42564
Peter Hucker
Joined: 12 Aug 06
Posts: 294
Credit: 2,100,733
RAC: 844
Message 42565 - Posted: 21 May 2020, 23:49:11 UTC - in response to Message 42564.  

NG, you might want to update your VB version, since I think you are still running a 2019 version (VirtualBox 5.2.34, released October 15 2019) and Oracle tends to do lots of updates to fix the usual problems.


The one I use here, VirtualBox 6.1.6, isn't the newest on the list, but it works.
I have also tested lots of them with VirtualBox 6.1.8 and had no problems running Theory tasks.

https://www.virtualbox.org/wiki/Download_Old_Builds
https://www.virtualbox.org/wiki/Downloads
I have even had good luck with Sherpa and the many other event generators.


I have always used the latest version and never had any problems.
ID: 42565
NOGOOD
Joined: 18 Nov 17
Posts: 118
Credit: 43,255,784
RAC: 19,941
Message 42566 - Posted: 22 May 2020, 17:01:26 UTC - in response to Message 42559.  

I hope I’ll get the successful one beyond the limit of 10 days soon :-)

I've got it !!!
After a runtime of about 15-20 days. It seems I was not babysitting it enough; I don't remember exactly.
But I can't find out whether it was successful or not. I can't find it in the list of my results on the website, I think because it was sent too long ago.
Can anyone find out whether it was successful?

And it seems I'll get one more soon. This one:
https://yadi.sk/i/cUMsy_242kw_kg
I already can't find it in the list of my results on the website, again, I think, because it was sent too long ago.

This task is still running, but I see it on the error list:
https://yadi.sk/i/_XlHUn9TnxVaMg
https://yadi.sk/i/23Nd6Od0g0SuEg
It looks like we have no way to find out whether a runtime limit of 10 days is enough while the deadline is 10 days too :-(
ID: 42566
Chris Jenks
Joined: 16 Jun 06
Posts: 10
Credit: 2,257,015
RAC: 1,727
Message 42567 - Posted: 22 May 2020, 18:09:06 UTC

I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs.

But on top of that, with summer coming I thought I would look at restricting the computing times so my PC doesn't burn me out of the office while I am working. After a few cool hours I noticed that the deadlines seem to be set at exactly ten days, so that I can't suspend the computation without potentially finishing after the deadline.

I now have, for the five jobs:

Elapsed: 5d 22:33:00
Remaining: 4d 01:37:00
Deadline: 5/26/20 10:43:19 AM
It is currently 5/22/20 11:06:00 AM
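
The figures above can be checked with a little date arithmetic. This sketch takes the client's remaining-time estimate at face value and assumes the task runs without interruption from now on; the dates are read as month/day/year.

```python
from datetime import datetime, timedelta

# Values copied from the task list above.
elapsed = timedelta(days=5, hours=22, minutes=33)
remaining = timedelta(days=4, hours=1, minutes=37)
now = datetime(2020, 5, 22, 11, 6, 0)
deadline = datetime(2020, 5, 26, 10, 43, 19)

total_estimate = elapsed + remaining   # estimated total runtime
projected_finish = now + remaining     # finish time if never suspended
overshoot = projected_finish - deadline

print("Estimated total runtime:", total_estimate)
print("Projected finish:", projected_finish)
print("Past the deadline by:", overshoot)
```

Under those assumptions the total estimate is just over ten days and the projected finish falls roughly two hours after the deadline, so there is no slack left for suspending the tasks.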
ID: 42567
Peter Hucker
Joined: 12 Aug 06
Posts: 294
Credit: 2,100,733
RAC: 844
Message 42568 - Posted: 22 May 2020, 18:27:47 UTC - in response to Message 42567.  

I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs.


Sometimes they get lost, but mostly I find they continue. When I shut down or reboot, I leave it a bit if it says "Virtual Box still has open connections", probably only 10 or 20 seconds, before I click "shut down anyway". They really ought to fix that bug.

But on top of that, with summer coming I thought I would look at restricting the computing times so my PC doesn't burn me out of the office while I am working.


That's no excuse, open a window!

After a few cool hours I noticed that the deadlines seem to be set at exactly ten days, so that I can't suspend the computation without potentially finishing after the deadline.


Those estimates are wildly out. It starts by assuming it will take whatever your average time is for Theory tasks, in my case about 1.5 hours. Once it goes much over that, it decides it could take up to 4 (or sometimes 10) days. This does have the benefit of putting BOINC into panic mode so that the task will run continuously. But 95% of them finish within 12 hours.

I now have, for the five jobs:

Elapsed: 5d 22:33:00
Remaining: 4d 01:37:00
Deadline: 5/26/20 10:43:19 AM
It is currently 5/22/20 11:06:00 AM


Is that the total of all 5? What is each one at?
ID: 42568
NOGOOD
Joined: 18 Nov 17
Posts: 118
Credit: 43,255,784
RAC: 19,941
Message 42569 - Posted: 22 May 2020, 18:30:01 UTC - in response to Message 42567.  
Last modified: 22 May 2020, 18:34:56 UTC

I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs.

But on top of that, with summer coming I thought I would look at restricting the computing times so my PC doesn't burn me out of the office while I am working. After a few cool hours I noticed that the deadlines seem to be set at exactly ten days, so that I can't suspend the computation without potentially finishing after the deadline.

I now have, for the five jobs:

Elapsed: 5d 22:33:00
Remaining: 4d 01:37:00
Deadline: 5/26/20 10:43:19 AM
It is currently 5/22/20 11:06:00 AM

Yes, this is very important too.
The Theory team increased the maximum job duration to 10 days, but did not increase the deadline.
Now we have no right to pause :-))
Now there are 2 reasons to increase the deadline.
ID: 42569
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2100
Credit: 163,251,398
RAC: 118,409
Message 42570 - Posted: 22 May 2020, 18:40:48 UTC - in response to Message 42566.  

The mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d).
http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390
http://mcplots-dev.cern.ch/cache/stats/runtime-2390.txt

What would be the benefit of extending the limits beyond 10 d?
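
For scale, the conversion behind the 2.32 d figure and its relation to the 4 day and 10 day limits mentioned in this thread can be worked out as below. This is plain arithmetic on the numbers quoted here, not a reading of the mcplots data itself.

```python
MINUTES_PER_DAY = 24 * 60

# Runtime within which 99.9 % of tasks finish, as quoted above.
p999_minutes = 3345
p999_days = p999_minutes / MINUTES_PER_DAY
print(f"{p999_minutes} minutes = {p999_days:.2f} d")

# Compare against the two runtime limits seen by volunteers.
for limit_days in (4, 10):
    share = p999_days / limit_days
    print(f"{limit_days} d limit: the 99.9 % runtime uses {share:.0%} of it")
```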
ID: 42570
NOGOOD
Joined: 18 Nov 17
Posts: 118
Credit: 43,255,784
RAC: 19,941
Message 42571 - Posted: 22 May 2020, 18:47:56 UTC - in response to Message 42570.  

The mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d).
http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390
http://mcplots-dev.cern.ch/cache/stats/runtime-2390.txt

What would be the benefit of extending the limits beyond 10 d?

My only reason: I think successful tasks that run for many, many days are especially valuable for the project.
If not, I give up.
ID: 42571
NOGOOD
Joined: 18 Nov 17
Posts: 118
Credit: 43,255,784
RAC: 19,941
Message 42572 - Posted: 22 May 2020, 18:52:24 UTC - in response to Message 42571.  

The mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d).
http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390
http://mcplots-dev.cern.ch/cache/stats/runtime-2390.txt

What would be the benefit of extending the limits beyond 10 d?

My only reason: I think successful tasks that run for many, many days are especially valuable for the project.
If not, I give up.

If not, the limit of 2.32 d is the best limit.
ID: 42572
Peter Hucker
Joined: 12 Aug 06
Posts: 294
Credit: 2,100,733
RAC: 844
Message 42573 - Posted: 22 May 2020, 18:59:23 UTC - in response to Message 42570.  

The mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d).
http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390
http://mcplots-dev.cern.ch/cache/stats/runtime-2390.txt

What would be the benefit of extending the limits beyond 10 d?


The deadline needs to match the time limit. Whenever I get a 10 day time limit, it's going to exceed the deadline slightly, as it didn't start right away. But my computers are pretty much on 24/7. What about people who turn them off at night? The time limit will always be past the deadline for long running tasks.
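
To put rough numbers on that, here is a small sketch of how much wall-clock time a task needs to use up a given runtime limit on a machine that is only on part of the day. It assumes the task runs whenever the computer is on and nothing else competes for its core, which is a simplification; the function name and parameters are made up for this example.

```python
def wall_clock_days(runtime_limit_days, hours_on_per_day, start_delay_days=0.0):
    """Wall-clock days needed to accumulate the full runtime limit."""
    if not 0 < hours_on_per_day <= 24:
        raise ValueError("hours_on_per_day must be in the range (0, 24]")
    return start_delay_days + runtime_limit_days * 24 / hours_on_per_day


DEADLINE_DAYS = 10  # deadline reported in this thread

for hours in (24, 16, 8):
    needed = wall_clock_days(10, hours)
    late = needed > DEADLINE_DAYS
    print(f"{hours:2d} h/day: {needed:.1f} days needed, past deadline: {late}")
```

Even a machine that is on 24/7 only just fits a 10 day limit into a 10 day deadline, and any start delay pushes it over.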
ID: 42573
Peter Hucker
Joined: 12 Aug 06
Posts: 294
Credit: 2,100,733
RAC: 844
Message 42574 - Posted: 22 May 2020, 19:02:09 UTC - in response to Message 42572.  

The mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d).
http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390
http://mcplots-dev.cern.ch/cache/stats/runtime-2390.txt

What would be the benefit of extending the limits beyond 10 d?

My only reason: I think successful tasks that run for many, many days are especially valuable for the project.
If not, I give up.

If not, the limit of 2.32 d is the best limit.


Agreed. Perhaps 99.9% could be done with a shorter limit, then when it's discovered that some need longer, they are reissued with a huge deadline and limit. They could even be put in a separate tickbox on web preferences, "Theory long running" or something. Those of us with several computers running 24 hours a day, like myself, would be happy to leave them going for a month or so.
ID: 42574
NOGOOD
Joined: 18 Nov 17
Posts: 118
Credit: 43,255,784
RAC: 19,941
Message 42576 - Posted: 22 May 2020, 19:12:39 UTC - in response to Message 42574.  

The mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d).
http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390
http://mcplots-dev.cern.ch/cache/stats/runtime-2390.txt

What would be the benefit of extending the limits beyond 10 d?

My only reason: I think successful tasks that run for many, many days are especially valuable for the project.
If not, I give up.

If not, the limit of 2.32 d is the best limit.


Agreed. Perhaps 99.9% could be done with a shorter limit, then when it's discovered that some need longer, they are reissued with a huge deadline and limit. They could even be put in a separate tickbox on web preferences, "Theory long running" or something. Those of us with several computers running 24 hours a day, like myself, would be happy to leave them going for a month or so.

Yes, tickbox on web preferences - superb idea.
ID: 42576
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2100
Credit: 163,251,398
RAC: 118,409
Message 42577 - Posted: 22 May 2020, 19:44:52 UTC

The following comment is a personal one, not a moderator's comment.

Peter Hucker wrote:
The deadline needs to match the time limit.

What should it be based on?
Computers running 1 h/d , 3.2 h/d, 7.6 h/d, ...
Computers running Mon-Fri, Mon+Tue+Fri, Sat+Sun ...

And at what buffer size?
0.32 days, 3.44 days, 7 days ...

And Theory only, or Theory + ATLAS, or LHC alongside other projects (at what relative priority per project)?


Besides that, individual runtimes per task can't be estimated beforehand.



NOGOOD wrote:
Yes, tickbox on web preferences - superb idea.

Perfect idea.
Feel encouraged to change the BOINC code accordingly:
https://github.com/BOINC/boinc/tree/server_release/1/1.2

In addition:
What would you suggest doing with the downstream and upstream processes?
They may expect the results to be delivered within a given limit.


Really, all of that to get a bit more than 99.9%?

Might be a good idea to also check the mcplots failure rate:
http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=557509
http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=55945
All computers regularly running tasks are below 6 %.
That's quite reliable. Thank you guys.

Sorry, don't want to disappoint you.
Comments are always important to make the admins aware of server/project errors.
ID: 42577