Tasks run 4 days and finish with error
Author | Message |
---|---|
Send message Joined: 14 Jan 10 Posts: 1275 Credit: 8,480,883 RAC: 1,913 |
I've just set up a new machine (Windows 10), and its theories all have a 10 day limit instead of 4 (I have adjusted nothing in the config for it). Have they changed something again? Together with a new vdi-file, the previously used Theory_2019_11_13a.xml was replaced by Theory_2019_10_01.xml, which has a job duration of 864000 in it. You may change that to suit your needs or remove that line altogether, as I mentioned in this post. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 5 |
I've just set up a new machine (Windows 10), and its theories all have a 10 day limit instead of 4 (I have adjusted nothing in the config for it). Have they changed something again? Together with a new vdi-file, the previously used Theory_2019_11_13a.xml was replaced by Theory_2019_10_01.xml, which has a job duration of 864000 in it. I'll just let them run to whatever you scientists think is best. Got my two 24 core Xeon machines running :-) Oops, RAM shortage. ATLAS can't fit. Memory in the post.... |
Send message Joined: 13 Jul 05 Posts: 167 Credit: 14,938,551 RAC: 156 |
273368209 failed the same way - "Starting the calculation" after the Comix banner, and then it hogged a CPU with no further output. This time I gave it nearly 5 days to sort itself out, but no such luck. Meanwhile, 272500168 has been sitting gobbling a CPU since it announced that: Initialized the Shower_Handler. ME_Generator_Base::SetPSMasses(): Massive PS flavours for Internal: (c,cb,b,bb,e-,e+,mu-,mu+,tau-,tau+) ME_Generator_Base::SetPSMasses(): Massive PS flavours for Comix: (c,cb,b,bb,e-,e+,mu-,mu+,tau-,tau+) +----------------------------------+ | | | CCC OOO M M I X X | | C O O MM MM I X X | | C O O M M M I X | | C O O M M I X X | | CCC OOO M M I X X | | | +==================================+ | Color dressed Matrix Elements | | http://comix.freacafe.de | | please cite JHEP12(2008)039 | +----------------------------------+ Matrix_Element_Handler::BuildProcesses(): Looking for processes .................................................................................................................................................................................... done ( 36 MB, 23s / 21s ). Matrix_Element_Handler::InitializeProcesses(): Performing tests .................................................................................................................................................................................... done ( 36 MB, 0s / 0s ). Initialized the Matrix_Element_Handler for the hard processes. Initialized the Beam_Remnant_Handler. Hadron_Decay_Map::Read: Initializing HadronDecays.dat. This may take some time. Initialized the Hadron_Decay_Handler, Decay model = Hadrons Initialized the Soft_Photon_Handler. Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__e-__veb' (Comix) Starting the calculation at 21:19:15. Lean back and enjoy ... . (Yes, that's 21:19 yesterday since it bothered with a progress report.) So I've leant back and enjoyed killing it. |
Send message Joined: 18 Nov 17 Posts: 119 Credit: 51,862,927 RAC: 23,112 |
I hope I’ll get the successful one beyond the limit of 10 days soon :-) I've got it!!! After a runtime of about 15-20 days. It seems I was not babysitting it enough; I don't remember exactly. But I can't find out whether it was successful or not. I can't find it in the list of my results on the website, I think because it was sent too long ago. Can anyone find out whether it was successful? |
Send message Joined: 18 Nov 17 Posts: 119 Credit: 51,862,927 RAC: 23,112 |
I hope I’ll get the successful one beyond the limit of 10 days soon :-) And it seems I'll get one more soon. This one: https://yadi.sk/i/cUMsy_242kw_kg I already can't find it in the list of my results on the website. Again, I think that's because it was sent too long ago. |
Send message Joined: 18 Nov 17 Posts: 119 Credit: 51,862,927 RAC: 23,112 |
And it looks like we now have dead PYTHIA longrunners instead of SHERPA ones. I've already got and killed several like this: https://yadi.sk/i/vLd3aVlzoWa9AA |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 5 |
And it looks like we now have dead PYTHIA longrunners instead of SHERPA ones. I'm just letting mine stop when it wants to stop them. Almost all Theory tasks are finishing correctly, usually between 30 minutes and 12 hours. Very, very few hit the limiter of 4 or 10 days. Some have 10 day limits, most have 4 day limits, so I guess the scientists have set a few of them differently. Either the new 2390 program version helped, or it's because I told it not to suspend them (by setting "switch between apps" to a very large number, 100000), or both. Somebody mentioned VirtualBox apps hate being suspended. |
Send message Joined: 24 Oct 04 Posts: 1118 Credit: 49,729,010 RAC: 13,223 |
NG, you might want to update your VB version, since I think you are still running a 2019 version (VirtualBox 5.2.34, released October 15 2019) and Oracle tends to do lots of updates to fix the usual problems. The one I use here, VirtualBox 6.1.6, isn't the newest on the list, but it works. I have also tested lots of them with VirtualBox 6.1.8 and had no problems running Theory tasks. https://www.virtualbox.org/wiki/Download_Old_Builds https://www.virtualbox.org/wiki/Downloads I have even had good luck with Sherpa and the many other event generators. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 5 |
NG You might want to update your VB version since I think you are still running a 2019 version and Oracle tends to do lots of updates to fix the usual problems. (VirtualBox 5.2.34, released October 15 2019) I have always used the latest version and never had any problems. |
Send message Joined: 18 Nov 17 Posts: 119 Credit: 51,862,927 RAC: 23,112 |
I hope I’ll get the successful one beyond the limit of 10 days soon :-) This task is still running, but I see it on the error list: https://yadi.sk/i/_XlHUn9TnxVaMg https://yadi.sk/i/23Nd6Od0g0SuEg It looks like we have no way to find out whether a runtime limit of 10 days is enough while the deadline is 10 days too :-( |
Send message Joined: 16 Jun 06 Posts: 10 Credit: 3,245,056 RAC: 0 |
I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs. But on top of that, with summer coming I thought I would look at restricting the computing times so my PC doesn't burn me out of the office while I am working. After a few cool hours I noticed that the deadlines seem to be set at exactly ten days, so that I can't suspend the computation without potentially finishing after the deadline. I now have, for the five jobs: Elapsed: 5d 22:33:00, Remaining: 4d 01:37:00, Deadline: 5/26/20 10:43:19 AM. It is currently 5/22/20 11:06:00 AM. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 5 |
I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs. Sometimes they get lost, but mostly I find they continue. When I shut down or reboot, I leave it a bit if it says "Virtual Box still has open connections", probably only 10 or 20 seconds, before I click "shut down anyway". They really ought to fix that bug. But on top of that, with summer coming I thought I would look at restricting the computing times so my PC doesn't burn me out of the office while I am working. That's no excuse, open a window! After a few cool hours I noticed that the deadlines seem to be set at exactly ten days, so that I can't suspend the computation without potentially finishing after the deadline. Those estimates are wildly out. BOINC starts by assuming a task will take whatever your average time is for Theory tasks, in my case about 1.5 hours. Once the task goes much over that, it decides it could take up to 4 (or sometimes 10) days. This does have the benefit of putting BOINC into panic mode so that the task will run continuously. But 95% of them finish within 12 hours. I now have, for the five jobs: Is that the total of all 5? What is each one at? |
Send message Joined: 18 Nov 17 Posts: 119 Credit: 51,862,927 RAC: 23,112 |
I have five Theory processes running, all with an expected duration of ten days. This is a difficult commitment because I find if I reboot then the virtual machines don't recover and I lose the jobs. Yes, this is very important too. The Theory team increased the maximum duration to 10 days, but did not increase the deadline. Now we have no right to pause :-)) Now there are 2 reasons to increase the deadline. |
Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,577,711 RAC: 120,939 |
Looking at the mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d). http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390 http://mcplots-dev.cern.ch/cache/stats/runtime-2390.txt What would be the benefit of extending the limits beyond 10 d? |
Send message Joined: 18 Nov 17 Posts: 119 Credit: 51,862,927 RAC: 23,112 |
Looking at the mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d). My only reason: I think successful tasks that run for many, many days are especially valuable for the project. If not, I give up. |
Send message Joined: 18 Nov 17 Posts: 119 Credit: 51,862,927 RAC: 23,112 |
Looking at the mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d). If not, the limit of 2.32 d is the best limit. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 5 |
Looking at the mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d). The deadline needs to match the time limit. Whenever I get a 10 day time limit, it's going to exceed the deadline slightly, as it didn't start right away. But my computers are pretty much on 24/7. What about people who turn them off at night? The time limit will always be past the deadline for long-running tasks. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 5 |
Looking at the mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d). Agreed. Perhaps 99.9% could be done with a shorter limit; then, when it's discovered that some need longer, they are reissued with a huge deadline and limit. They could even be put in a separate tickbox on web preferences, "Theory long running" or something. Those of us with several computers running 24 hours a day, like myself, would be happy to leave them going for a month or so. |
Send message Joined: 18 Nov 17 Posts: 119 Credit: 51,862,927 RAC: 23,112 |
Looking at the mcplots data shows that 99.9 % of all tasks finish within 3345 minutes (2.32 d). Yes, a tickbox on web preferences - superb idea. |
Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,577,711 RAC: 120,939 |
The following comment is a personal one, not a moderator's comment. Peter Hucker wrote: The deadline needs to match the time limit. What should it be based on? Computers running 1 h/d, 3.2 h/d, 7.6 h/d, ...? Computers running Mon-Fri, Mon+Tue+Fri, Sat+Sun, ...? And at what buffer size? 0.32 days, 3.44 days, 7 days, ...? And Theory only, or Theory + ATLAS, or LHC alongside other projects (at what relative priority per project)? Besides that, individual runtimes per task can't be estimated beforehand. NOGOOD wrote: Yes, tickbox on web preferences - superb idea. Perfect idea. Feel encouraged to change the BOINC code accordingly: https://github.com/BOINC/boinc/tree/server_release/1/1.2 In addition: what would you suggest doing with the downstream and upstream processes? They may expect the results to be delivered within a given limit. Really, all of that to get a bit more than 99.9%? It might be a good idea to also check the mcplots failure rate: http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=557509 http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=55945 All computers regularly running tasks are below 6 %. That's quite reliable. Thank you guys. Sorry, I don't want to disappoint you. Comments are always important to make the admins aware of server/project errors. |
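
The figures quoted in this thread can be cross-checked with a few lines of Python. This is only an illustrative sketch, not BOINC or project code. It assumes that the 864000 in the Theory XML file is a duration in seconds and that timestamps such as 5/26/20 are month/day/year. The helper names are made up for this example.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def seconds_to_days(seconds: float) -> float:
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


def minutes_to_days(minutes: float) -> float:
    """Convert a duration in minutes to days."""
    return minutes / MINUTES_PER_DAY


def deadline_margin(now: datetime, remaining: timedelta, deadline: datetime) -> timedelta:
    """Spare time before the deadline; negative means the task would finish late."""
    projected_finish = now + remaining
    return deadline - projected_finish


# Job duration from the XML file (assumed to be in seconds).
job_limit_days = seconds_to_days(864000)

# Runtime within which 99.9 % of tasks finish, per the mcplots figure quoted above.
p999_days = minutes_to_days(3345)

# Values from the post listing Elapsed / Remaining / Deadline (assumed M/D/YY).
now = datetime(2020, 5, 22, 11, 6, 0)
deadline = datetime(2020, 5, 26, 10, 43, 19)
remaining = timedelta(days=4, hours=1, minutes=37)
margin = deadline_margin(now, remaining, deadline)

print(f"job limit: {job_limit_days:.1f} d")
print(f"99.9 % runtime: {p999_days:.2f} d")
print(f"late by: {-margin}" if margin < timedelta(0) else f"spare: {margin}")
```

Under these assumptions the job duration corresponds to 10 days and the 99.9 % runtime to about 2.32 days. The task with 4d 01:37:00 remaining would finish roughly two hours after its deadline if it really needed all of that time, even without being suspended.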
©2024 CERN