Message boards :
Theory Application :
Tasks run 4 days and finish with error
Author | Message |
---|---|
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 4 |
I think the GPU apps cannot keep in memory what was happening in GPU if the calculation is interrupted or at least they could not utilize it correctly and it caused some problems. So GPU's memory is released if computation is suspended. I don't see why that would happen. If the disk is busy surely the VM just has to wait? Or is Boinc not allowing it long enough to save? One thing I have noticed, if I shut my computer down while a VM is running, Windows warns me it's not closed, waiting doesn't help. There seems to be some sort of bug in it. I wonder if the same thing happens when Boinc instructs it to close when it swaps to another project? |
Send message Joined: 28 Sep 04 Posts: 675 Credit: 43,609,995 RAC: 15,775 |
I think the GPU apps cannot keep the GPU's state in memory if the calculation is interrupted, or at least they could not use it correctly, and that caused some problems. So the GPU's memory is released if computation is suspended. When Boinc shuts down, everything has one minute to shut down; otherwise you get an error (at least from BoincTasks). Yes, the VirtualBox Interface is usually the one that doesn't shut down. I then kill it manually from Windows Task Manager. So far I haven't lost any task because of that. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 4 |
Yes, the VirtualBox Interface is usually the one that doesn't shut down. I then kill it manually from Windows Task Manager. So far I haven't lost any task because of that. When I'm shutting down or restarting the computer, I just wait several seconds and also make sure the disk light isn't flashing much, then click "shut down anyway". |
Send message Joined: 17 Oct 06 Posts: 74 Credit: 52,365,234 RAC: 36,792 |
Yes, the VirtualBox Interface is usually the one that doesn't shut down. I then kill it manually from Windows Task Manager. So far I haven't lost any task because of that. As long as no VirtualBox instances are running, you can kill this whenever you want. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 4 |
Yes, the VirtualBox Interface is usually the one that doesn't shut down. I then kill it manually from Windows Task Manager. So far I haven't lost any task because of that. I wonder why it never closes by itself. If I try to shut down Windows 10 when 1 or more VirtualBox WUs are running, VirtualBox is still left running, even if I leave it for ages. I can only assume that waiting several seconds is enough to let them save their data? If I wait any longer than that, Windows gives up waiting and brings me back to the desktop. Yes, I could tell Boinc to stop them first, but I still wouldn't know when they're done. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 4 |
Sherpa log ends in "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )" If it runs for 4 days and returns a computing error instead of being validated, has that given the project any useful data? Like "the answer isn't in the first 4 days"? |
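For anyone wanting to check such a log line automatically, here is a minimal Python sketch that reads the "integration time" line quoted above and compares elapsed plus remaining time against a limit. The function names are made up for illustration, and the line format is assumed from this one example only. Sherpa's estimate can jump around, so a single reading is a hint rather than a verdict.

```python
import re

# Seconds per unit letter used in the Sherpa progress line.
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

# Assumed line shape, taken from the single example quoted in this thread.
LINE_PATTERN = re.compile(
    r"integration time:\s*\(\s*(.+?)\s+elapsed\s*/\s*(.+?)\s+left\s*\)"
)


def duration_to_seconds(text):
    """Convert a string such as '2316d 6h 59m 20s' to seconds."""
    parts = re.findall(r"(\d+)\s*([dhms])\b", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def parse_integration_time(line):
    """Return (elapsed_seconds, left_seconds), or None if the line does not match."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None
    return duration_to_seconds(match.group(1)), duration_to_seconds(match.group(2))


def fits_within_limit(line, limit_seconds):
    """True if elapsed + estimated remaining time is within the limit."""
    parsed = parse_integration_time(line)
    if parsed is None:
        raise ValueError("not an integration time line")
    elapsed, left = parsed
    return elapsed + left <= limit_seconds


EXAMPLE = "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )"

if __name__ == "__main__":
    print(parse_integration_time(EXAMPLE))
    print(fits_within_limit(EXAMPLE, 360000))
```

With the example line, the estimate is far beyond the 360000-second limit mentioned later in the thread, so the check reports that the task would not fit.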
Send message Joined: 18 Nov 17 Posts: 120 Credit: 51,967,182 RAC: 25,120 |
If it runs for 4 days and returns a computing error instead of being validated, has that given the project any useful data? I don't think it has. That's why I turned the 100-hour limit off and let all tasks run. I did that 10 days ago. Since then I've had 1 successful task that ran for 9 days, and no longrunners. |
Send message Joined: 18 Nov 17 Posts: 120 Credit: 51,967,182 RAC: 25,120 |
Since then I've had 1 successful task that ran for 9 days, and no longrunners. I wonder whether that has given the project any uniquely useful data. If such a result is not exceptionally useful, I'd turn the 100-hour limit on again. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 4 |
If it runs for 4 days and returns a computing error instead of being validated, has that given the project any useful data? I will try the same, although one might run for 50 years! How do I change the time limit? |
Send message Joined: 18 Nov 17 Posts: 120 Credit: 51,967,182 RAC: 25,120 |
If it runs for 4 days and returns a computing error instead of being validated, has that given the project any useful data? In the projects directory there is a file called Theory_2019_11_13a.xml. Change the value in <job_duration>360000</job_duration> to suit your needs, and add a line to the options section of cc_config.xml: <dont_check_file_sizes>1</dont_check_file_sizes> |
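The two edits described above can be sketched in Python, working on XML text rather than on real files. Only `<job_duration>`, `<options>` and `<dont_check_file_sizes>` come from the post. The sample documents, the `vbox_job` root element and the function names are assumptions for illustration, and the real files contain more than is shown here. The file-size flag is presumably needed so the client accepts the locally edited file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down stand-ins for the real files.
SAMPLE_JOB_XML = """<vbox_job>
    <job_duration>360000</job_duration>
</vbox_job>"""

SAMPLE_CC_CONFIG = """<cc_config>
    <options>
    </options>
</cc_config>"""


def set_job_duration(xml_text, seconds):
    """Return the job XML with <job_duration> set to the given number of seconds."""
    root = ET.fromstring(xml_text)
    node = root.find(".//job_duration")
    if node is None:
        raise ValueError("no <job_duration> element found")
    node.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


def enable_dont_check_file_sizes(xml_text):
    """Return the cc_config XML with <dont_check_file_sizes>1</...> inside <options>."""
    root = ET.fromstring(xml_text)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(set_job_duration(SAMPLE_JOB_XML, 3600000))
    print(enable_dont_check_file_sizes(SAMPLE_CC_CONFIG))
```

To apply this to real files you would read and write them yourself, ideally with a backup copy and with the Boinc client stopped first.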
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 4 |
In the projects directory there is a file called: Theory_2019_11_13a.xml Thanks, I've multiplied it by 10, so 40 days. I'll see if any are found in that timescale. If loads go to 40 without completing, I'll put it back. If I see some managing to finish between 4 and 40, I'll leave it at 40. I think I annoyed the Boinc client - since it thinks they will take 40 days, theories now all go to high priority. Although only after they've run for an hour or so. For some reason before that time it assumes a 1h30m time (my average I assume). Also, they're now going to go past the deadline if they take that long. Unless I hear otherwise, I'll assume if I get credits for one (no matter if it passed the deadline), that the result was useful to them. |
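As a quick check of the arithmetic: the value in `<job_duration>` is in seconds, so 360000 is 100 hours, a little over 4 days, and ten times that is a little under 42 days. A small Python sketch (the helper name is made up):

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def seconds_to_days(seconds):
    """Convert a duration in seconds to fractional days."""
    return seconds / SECONDS_PER_DAY


default_limit = 360000             # value quoted in the thread
raised_limit = default_limit * 10  # "multiplied it by 10"

if __name__ == "__main__":
    for label, value in (("default", default_limit), ("raised", raised_limit)):
        hours = value / SECONDS_PER_HOUR
        print(f"{label}: {value} s = {hours:.0f} h = {seconds_to_days(value):.2f} days")
```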
Send message Joined: 17 Oct 06 Posts: 74 Credit: 52,365,234 RAC: 36,792 |
Do we think this https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5413 will fix our issues? |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 4 |
Do we think this https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5413 Sounds like it should. Although 2/3rds of the tasks I download are the old version, so it may take a while to clear the server queues. |
Send message Joined: 13 Jul 05 Posts: 167 Credit: 14,945,019 RAC: 623 |
Do we think this https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5413 will fix our issues? I don't actually know anything about the code... The problematic Sherpas that sit gobbling CPU but giving no sign of progress might well be in an endless loop, so that is hopefully a thing of the past. For the ones that report an estimated time remaining with sudden spikes up to an infeasible value, it looks to me as though something other than an "endless loop" is the problem. I suppose there are release notes on the Web somewhere if I could be bothered to look... I'm sure we'll find out soon enough now that the tasks are coming through. |
Send message Joined: 2 May 07 Posts: 2096 Credit: 159,567,835 RAC: 140,330 |
|
Send message Joined: 13 Jul 05 Posts: 167 Credit: 14,945,019 RAC: 623 |
Well, 272478919 seems to have been spewing just
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
Channel_Elements::GenerateYForward(5.9997117627854e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,5.50191}): Y out of bounds ! ymin, ymax vs. y : -10 10 vs. -10 Setting y to lower bound ymin=-10
Channel_Elements::GenerateYForward(1.1155e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,0.404185}): Y out of bounds ! ymin, ymax vs. y : -10 10 vs. -10 Setting y to lower bound ymin=-10
Channel_Elements::GenerateYForward(8.17608e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,3.02746}): Y out of bounds ! ymin, ymax vs. y : -10 10 vs. -10 Setting y to lower bound ymin=-10
Channel_Elements::GenerateYBackward(1.1046941250437e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-0.0929604}): Y out of bounds ! ymin, ymax vs. y : -10 10 vs. 10 Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(1.7451462098184e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,3.69002}): Y out of bounds ! ymin, ymax vs. y : -10 10 vs. 10 Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(0.0021676548016573,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,0.551644}): Y out of bounds ! ymin, ymax vs. y : -3.0670547162051 3.0670547162051 vs. 3.0670547162051 Setting y to upper bound ymax=3.0670547162051
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
for a day and a half now without any sign of useful progress in the log. Those usually go on for ever IME, so I've put a stop to that!
Meanwhile, 272500168 has been sitting gobbling a CPU since it announced that
Initialized the Shower_Handler.
ME_Generator_Base::SetPSMasses(): Massive PS flavours for Internal: (c,cb,b,bb,e-,e+,mu-,mu+,tau-,tau+)
ME_Generator_Base::SetPSMasses(): Massive PS flavours for Comix: (c,cb,b,bb,e-,e+,mu-,mu+,tau-,tau+)
+----------------------------------+
|                                  |
|      CCC  OOO  M   M I X   X     |
|     C    O   O MM MM I  X X      |
|     C    O   O M M M I   X       |
|     C    O   O M   M I  X X      |
|      CCC  OOO  M   M I X   X     |
|                                  |
+==================================+
|  Color dressed  Matrix Elements  |
|     http://comix.freacafe.de     |
|   please cite  JHEP12(2008)039   |
+----------------------------------+
Matrix_Element_Handler::BuildProcesses(): Looking for processes .................................................................................................................................................................................... done ( 36 MB, 23s / 21s ).
Matrix_Element_Handler::InitializeProcesses(): Performing tests .................................................................................................................................................................................... done ( 36 MB, 0s / 0s ).
Initialized the Matrix_Element_Handler for the hard processes.
Initialized the Beam_Remnant_Handler.
Hadron_Decay_Map::Read: Initializing HadronDecays.dat. This may take some time.
Initialized the Hadron_Decay_Handler, Decay model = Hadrons
Initialized the Soft_Photon_Handler.
Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__e-__veb' (Comix)
Starting the calculation at 21:19:15. Lean back and enjoy ... .
(yes, that's 21:19 yesterday since it bothered with a progress report) - so I've leant back and enjoyed killing it.
OTOH, 272136516 has chugged along slowly towards 98% done and should finish some time tonight - though whether it's actually produced anything meaningful isn't clear:
97000 events processed
dumping histograms...
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d01-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d02-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d03-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d04-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d05-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d06-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d07-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d08-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d09-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d10-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d11-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d12-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d13-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d14-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d15-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d16-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d17-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN Skipping histo with null area /CMS_2017_I1605749/d18-x01-y01[4]
97100 events processed
97200 events processed
97300 events processed... |
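The difference between the stuck log and the progressing one can be checked roughly in Python. This sketch counts "out of bounds" warnings and picks out the latest "N events processed" figure. The function name and the idea of using these two strings as a health signal are assumptions based only on the excerpts in this post, not anything the project provides.

```python
import re


def summarize_log(log_text):
    """Return the last reported event count (or None) and the number of out-of-bounds warnings."""
    events = [int(n) for n in re.findall(r"(\d+) events processed", log_text)]
    warnings = len(re.findall(r"out of bounds", log_text))
    return {
        "last_event_count": events[-1] if events else None,
        "out_of_bounds_warnings": warnings,
    }


# Short excerpts in the style of the logs quoted above.
STUCK_EXCERPT = (
    "ISR_Handler::MakeISR(..): s' out of bounds.\n"
    "ISR_Handler::MakeISR(..): s' out of bounds.\n"
)
PROGRESSING_EXCERPT = (
    "97000 events processed\n"
    "dumping histograms...\n"
    "97100 events processed\n"
)

if __name__ == "__main__":
    print(summarize_log(STUCK_EXCERPT))
    print(summarize_log(PROGRESSING_EXCERPT))
```

A log whose event count never appears or never changes between two checks, while the warning count keeps growing, would match the "no sign of useful progress" pattern described here.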
Send message Joined: 13 Jul 05 Posts: 167 Credit: 14,945,019 RAC: 623 |
OTOH, 272136516 has chugged along slowly towards 98% done and should finish some time tonight... Completed and validated: Name Theory_2378-1064671-8_0 |
Send message Joined: 18 Nov 17 Posts: 120 Credit: 51,967,182 RAC: 25,120 |
I've got one more successful beyond the limit of 100 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=272436500 Run time - 5 d 9 h 0 m 14 s. CPU time - 5 d 8 h 46 m 57 s. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 4 |
I've got one more successful beyond the limit of 100 hours: I've just set up a new machine (Windows 10), and its theories all have a 10 day limit instead of 4 (I have adjusted nothing in the config for it). Have they changed something again? 10 days conveniently matches the deadline. Note - they only change to say 10 days if they don't complete in the first few hours. |
Send message Joined: 18 Nov 17 Posts: 120 Credit: 51,967,182 RAC: 25,120 |
I've got one more successful beyond the limit of 100 hours: I hope I’ll get the successful one beyond the limit of 10 days soon :-) |
©2024 CERN