Message boards : Theory Application : Tasks run 4 days and finish with error

Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42284 - Posted: 25 Apr 2020, 21:54:24 UTC - in response to Message 42283.  

I think the GPU apps cannot keep the GPU's state in memory if the calculation is interrupted, or at least they could not use it correctly and that caused problems. So the GPU's memory is released when computation is suspended.

The VirtualBox applications have a double saving system: they checkpoint in Boinc and save the virtual machine state in VirtualBox Manager, but I don't know if or how the Boinc checkpoint transfers to VirtualBox when the calculation resumes. I know that if you have several VirtualBox machines running from a traditional hard drive and you start or stop Boinc, you can swamp the hard drive I/O, and some of the virtual machines can fail with an unrecoverable error. SSDs handle this disk I/O better.


I don't see why that would happen. If the disk is busy, surely the VM just has to wait? Or is Boinc not allowing it long enough to save?

One thing I have noticed: if I shut my computer down while a VM is running, Windows warns me it's not closed, and waiting doesn't help. There seems to be some sort of bug in it. I wonder if the same thing happens when Boinc instructs it to close when it swaps to another project?
ID: 42284
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,162,423
RAC: 15,821
Message 42285 - Posted: 25 Apr 2020, 22:25:06 UTC - in response to Message 42284.  

I think the GPU apps cannot keep the GPU's state in memory if the calculation is interrupted, or at least they could not use it correctly and that caused problems. So the GPU's memory is released when computation is suspended.

The VirtualBox applications have a double saving system: they checkpoint in Boinc and save the virtual machine state in VirtualBox Manager, but I don't know if or how the Boinc checkpoint transfers to VirtualBox when the calculation resumes. I know that if you have several VirtualBox machines running from a traditional hard drive and you start or stop Boinc, you can swamp the hard drive I/O, and some of the virtual machines can fail with an unrecoverable error. SSDs handle this disk I/O better.


I don't see why that would happen. If the disk is busy, surely the VM just has to wait? Or is Boinc not allowing it long enough to save?

One thing I have noticed: if I shut my computer down while a VM is running, Windows warns me it's not closed, and waiting doesn't help. There seems to be some sort of bug in it. I wonder if the same thing happens when Boinc instructs it to close when it swaps to another project?

When Boinc shuts down, there is a one-minute window to shut everything down; otherwise you get an error (at least from BoincTasks).

Yes, the VirtualBox Interface is usually the one that doesn't shut down. I then kill it manually from Windows Task Manager. So far I haven't lost any tasks because of that.
ID: 42285
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42288 - Posted: 25 Apr 2020, 23:01:37 UTC - in response to Message 42285.  

Yes, the VirtualBox Interface is usually the one that doesn't shut down. I then kill it manually from Windows Task Manager. So far I haven't lost any tasks because of that.


When I'm shutting down or restarting the computer, I just wait several seconds and also make sure the disk light isn't flashing much, then click "shut down anyway".
ID: 42288
CloverField
Joined: 17 Oct 06
Posts: 74
Credit: 51,524,587
RAC: 23,197
Message 42323 - Posted: 28 Apr 2020, 11:49:19 UTC - in response to Message 42285.  

Yes, the VirtualBox Interface is usually the one that doesn't shut down. I then kill it manually from Windows Task Manager. So far I haven't lost any tasks because of that.


As long as no VirtualBox instances are running, you can kill this whenever you want.
ID: 42323
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42324 - Posted: 28 Apr 2020, 11:54:58 UTC - in response to Message 42323.  
Last modified: 28 Apr 2020, 11:56:36 UTC

Yes, the VirtualBox Interface is usually the one that doesn't shut down. I then kill it manually from Windows Task Manager. So far I haven't lost any tasks because of that.


As long as no VirtualBox instances are running, you can kill this whenever you want.


I wonder why it never closes by itself. If I try to shut down Windows 10 when one or more VirtualBox WUs are running, VirtualBox is still left running, even if I leave it for ages. I can only assume that waiting several seconds is enough to let them save their data? If I wait any longer than that, Windows gives up waiting and brings me back to the desktop. Yes, I could tell Boinc to stop them first, but I still wouldn't know when they're done.
ID: 42324
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42336 - Posted: 30 Apr 2020, 11:13:18 UTC - in response to Message 42269.  

Sherpa log ends in "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )"
That's over 6 years remaining.
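
The "left" figure in a Sherpa log line like the one above can be converted to years with a little arithmetic. The following is a minimal Python sketch, assuming the line always has the form `... elapsed / ... left` with `d`, `h`, `m` and `s` components; the function names are made up for illustration and are not part of Sherpa or Boinc.

```python
import re

# Matches one duration component such as "2316d", "6h", "59m" or "20s".
_PART = re.compile(r"(\d+)\s*([dhms])")
_SECONDS_PER_UNIT = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text: str) -> int:
    """Convert a string like '2316d 6h 59m 20s' to a number of seconds."""
    return sum(int(n) * _SECONDS_PER_UNIT[u] for n, u in _PART.findall(text))


def remaining_years(log_line: str) -> float:
    """Return the 'left' estimate of an integration-time line, in years."""
    match = re.search(r"/\s*(.+?)\s*left", log_line)
    if match is None:
        raise ValueError("no '... left' estimate found in line")
    # 365.25 days per year is an approximation chosen for this sketch.
    return duration_to_seconds(match.group(1)) / (365.25 * 86400)


line = "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )"
print(f"{remaining_years(line):.1f} years remaining")
```

For the line quoted above this works out to roughly 6.3 years.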

Is there a way of me telling if it's going to be a fruitless effort? If so, can they not abort themselves?

The only way we volunteers have is to compare the job description (1st line of running.log - Show Graphics or VM Console ALT-F1) with the Failed jobs list.
If it is in the list, there has been no success so far, which however does not mean that it will never succeed.
It's up to you what to do with such a task: Abort or give it a try.


If it runs for 4 days and returns a computing error instead of being validated, has that given the project any useful data? Like "the answer isn't in the first 4 days"?
ID: 42336
NOGOOD
Joined: 18 Nov 17
Posts: 119
Credit: 51,312,439
RAC: 21,567
Message 42337 - Posted: 30 Apr 2020, 11:28:57 UTC - in response to Message 42336.  

If it runs for 4 days and returns a computing error instead of being validated, has that given the project any useful data?

I don't think it has. That's why I turned the 100-hour limit off and let all tasks run.
I did it 10 days ago. Since then I've got 1 successful task that ran for 9 days, and no longrunners.
ID: 42337
NOGOOD
Joined: 18 Nov 17
Posts: 119
Credit: 51,312,439
RAC: 21,567
Message 42338 - Posted: 30 Apr 2020, 11:35:50 UTC - in response to Message 42337.  

Since then I've got 1 successful task that ran for 9 days, and no longrunners.

It would be interesting to know whether that has given the project any uniquely useful data.
If such a result is not exceptionally useful, I'd turn the 100-hour limit on again.
ID: 42338
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42339 - Posted: 30 Apr 2020, 11:37:36 UTC - in response to Message 42337.  
Last modified: 30 Apr 2020, 11:39:16 UTC

If it runs for 4 days and returns a computing error instead of being validated, has that given the project any useful data?

I don't think it has. That's why I turned the 100-hour limit off and let all tasks run.
I did it 10 days ago. Since then I've got 1 successful task that ran for 9 days, and no longrunners.


I will try the same, although one might run for 50 years!

How do I change the time limit?
ID: 42339
NOGOOD
Joined: 18 Nov 17
Posts: 119
Credit: 51,312,439
RAC: 21,567
Message 42340 - Posted: 30 Apr 2020, 11:45:02 UTC - in response to Message 42339.  

If it runs for 4 days and returns a computing error instead of being validated, has that given the project any useful data?

I don't think it has. That's why I turned the 100-hour limit off and let all tasks run.
I did it 10 days ago. Since then I've got 1 successful task that ran for 9 days, and no longrunners.


I will try the same, although one might run for 50 years!

How do I change the time limit?

In the projects directory there is a file called: Theory_2019_11_13a.xml
Change the value in <job_duration>360000</job_duration> to your needs.

And add a line in the options part of cc_config.xml:
<dont_check_file_sizes>1</dont_check_file_sizes>
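
For anyone who would rather script the `<job_duration>` change than edit the file by hand, here is a minimal Python sketch. It assumes the tag appears exactly once and holds a plain integer number of seconds; the `sample` string is a hypothetical stand-in, since the real file contains more elements than shown here. Back up the original before trying anything like this.

```python
import re


def set_job_duration(xml_text: str, seconds: int) -> str:
    """Return xml_text with the <job_duration> value replaced by `seconds`."""
    new_text, count = re.subn(
        r"(<job_duration>)\s*\d+\s*(</job_duration>)",
        rf"\g<1>{seconds}\g<2>",
        xml_text,
    )
    if count != 1:
        raise ValueError(f"expected exactly one <job_duration> tag, found {count}")
    return new_text


# Hypothetical stand-in for the file contents; the real file has more elements.
sample = "<vbox_job>\n  <job_duration>360000</job_duration>\n</vbox_job>\n"
updated = set_job_duration(sample, 10 * 360000)
print(updated)
# 360000 s is 100 hours; ten times that is 3600000 s.
print(f"{10 * 360000 / 86400:.1f} days")
```

To apply it to the real file you would read its text, pass it through `set_job_duration`, and write the result back while the Boinc client is stopped.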
ID: 42340
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42342 - Posted: 30 Apr 2020, 12:27:29 UTC - in response to Message 42340.  
Last modified: 30 Apr 2020, 12:34:27 UTC

In the projects directory there is a file called: Theory_2019_11_13a.xml
Change the value in <job_duration>360000</job_duration> to your needs.

And add a line in the options part of cc_config.xml:
<dont_check_file_sizes>1</dont_check_file_sizes>


Thanks, I've multiplied it by 10, so about 40 days. I'll see if any finish in that timescale. If loads run to 40 days without completing, I'll put it back. If I see some managing to finish between 4 and 40 days, I'll leave it at 40.

I think I annoyed the Boinc client - since it thinks they will take 40 days, theories now all go to high priority. Although only after they've run for an hour or so. For some reason before that time it assumes a 1h30m time (my average I assume).

Also, they're now going to go past the deadline if they take that long. Unless I hear otherwise, I'll assume if I get credits for one (no matter if it passed the deadline), that the result was useful to them.
ID: 42342
CloverField
Joined: 17 Oct 06
Posts: 74
Credit: 51,524,587
RAC: 23,197
Message 42373 - Posted: 4 May 2020, 1:03:39 UTC - in response to Message 42342.  

Do we think this https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5413 will fix our issues?
ID: 42373
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42375 - Posted: 4 May 2020, 11:53:12 UTC - in response to Message 42373.  

Do we think this https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5413
will fix our issues?


Sounds like it should. Although 2/3rds of the tasks I download are the old version, so it may take a while to clear the server queues.
ID: 42375
Henry Nebrensky
Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 42376 - Posted: 4 May 2020, 15:25:39 UTC - in response to Message 42373.  

Do we think this https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5413 will fix our issues?

"This time the update addressed Sherpa event generator and particularly issues with the endless loops which should be significantly reduced or disappear now."
I don't actually know anything about the code...
The problematic Sherpas that sit gobbling CPU but giving no sign of progress might well be in an endless loop, so that is hopefully a thing of the past. Those that report an estimated time remaining with sudden spikes up to an infeasible value look to me as though something other than an "endless loop" is the problem.
I suppose there are release notes on the Web somewhere if I could be bothered to look... I'm sure we'll find out soon enough now that the tasks are coming through.
ID: 42376
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,183,262
RAC: 104,796
Message 42380 - Posted: 6 May 2020, 18:55:11 UTC

ID: 42380
Henry Nebrensky
Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 42384 - Posted: 7 May 2020, 22:19:36 UTC - in response to Message 42376.  

Well, 272478919 seems to have been spewing just
ISR_Handler::MakeISR(..): s' out of bounds.
  s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
ISR_Handler::MakeISR(..): s' out of bounds.
  s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
ISR_Handler::MakeISR(..): s' out of bounds.
  s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
ISR_Handler::MakeISR(..): s' out of bounds.
  s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
Channel_Elements::GenerateYForward(5.9997117627854e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,5.50191}):  Y out of bounds ! 
   ymin, ymax vs. y : -10 10 vs. -10
Setting y to lower bound  ymin=-10
Channel_Elements::GenerateYForward(1.1155e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,0.404185}):  Y out of bounds ! 
   ymin, ymax vs. y : -10 10 vs. -10
Setting y to lower bound  ymin=-10
Channel_Elements::GenerateYForward(8.17608e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,3.02746}):  Y out of bounds ! 
   ymin, ymax vs. y : -10 10 vs. -10
Setting y to lower bound  ymin=-10
Channel_Elements::GenerateYBackward(1.1046941250437e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-0.0929604}):  Y out of bounds ! 
   ymin, ymax vs. y : -10 10 vs. 10
Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(1.7451462098184e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,3.69002}):  Y out of bounds ! 
   ymin, ymax vs. y : -10 10 vs. 10
Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(0.0021676548016573,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,0.551644}):  Y out of bounds ! 
   ymin, ymax vs. y : -3.0670547162051 3.0670547162051 vs. 3.0670547162051
Setting y to upper bound ymax=3.0670547162051
ISR_Handler::MakeISR(..): s' out of bounds.
  s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
ISR_Handler::MakeISR(..): s' out of bounds.
  s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
for a day and a half now without any sign of useful progress in the log. Those usually go on for ever IME, so I've put a stop to that!
Meanwhile, 272500168 has been sitting gobbling a CPU since it announced that
Initialized the Shower_Handler.
ME_Generator_Base::SetPSMasses(): Massive PS flavours for Internal: (c,cb,b,bb,e-,e+,mu-,mu+,tau-,tau+)
ME_Generator_Base::SetPSMasses(): Massive PS flavours for Comix: (c,cb,b,bb,e-,e+,mu-,mu+,tau-,tau+)
+----------------------------------+
|                                  |
|      CCC  OOO  M   M I X   X     |
|     C    O   O MM MM I  X X      |
|     C    O   O M M M I   X       |
|     C    O   O M   M I  X X      |
|      CCC  OOO  M   M I X   X     |
|                                  |
+==================================+
|  Color dressed  Matrix Elements  |
|     http://comix.freacafe.de     |
|   please cite  JHEP12(2008)039   |
+----------------------------------+
Matrix_Element_Handler::BuildProcesses(): Looking for processes .................................................................................................................................................................................... done ( 36 MB, 23s / 21s ).
Matrix_Element_Handler::InitializeProcesses(): Performing tests .................................................................................................................................................................................... done ( 36 MB, 0s / 0s ).
Initialized the Matrix_Element_Handler for the hard processes.
Initialized the Beam_Remnant_Handler.
Hadron_Decay_Map::Read:   Initializing HadronDecays.dat. This may take some time.
Initialized the Hadron_Decay_Handler, Decay model = Hadrons
Initialized the Soft_Photon_Handler.
Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__e-__veb' (Comix)
Starting the calculation at 21:19:15. Lean back and enjoy ... .
(yes, that's 21:19 yesterday since it bothered with a progress report) - so I've leant back and enjoyed killing it.

OTOH, 272136516 has chugged along slowly towards 98% done and should finish some time tonight - though whether it's actually produced anything meaningful isn't clear:
97000 events processed
dumping histograms...
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d01-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d02-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d03-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d04-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d05-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d06-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d07-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d08-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d09-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d10-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d11-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d12-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d13-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d14-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d15-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d16-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d17-x01-y01[4]
Rivet.Analysis.CMS_2017_I1605749: WARN  Skipping histo with null area /CMS_2017_I1605749/d18-x01-y01[4]
97100 events processed
97200 events processed
97300 events processed
...
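
The manual check described in this post (is the log tail still reporting event counts, or just repeating the same warning?) could be roughly automated. This is only a sketch in Python: the `N events processed` pattern is taken from the log excerpt above, while the function name, the 200-line window, and the idea of reading the log as a list of lines are assumptions made for illustration.

```python
import re
from collections import Counter

# Progress lines look like "97100 events processed" in the excerpt above.
PROGRESS = re.compile(r"^\s*(\d+) events processed")


def summarize_tail(lines, tail=200):
    """Summarize the last `tail` log lines: latest event count and most-repeated line."""
    recent = [ln.rstrip() for ln in lines[-tail:] if ln.strip()]
    counts = [int(m.group(1)) for ln in recent if (m := PROGRESS.match(ln))]
    most_common = Counter(recent).most_common(1)
    return {
        "last_event_count": counts[-1] if counts else None,
        "most_repeated": most_common[0] if most_common else None,
    }


# Small samples built from lines quoted in the post.
stuck = ["ISR_Handler::MakeISR(..): s' out of bounds."] * 4
healthy = ["97100 events processed", "97200 events processed", "97300 events processed"]
print(summarize_tail(stuck))
print(summarize_tail(healthy))
```

A missing event count is not proof that a task is stuck, since the integration phase prints no event counts at all, so this can only flag candidates for a closer look.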
ID: 42384
Henry Nebrensky
Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 42389 - Posted: 8 May 2020, 12:29:13 UTC - in response to Message 42384.  

OTOH, 272136516 has chugged along slowly towards 98% done and should finish some time tonight...
Completed and validated:
Name: Theory_2378-1064671-8_0
Sent: 30 Apr 2020, 3:42:10 UTC
Report deadline: 11 May 2020, 3:42:10 UTC
Received: 8 May 2020, 1:58:29 UTC
Outcome: Success
Exit status: 0 (0x00000000)
Run time: 7 days 21 hours 21 min 28 sec
CPU time: 7 days 19 hours 15 min 29 sec
Credit: 6,326.46

ID: 42389
NOGOOD
Joined: 18 Nov 17
Posts: 119
Credit: 51,312,439
RAC: 21,567
Message 42410 - Posted: 11 May 2020, 9:55:27 UTC

I've got one more successful task beyond the 100-hour limit:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272436500

Run time - 5 d 9 h 0 m 14 s.
CPU time - 5 d 8 h 46 m 57 s.
ID: 42410
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42421 - Posted: 11 May 2020, 21:05:39 UTC - in response to Message 42410.  

I've got one more successful task beyond the 100-hour limit:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272436500

Run time - 5 d 9 h 0 m 14 s.
CPU time - 5 d 8 h 46 m 57 s.


I've just set up a new machine (Windows 10), and its theories all have a 10 day limit instead of 4 (I have adjusted nothing in the config for it). Have they changed something again? 10 days conveniently matches the deadline. Note - they only change to say 10 days if they don't complete in the first few hours.
ID: 42421
NOGOOD
Joined: 18 Nov 17
Posts: 119
Credit: 51,312,439
RAC: 21,567
Message 42423 - Posted: 11 May 2020, 21:22:15 UTC - in response to Message 42421.  

I've got one more successful task beyond the 100-hour limit:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272436500

Run time - 5 d 9 h 0 m 14 s.
CPU time - 5 d 8 h 46 m 57 s.


I've just set up a new machine (Windows 10), and its theories all have a 10 day limit instead of 4 (I have adjusted nothing in the config for it). Have they changed something again? 10 days conveniently matches the deadline. Note - they only change to say 10 days if they don't complete in the first few hours.

I hope I'll get a successful one beyond the 10-day limit soon :-)
ID: 42423


©2024 CERN