Message boards : Theory Application : New Version v300.05
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,433,416
RAC: 3,056
Message 41360 - Posted: 25 Jan 2020, 13:58:40 UTC - in response to Message 41356.  

here the next one:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=259667718

after 4 days 4 hours !!! Quite some waste of CPU time :-(

What the hell is this caused by?

There is a job limit of 100 hours, which equals 360000 seconds.
From your result:
2020-01-25 11:32:28 (1004): Status Report: Job Duration: '360000.000000'
2020-01-25 11:32:28 (1004): Status Report: Elapsed Time: '358080.000000'
2020-01-25 11:32:28 (1004): Status Report: CPU Time: '356907.000000'
2020-01-25 12:04:30 (1004): Powering off VM.

The elapsed time of 358080 seconds plus the 1922 seconds between the last status report and the power-off (12:04:30 - 11:32:28) makes 360002 seconds, which exceeds the limit and kills the job :-(
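
The arithmetic above can be checked with a short Python sketch. The timestamps and durations are copied from the quoted log; the variable names are only illustrative, and the idea that the total elapsed time is compared against the job duration follows the explanation in this post, not the wrapper's source code.

```python
from datetime import datetime

JOB_DURATION_S = 360_000  # 100 hours, from "Job Duration" in the log

# Values copied from the status report quoted above.
elapsed_at_report_s = 358_080
report_time = datetime(2020, 1, 25, 11, 32, 28)
power_off_time = datetime(2020, 1, 25, 12, 4, 30)

# Seconds that passed between the last status report and the power-off.
extra_s = int((power_off_time - report_time).total_seconds())
total_elapsed_s = elapsed_at_report_s + extra_s

print(extra_s)                           # 1922
print(total_elapsed_s)                   # 360002
print(total_elapsed_s > JOB_DURATION_S)  # True: the limit is exceeded
```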
ID: 41360
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,486,499
RAC: 104,424
Message 41362 - Posted: 25 Jan 2020, 15:18:28 UTC - in response to Message 41359.  
Last modified: 25 Jan 2020, 15:18:50 UTC

It looks like you get twice the setting in your preferences for Max # CPUs, so I suppose you have set there 1.
It's the same old and odd behaviour for that preference setting. Try 'No limit' when you are only running VBox Theory and no ATLAS. You will get 16 tasks then.
Many thanks, CP, your suggestion worked fine.
For some reason, this odd behaviour did not show up here until recently. So now it's good to know how to work around it :-)
ID: 41362
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 41370 - Posted: 27 Jan 2020, 9:24:00 UTC - in response to Message 41313.  

Following resume from hibernation over the weekend, this long-runner briefly continued on to something over 57,000 events, and then it reset itself and started again from zero:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=259725584

It's possible Theory tasks don't survive hibernation over a weekend. However, I also caught it last week throwing errors/warnings:

PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.10978) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.34534) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 6.61948) for g to b
The decay Xi(1690)- -> Sigma- Kbar0 2.10871 500 is too inefficient for the particle 816 Xi(1690)- 13312 [601]

0.935 2.078 25 .560 25.718 5 «bs 9
vetoing the decay
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.05218) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 5.63208) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 3.54622) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.06896) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.98784) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.04204) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.83883) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.85025) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 12.6764) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 3.83015) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.55048) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.53167) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.04879) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.41224) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.92092) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.52194) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.07241) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 4.16827) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.85123) for g to b
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.09399) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.30685) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.31057) for g to bbar
PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.38341) for g to b
An event exception of type ThePEG::Exception occurred while generating event number 28880:
Remnant extraction failed in ShowerHandler::cascade() from primary interaction
The event will be discarded.
28900 events processed
29000 events processed
dumping histograms...
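
A reset like the one described above (the counter starting again from zero) could be spotted by scanning a saved copy of the log for the "events processed" lines. The following is only a sketch: the line format is taken from the excerpt above, the function names are made up for illustration, and the last sample line is hypothetical.

```python
import re

# Matches lines such as "28900 events processed" from the excerpt above.
EVENT_LINE = re.compile(r"^\s*(\d+) events processed\s*$")


def event_counts(log_lines):
    """Return the event counters found in the log, in order of appearance."""
    counts = []
    for line in log_lines:
        match = EVENT_LINE.match(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def find_resets(counts):
    """Return the positions where the counter is lower than the one before."""
    return [i for i in range(1, len(counts)) if counts[i] < counts[i - 1]]


sample = [
    "28900 events processed",
    "29000 events processed",
    "dumping histograms...",
    "100 events processed",  # hypothetical line: counter after a restart
]
counts = event_counts(sample)
print(counts)               # [28900, 29000, 100]
print(find_resets(counts))  # [2]
```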
ID: 41370
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,486,499
RAC: 104,424
Message 41371 - Posted: 27 Jan 2020, 10:18:14 UTC - in response to Message 41370.  

It's possible Theory tasks don't survive hibernation over a weekend.
Most probably so. ATLAS tasks are even more susceptible to lengthy interruptions.
In general, experience has shown that VM tasks should not be stopped for too long.
ID: 41371
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,433,416
RAC: 3,056
Message 41376 - Posted: 27 Jan 2020, 12:47:30 UTC - in response to Message 41370.  

It's possible Theory tasks don't survive hibernation over a weekend.
Surviving a longer pause (even a weekend) is possible.

When you stop BOINC or shut down the computer, you have to suspend the VBox tasks first.

In your local preferences, the setting "Leave non-GPU tasks in memory while suspended" (LAIM) should not be ticked.
After a VBox task is suspended, the state of the virtual machine is saved to disk. You can watch the VM states in Oracle VM VirtualBox Manager.
When you have several VMs running, don't suspend them all at once, but one by one.
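
Suspending the tasks one by one could be scripted with the boinccmd command-line tool, for example like this. It is only a sketch under assumptions: the project URL is taken from the result links in this thread, the task names are hypothetical (the real ones can be listed with boinccmd --get_tasks), and the 60-second pause between tasks is an arbitrary guess at how long a VM needs to save its state.

```python
import subprocess
import time

# Assumption: project URL as it appears in the result links in this thread.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def suspend_command(task_name, project_url=PROJECT_URL):
    """Build the boinccmd call that suspends a single task."""
    return ["boinccmd", "--task", project_url, task_name, "suspend"]


def suspend_one_by_one(task_names, pause_s=60, runner=subprocess.run, sleep=time.sleep):
    """Suspend tasks sequentially, pausing between them (pause length is a guess)."""
    for index, name in enumerate(task_names):
        runner(suspend_command(name), check=True)
        if index < len(task_names) - 1:
            sleep(pause_s)


# Dry run with hypothetical task names: print the commands instead of running them.
example_tasks = ["Theory_example_task_0", "Theory_example_task_1"]
suspend_one_by_one(
    example_tasks,
    pause_s=0,
    runner=lambda cmd, check: print(" ".join(cmd)),
    sleep=lambda seconds: None,
)
```

Calling suspend_one_by_one(task_names) without the runner and sleep arguments would actually invoke boinccmd, so the dry run is the safer way to try it out first.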
ID: 41376
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,168,451
RAC: 16,096
Message 41377 - Posted: 27 Jan 2020, 12:47:47 UTC - in response to Message 41371.  

In general I agree with you, but sometimes paused Theory tasks survive. Here is an example of a task that continued after an 18-hour pause: https://lhcathome.cern.ch/lhcathome/result.php?resultid=259747616
I don't know whether it ended prematurely, though. It ran for about three hours after it resumed and finished OK.
ID: 41377
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,433,416
RAC: 3,056
Message 41378 - Posted: 27 Jan 2020, 13:37:39 UTC - in response to Message 41377.  

@Harri: Your task paused but it was kept in memory during the pause.
ID: 41378
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,168,451
RAC: 16,096
Message 41380 - Posted: 27 Jan 2020, 13:53:17 UTC - in response to Message 41378.  

@Harri: Your task paused but it was kept in memory during the pause.

OK. That is good to know. With previous versions, I remember that tasks used to fail on pause because the communication with the server was lost and the server could not cope with that.
ID: 41380
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 41412 - Posted: 28 Jan 2020, 9:18:07 UTC - in response to Message 41376.  

I've now confirmed Theory doesn't survive an overnight hibernation either, even with Leave non-GPU tasks in memory while suspended not selected (I've never had this selected). So that explains the tasks that never complete, but then get completed in a fraction of the time by another host. A task that doesn't complete by the time the host is put into hibernation will restart the following morning, and if it can complete by the end of the working day it should do so. But if it can't complete by the end of the working day, it will just run and run, never completing. I've not yet tried suspending VM tasks before hibernating, but I've never had to do that with Theory tasks in the past.
ID: 41412
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,486,499
RAC: 104,424
Message 41428 - Posted: 30 Jan 2020, 11:51:42 UTC - in response to Message 41360.  

here the next one:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=259667718

after 4 days 4 hours !!! Quite some waste of CPU time :-(

What the hell is this caused by?

There is a job limit of 100 hours, which equals 360000 seconds.
From your result:
2020-01-25 11:32:28 (1004): Status Report: Job Duration: '360000.000000'
2020-01-25 11:32:28 (1004): Status Report: Elapsed Time: '358080.000000'
2020-01-25 11:32:28 (1004): Status Report: CPU Time: '356907.000000'
2020-01-25 12:04:30 (1004): Powering off VM.

The elapsed time of 358080 seconds plus the 1922 seconds between the last status report and the power-off (12:04:30 - 11:32:28) makes 360002 seconds, which exceeds the limit and kills the job :-(

Here is the next one with a "file x-fer error", again after 4 days and 4 hours, exactly like the task cited above.

What can be done to avoid such waste? By now I am pretty annoyed that such faulty tasks keep coming up :-(
ID: 41428
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,433,416
RAC: 3,056
Message 41429 - Posted: 30 Jan 2020, 12:29:19 UTC - in response to Message 41428.  

Here is the next one with a "file x-fer error", again after 4 days and 4 hours, exactly like the task cited above.

What can be done to avoid such waste? By now I am pretty annoyed that such faulty tasks keep coming up :-(
The duration limit exists to keep a faulty task from running endlessly.
In the past, the task was killed gracefully and you got credit for it, although the job inside the VM did not finish.
Since the VM now runs only one job instead of a sequence of jobs, the former task duration limit was extended from 18 hours to 100 hours to give long runners a chance to finish.
100 hours is obviously not enough for some jobs, mostly Sherpa ones.

If you want to let a job run longer (without limit) because you believe it will finish someday, you have to adjust two files:
In the options section of cc_config.xml, add the line <dont_check_file_sizes>1</dont_check_file_sizes>, and
delete the line <job_duration>360000</job_duration> from Theory_2019_11_13a.xml.

Disadvantage: You have to abort a seemingly faulty task yourself.
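
The two file changes described above could also be made with a small Python script instead of a text editor. This is only a sketch: the function names are made up, the sample file contents are hypothetical stand-ins (the real files contain more settings), and reading from and writing back to the actual files is left out on purpose, so make a backup before adapting it.

```python
import xml.etree.ElementTree as ET


def enable_dont_check_file_sizes(cc_config_xml):
    """Return cc_config.xml text with <dont_check_file_sizes>1 inside <options>."""
    root = ET.fromstring(cc_config_xml)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


def remove_job_duration(job_xml):
    """Return the job file text with every <job_duration> element removed."""
    root = ET.fromstring(job_xml)
    # Snapshot the elements first so removal does not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "job_duration":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal file contents, for illustration only.
cc_config = "<cc_config><options></options></cc_config>"
job_file = (
    "<vbox_job>"
    "<job_duration>360000</job_duration>"
    "<other_setting>kept</other_setting>"
    "</vbox_job>"
)

print(enable_dont_check_file_sizes(cc_config))
print(remove_job_duration(job_file))
```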
ID: 41429
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,486,499
RAC: 104,424
Message 41430 - Posted: 30 Jan 2020, 13:02:42 UTC - in response to Message 41429.  

Thank you, C.P., for your explanations.

I have one question left: is there a way to recognize such a faulty task early, so that I can abort it in time and don't have to wait that long?
Unfortunately, I neglected to look into the VM console. Would I have seen signs there that the task was not running okay?
ID: 41430
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,433,416
RAC: 3,056
Message 41431 - Posted: 30 Jan 2020, 15:12:36 UTC - in response to Message 41430.  

I have one question left: is there a way to recognize such a faulty task early, so that I can abort it in time and don't have to wait that long?
Unfortunately, I neglected to look into the VM console. Would I have seen signs there that the task was not running okay?
It's more the opposite. In the consoles (more specifically Alt-F2) you can see whether a job is running fine: still making progress and, with Sherpa, maybe an ETA. Alt-F3 shows CPU usage. Alt-F1 shows the maximum number of events, mostly 100000 for Sherpa, sometimes lower.
When a Sherpa job is in the final event-processing phase after optimization and integration, it's likely that the job will finish someday, although I have seen jobs suddenly get stuck in a loop without new event-processing messages.
You could also open localhost:portnumber in a web browser to display the whole running log that is shown in the F2 console. It is accessible via BOINC Manager -> Show graphics on a running task.
ID: 41431
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,486,499
RAC: 104,424
Message 41436 - Posted: 31 Jan 2020, 12:30:45 UTC - in response to Message 41431.  

Thank you, C.P., for your thorough explanations.

On one of my PCs, I unfortunately had another such case, where the task failed after 4 hours 4 minutes:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260540145

Shortly before, I looked at the console and saw that everything was running well (more than 80.000 events processed, as seen on console 2), but then I got a phone call and had to leave for about 2 hours. When I came back, I noticed that the task had errored out :-( Really annoying after a processing time of more than 4 days.

What I am wondering is why such longrunners are created at all, if on the other hand it's clear that a task stops after 4 hours 4 minutes.
ID: 41436
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,433,416
RAC: 3,056
Message 41437 - Posted: 31 Jan 2020, 17:04:49 UTC - in response to Message 41436.  

On one of my PCs, I unfortunately had another such case, where the task failed after 4 hours 4 minutes:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260540145
Yeah, a real shame. (You meant 4 days 4 hours.)
Interesting that it was a pythia8 ===> [runRivet] Mon Jan 27 07:16:01 UTC 2020 [boinc pp jets 8000 25 - pythia8 8.235 early 100000 18]
But you now know how to let them run unlimited.

What I am wondering is why such longrunners are created at all, if on the other hand it's clear that a task stops after 4 hours 4 minutes.
It's not clear in advance how long a job will run.
ID: 41437
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,486,499
RAC: 104,424
Message 41438 - Posted: 31 Jan 2020, 18:26:12 UTC - in response to Message 41437.  

But you now know how to let them run unlimited. ...
It's not clear in advance how long a job will run.
I'll definitely make the changes/adaptations you suggested ASAP. Thanks again for that!
ID: 41438
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,486,499
RAC: 104,424
Message 41477 - Posted: 5 Feb 2020, 15:03:24 UTC - in response to Message 41431.  

Crystal Pellet wrote:
It's more the opposite. In the consoles (more specifically Alt-F2) you can see whether a job is running fine: still making progress and, with Sherpa, maybe an ETA. Alt-F3 shows CPU usage. Alt-F1 shows the maximum number of events, mostly 100000 for Sherpa, sometimes lower.
When a Sherpa job is in the final event-processing phase after optimization and integration, it's likely that the job will finish someday, although I have seen jobs suddenly get stuck in a loop without new event-processing messages.
You could also open localhost:portnumber in a web browser to display the whole running log that is shown in the F2 console. It is accessible via BOINC Manager -> Show graphics on a running task.
Among others, there are currently two Sherpa longrunners on one of my machines: one has run for 7 days 6 hours, the other for 4 days and 13 hours.

In the console, both are showing the same:

F1 shows 9 lines with some "cranky" info

F2 shows the screen with the "COMIX" banner:
+----------------------------------+
|                                  |
|      CCC  OOO  M   M I X   X     |
|     C    O   O MM MM I  X X      |
|     C    O   O M M M I   X       |
|     C    O   O M   M I  X X      |
|      CCC  OOO  M   M I X   X     |
|                                  |
+==================================+
|  Color dressed  Matrix Elements  |
|     http://comix.freacafe.de     |
|   please cite  JHEP12(2008)039   |
+----------------------------------+
Matrix_Element_Handler::BuildProcesses(): Looking for processes .................................................................................................................................................................................... done ( 47 MB, 31s / 31s ).
Matrix_Element_Handler::InitializeProcesses(): Performing tests .................................................................................................................................................................................... done ( 47 MB, 0s / 0s ).
Initialized the Matrix_Element_Handler for the hard processes.
Initialized the Beam_Remnant_Handler.
Hadron_Decay_Map::Read: Initializing HadronDecays.dat. This may take some time.
Initialized the Hadron_Decay_Handler, Decay model = Hadrons
Initialized the Soft_Photon_Handler.
Variations::InitialiseParametersVector(0 variations){
Named variations:
}
Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__e-__veb' (Comix)
Starting the calculation at 08:54:50. Lean back and enjoy ... .


and F3 shows information about CPU and memory usage, with the CPU at 99% (which the Windows Task Manager also shows).

None of the three screens shows how many events have been processed (as F1 would normally show, at least with Pythia), and F1 does NOT show the max. number of events.

My question now is: am I in an endless loop, and should I therefore cancel the tasks?
ID: 41477
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,044,027
RAC: 136,850
Message 41478 - Posted: 5 Feb 2020, 15:28:03 UTC - in response to Message 41477.  

Windows Task Manager is not a good helper in this case, as it doesn't show which process inside the VM is using the CPU.
That is exactly what the top output on console ALT-F3 shows.

ALT-F1 should show the first line of the task's running.log, which tells you which subtask you got and how many events it will calculate.
The same typical output can also be found in stderr.txt in your slots folder.
Example:
... cranky: [INFO] ===> [runRivet] Wed Feb  5 12:14:36 UTC 2020 [boinc pp zinclusive 7000 -,-,50,130 - madgraph5amc 2.4.3.atlas lo2jet 100000 22]


ALT-F2 shows the last lines of running.log from inside the VM.
Unfortunately, scrolling is not possible.
Most scientific apps, e.g. Pythia, show the event progress there.
Sherpa prints much more output during its different calculation phases.
Hence its output is tricky to interpret.
To view the complete running.log, select the task in BOINC Manager and click "Show graphics", which opens a browser window.
This browser window lets you navigate through the different log files from inside the VM.
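
The first line of running.log mentioned above could be picked apart with a few lines of Python, for example to read the number of events from stderr.txt. This is a sketch, not an official format description: the field positions (generator, version, event count as the second-to-last field) are assumptions inferred from the two example lines in this thread, and the function name is made up.

```python
import re

# Captures the bracketed "[boinc ...]" part of a runRivet line.
RUN_LINE = re.compile(r"\[runRivet\].*\[boinc ([^\]]+)\]")


def parse_run_line(line):
    """Extract generator, version and event count from a runRivet log line."""
    match = RUN_LINE.search(line)
    if match is None:
        return None
    fields = match.group(1).split()
    if len(fields) < 7:
        return None
    return {
        "generator": fields[5],          # assumed position
        "version": fields[6],            # assumed position
        "max_events": int(fields[-2]),   # assumed: second-to-last field
    }


line = (
    "... cranky: [INFO] ===> [runRivet] Wed Feb  5 12:14:36 UTC 2020 "
    "[boinc pp zinclusive 7000 -,-,50,130 - madgraph5amc 2.4.3.atlas lo2jet 100000 22]"
)
print(parse_run_line(line))
# {'generator': 'madgraph5amc', 'version': '2.4.3.atlas', 'max_events': 100000}
```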
ID: 41478
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,486,499
RAC: 104,424
Message 41489 - Posted: 7 Feb 2020, 7:05:35 UTC

Sherpa tasks seem to work very badly on my systems.
Right now, among other strange ones, there is one which on console F2 says "Poincare::Poincare(): inaccurate rotation" - it's run for about 17 hours now.
I guess this task is faulty and I should abort it, correct?
ID: 41489
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,044,027
RAC: 136,850
Message 41490 - Posted: 7 Feb 2020, 7:22:38 UTC - in response to Message 41489.  

The decision is up to you.
17 h is far away from the 100 h limit, so you may give it 1-2 more days and check the log from time to time to see whether the task recovers.
Nobody can guarantee that it will succeed but you will get more familiar with sherpa's output and this will also be a success.
ID: 41490