Message boards :
Theory Application :
New Version v300.05
Author | Message |
---|---|
Send message Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 ![]() ![]() |
"here the next one:" There is a job limit of 100 hours, i.e. 360000 seconds. From your result: 2020-01-25 11:32:28 (1004): Status Report: Job Duration: '360000.000000' 2020-01-25 11:32:28 (1004): Status Report: Elapsed Time: '358080.000000' 2020-01-25 11:32:28 (1004): Status Report: CPU Time: '356907.000000' 2020-01-25 12:04:30 (1004): Powering off VM. The reported elapsed time of 358080 seconds plus the 1922 seconds between that report and the power-off (12:04:30 - 11:32:28) makes 360002 seconds, which exceeds the limit and kills the job :-( |
Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,945,290 RAC: 83,099 ![]() ![]() ![]() |
"It looks like you get twice the setting in your preferences for Max # CPUs, so I suppose you have set there 1." Many thanks, CP, your suggestion worked fine. For some reason, this odd behaviour did not show up here before, only recently. So now it's good to know how to circumvent it :-) |
![]() Send message Joined: 9 Feb 16 Posts: 50 Credit: 543,905 RAC: 194 ![]() ![]() |
Following resume from hibernation over the weekend, this long-runner briefly continued on to something over 57,000 events, and then it reset itself and started again from zero: https://lhcathome.cern.ch/lhcathome/result.php?resultid=259725584 It's possible Theory tasks don't survive hibernation over a weekend. However, I also caught it last week throwing errors/warnings: PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.10978) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.34534) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 6.61948) for g to b The decay Xi(1690)- -> Sigma- Kbar0 2.10871 500 is too inefficient for the particle 816 Xi(1690) - 13312 [601] 0.935 2.078 25 .560 25.718 5 «bs 9 vetoing the decay PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.05218) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 5.63208) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 3.54622) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.06896) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.98784) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.04204) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.83883) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.85025) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 12.6764) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 3.83015) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.55048) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.53167) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.04879) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.41224) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.92092) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.52194) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.07241) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 4.16827) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.85123) for g to b PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.09399) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 2.30685) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.31057) for g to bbar PDFVeto warning: Ratio > GtobbbarSudakov:PDFmax (by a factor of 1.38341) for g to b An event exception of type ThePEG::Exception occurred while generating event number 28880: Remnant extraction failed in ShowerHandler::cascade() from primary interaction The event will be discarded. 28900 events processed 29000 events processed dumping histograms... |
Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,945,290 RAC: 83,099 ![]() ![]() ![]() |
"It's possible Theory tasks don't survive hibernation over a weekend." Most probably so. ATLAS tasks are even more susceptible to lengthy interruptions. In general, experience has shown that VM tasks should not be stopped for too long. |
Send message Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 ![]() ![]() |
"It's possible Theory tasks don't survive hibernation over a weekend." Surviving a longer pause (even a weekend) is possible. When you stop BOINC or shut down the computer, you have to suspend the VBox tasks first. In your local preferences, the setting "Leave non-GPU tasks in memory while suspended" (LAIM) should not be ticked. After a VBox task is suspended, the state of the virtual machine is saved to disk. You can watch the VM states in Oracle VM VirtualBox Manager. When you have several VMs running, don't suspend them all at once, but one by one. |
![]() Send message Joined: 28 Sep 04 Posts: 780 Credit: 59,994,422 RAC: 47,343 ![]() ![]() ![]() |
In general I agree with you, but sometimes paused Theory tasks survive. Here is an example of a task that was continued after an 18-hour pause: https://lhcathome.cern.ch/lhcathome/result.php?resultid=259747616 I don't know if it ended prematurely though. It ran for about three hours after it continued and finished OK. |
Send message Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 ![]() ![]() |
@Harri: Your task paused but it was kept in memory during the pause. |
![]() Send message Joined: 28 Sep 04 Posts: 780 Credit: 59,994,422 RAC: 47,343 ![]() ![]() ![]() |
"@Harri: Your task paused but it was kept in memory during the pause." OK. That is good to know. With previous versions, I remember that tasks used to fail on pause because the communication with the server was lost and the server could not cope with that. |
![]() Send message Joined: 9 Feb 16 Posts: 50 Credit: 543,905 RAC: 194 ![]() ![]() |
I've now confirmed Theory doesn't survive an overnight hibernation either, even with Leave non-GPU tasks in memory while suspended not selected (I've never had this selected). So that explains the tasks that never complete, but then get completed in a fraction of the time by another host. A task that doesn't complete by the time the host is put into hibernation will restart the following morning, and if it can complete by the end of the working day it should do so. But if it can't complete by the end of the working day, it will just run and run, never completing. I've not yet tried suspending VM tasks before hibernating, but I've never had to do that with Theory tasks in the past. |
Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,945,290 RAC: 83,099 ![]() ![]() ![]() |
"here the next one:" Here is the next one with "file x-fer error" - again after 4 days and 4 hours, exactly like the task cited above. What can be done to avoid such a waste??? By now, I am pretty much annoyed that such faulty tasks come up again and again :-( |
Send message Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 ![]() ![]() |
"here the next one with 'file x-fer error' - again after 4 days and 4 hours, exactly like the above cited task." The duration limit is there to prevent a faulty task from running endlessly. In the past, the task was killed gracefully and you got credit for it, although the job inside the VM did not finish. Since there is no longer a sequence of jobs running in the VM, but only one job, the task duration limit has been extended from 18 hours to 100 hours to give long runners a chance to finish. 100 hours is obviously not enough for some jobs, mostly Sherpa ones. If you want to let a job run longer (unlimited) because you believe it will finish someday, you have to adjust two files: in the options part of cc_config.xml, add the line <dont_check_file_sizes>1</dont_check_file_sizes>, and delete the line <job_duration>360000</job_duration> from Theory_2019_11_13a.xml. Disadvantage: you have to abort a seemingly faulty task yourself. |
Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,945,290 RAC: 83,099 ![]() ![]() ![]() |
Thank you, C.P., for your explanations. One question I have left: do I have a chance to recognize such a faulty task early, so that I can abort it in time and don't have to wait that long? Unfortunately, I neglected to look into the VM console - would I have seen signs there that the task was not running okay? |
Send message Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 ![]() ![]() |
"One question I have left: do I have a chance to recognize such a faulty task early? So that I can abort it in time, and wouldn't have to wait that long a time?" It's more the opposite. In the consoles (more specifically Alt-F2) you can see whether a job is running fine: still making progress and, with Sherpa, maybe showing an ETA. Alt-F3 shows the CPU usage. Alt-F1 shows the max number of events, mostly 100000 for Sherpa, sometimes lower. When a Sherpa job is in the final event-processing part, after optimizing and integration, it's likely that the job will finish someday, although I have seen jobs suddenly getting stuck in a loop without new event-processing messages. You can also open localhost:portnumber in a web browser to display the whole running log that is shown in the F2 console. It is accessible via BOINC Manager -> Show graphics on a running task. |
Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,945,290 RAC: 83,099 ![]() ![]() ![]() |
Thank you, C.P., for your thorough explanations. On one of my PCs, I unfortunately had another such case, where the task failed after 4 hours 4 minutes: https://lhcathome.cern.ch/lhcathome/result.php?resultid=260540145 Shortly before, I looked at the console and saw that everything was running well (more than 80.000 events processed, as seen on console 2), but suddenly I got a phone call and subsequently had to leave for about 2 hours. When I came back, I noticed that the task had errored out :-( Really annoying after a processing time of more than 4 days. What I am wondering is why such longrunners are created at all, if on the other hand it's clear that a task stops after 4 hours 4 minutes. |
Send message Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 ![]() ![]() |
"On one of my PCs, I unfortunately had another such case, where the task failed after 4 hours 4 minutes:" Yeah, a real pity. (You meant 4 days 4 hours.) Interesting that it was a pythia8 job: ===> [runRivet] Mon Jan 27 07:16:01 UTC 2020 [boinc pp jets 8000 25 - pythia8 8.235 early 100000 18] But you now know how to let them run unlimited. "What I am wondering is why such longrunners are created at all, if on the other hand it's clear that a task stops after 4 hours 4 minutes." It's not clear in advance how long a job will run. |
Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,945,290 RAC: 83,099 ![]() ![]() ![]() |
But you now know how to let them run unlimited. ...I'll definitely make the changes/adaptions you suggested ASAP - thanks again for that! |
Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,945,290 RAC: 83,099 ![]() ![]() ![]() |
Crystal Pellet wrote: "It's more the opposite. In the consoles (more specific ALT-F2) you can see whether a job is running fine; still making progress and with Sherpa maybe an ETA. Alt-F3 for cpu-usage. Alt-F1 for max number of events - mostly 100000 for sherpa sometimes lower." Among others, there are currently two Sherpa longrunners on one of my machines: one has run for 7 days 6 hours, the other for 4 days and 13 hours. In the console, both are showing the same: F1 shows 9 lines with some "cranky" info. F2 shows the screen with the "COMIX" banner (Color dressed Matrix Elements, http://comix.freacafe.de, please cite JHEP12(2008)039), followed by: Matrix_Element_Handler::BuildProcesses(): Looking for processes ... done ( 47 MB, 31s / 31s ). Matrix_Element_Handler::InitializeProcesses(): Performing tests ... done ( 47 MB, 0s / 0s ). Initialized the Matrix_Element_Handler for the hard processes. Initialized the Beam_Remnant_Handler. Hadron_Decay_Map::Read: Initializing HadronDecays.dat. This may take some time. Initialized the Hadron_Decay_Handler, Decay model = Hadrons Initialized the Soft_Photon_Handler. Variations::InitialiseParametersVector(0 variations){ Named variations: } Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__e-__veb' (Comix) Starting the calculation at 08:54:50. Lean back and enjoy ... . F3 shows information about CPU usage and memory usage, with the CPU at 99% (which the Windows Task Manager also shows). On none of the three screens do I get the information as to how many events have been processed (as F1 would normally show, at least with Pythia), and F1 does NOT show the max. number of events. My question now is: am I in an endless loop, and should I therefore cancel the tasks? |
![]() Send message Joined: 15 Jun 08 Posts: 2683 Credit: 286,886,316 RAC: 55,049 ![]() ![]() |
Windows Task Manager is not a good helper in this case, as it doesn't show which process inside the VM is using the CPU. Hence the top output on console ALT-F3, which shows exactly that. ALT-F1 should show the first line of the task's running.log, which tells you what subtask you got and how many events it will calculate. The same typical output can also be found in the stderr.txt in your slots folder. Example: ... cranky: [INFO] ===> [runRivet] Wed Feb 5 12:14:36 UTC 2020 [boinc pp zinclusive 7000 -,-,50,130 - madgraph5amc 2.4.3.atlas lo2jet 100000 22] ALT-F2 shows the last lines of the running.log from inside the VM. Unfortunately, scrolling is not possible. Most scientific apps, e.g. pythia, show the event progress there. Sherpa prints much more output during its different calculation phases. Hence its output is tricky to interpret. To view the complete running.log, you may select the task in BOINC Manager and click on "show graphics", which opens a browser window. This browser window enables you to navigate through the different logfiles from inside the VM. |
Send message Joined: 18 Dec 15 Posts: 1908 Credit: 144,945,290 RAC: 83,099 ![]() ![]() ![]() |
Sherpa tasks seem to work very badly on my systems. Right now, among other strange ones, there is one which on console F2 says "Poincare::Poincare(): inaccurate rotation" - it's run for about 17 hours now. I guess this task is faulty and I should abort it, correct? |
![]() Send message Joined: 15 Jun 08 Posts: 2683 Credit: 286,886,316 RAC: 55,049 ![]() ![]() |
The decision is up to you. 17 h is far away from the 100 h limit, so you may give it 1-2 more days and check the log from time to time to see whether the task recovers. Nobody can guarantee that it will succeed but you will get more familiar with sherpa's output and this will also be a success. |
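
The arithmetic behind the 100-hour limit discussed in this thread can be checked against a task's log. What follows is only a minimal sketch in Python, assuming the Status Report lines look exactly like the ones quoted earlier in the thread and that the elapsed time keeps advancing at wall-clock rate after the last report. The function name `remaining_seconds` is hypothetical and not part of BOINC or vboxwrapper.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2020-01-25 11:32:28 (1004): Status Report: Elapsed Time: '358080.000000'
# (format taken from the log excerpt quoted in the thread)
STATUS_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Status Report: (Job Duration|Elapsed Time|CPU Time): '([0-9.]+)'"
)


def remaining_seconds(log_text, now):
    """Estimate the seconds left before the job duration limit is reached.

    Returns None if the log contains no complete status report.
    Assumption: elapsed time grows at wall-clock rate after the last report.
    """
    latest = {}
    for stamp, key, value in STATUS_RE.findall(log_text):
        reported_at = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        latest[key] = (reported_at, float(value))  # later lines overwrite earlier ones
    if "Job Duration" not in latest or "Elapsed Time" not in latest:
        return None
    reported_at, elapsed = latest["Elapsed Time"]
    limit = latest["Job Duration"][1]
    since_report = (now - reported_at).total_seconds()
    return limit - (elapsed + since_report)


# The three report lines quoted in the thread, evaluated at the power-off time.
sample = """2020-01-25 11:32:28 (1004): Status Report: Job Duration: '360000.000000'
2020-01-25 11:32:28 (1004): Status Report: Elapsed Time: '358080.000000'
2020-01-25 11:32:28 (1004): Status Report: CPU Time: '356907.000000'
"""
left = remaining_seconds(sample, datetime(2020, 1, 25, 12, 4, 30))
print(left)  # a negative value means the limit has already been exceeded
```

For the quoted log, the result is negative by 2 seconds, which matches the 360002 seconds worked out above. For a task at 17 hours, the same calculation would show roughly 83 hours of headroom.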
©2025 CERN