Sherpa tasks run okay for long time, then they fail
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1735 Credit: 114,064,746 RAC: 84,332 |
Recently I have had a lot of Sherpa tasks that run fine for up to several days: VM console 2 shows an increasing number of processed events, but at some point the number stops changing, although there is still CPU activity. All one can do then is abort the task manually. So many hours, and in most cases even days, of CPU time are wasted, which is annoying. Does anyone have an idea why this has been happening lately? |
Send message Joined: 31 Jan 11 Posts: 12 Credit: 3,557,813 RAC: 0 |
Hi Erich56, I agree it's frustrating, and I don't actually understand what is happening with these runs. In the past, we had argued that, for some generators, we had to accept a small failure rate since we otherwise could not do comparisons to those generators at all. We had then hoped that updating them to the latest versions would gradually fix the issues we were seeing, but this has not really been the case. Having to operate with a non-negligible rate of jobs that fail is not nice, especially when this fraction does not seem to reduce with time. I regret if we have been too slow to react, but at least now for 2022, we have come up with a plan to revitalize T4T. To start with, we are going to stop sending out jobs for the generators that are problematic, at least until we can sit down for a good proper debugging session with the authors of those codes, and fully iron their issues out so that they would be ready and steady for sending back out in T4T again. During 2022, we plan to start by focusing our attention on getting (back) to the virtual equivalent of what the LHC machine people would call 'stable beams' for the most widely used generator, Pythia, setting a new baseline for future T4T operation. At least for that generator, our team has author-level in-house expertise, so we are confident we can do this, if we put in the hours. At the same time, we think this can allow us to try out some new and possibly even more useful tests, which I hope we will be able to also make some announcements of down the track. So despite the issues you and others have been experiencing, I hope you will choose to stick with our project a little longer and see if things improve during 2022. Best regards, Peter Skands |
Send message Joined: 14 Jan 10 Posts: 1352 Credit: 9,085,133 RAC: 2,875 |
This Sherpa task processed only 5200 of its 100000 events and then hung, with the Sherpa process using 99% CPU and no further progress. ===> [runRivet] Wed Jun 22 12:25:35 UTC 2022 [boinc pp jets 13000 150,-,1860 - sherpa 1.4.1 default 100000 260] . . . Event 5200 ( 18m 40s elapsed / 5h 40m 26s left ) -> ETA: Wed Jun 22 18:59 5200 events processed |
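For anyone who wants to track these progress lines from a script instead of watching the console, here is a minimal sketch that pulls the event count and the elapsed/remaining times out of a line in the format quoted above. The function names are my own, and the regular expression is only an assumption based on the log lines posted in this thread, so other Sherpa versions may format the line differently.

```python
import re

# Seconds per unit letter used in the Sherpa progress lines (d, h, m, s).
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

# Pattern inferred from the log lines quoted in this thread (an assumption).
_PROGRESS = re.compile(r"Event (\d+) \(\s*(.+?) elapsed\s*/\s*(.+?) left\s*\)")


def duration_to_seconds(text):
    """Convert a string such as '5h 40m 26s' or '1d 9h 2m 16s' to seconds."""
    return sum(int(n) * _UNIT_SECONDS[u]
               for n, u in re.findall(r"(\d+)\s*([dhms])", text))


def parse_progress(line):
    """Return (events, elapsed_seconds, left_seconds), or None if no match."""
    m = _PROGRESS.search(line)
    if m is None:
        return None
    return (int(m.group(1)),
            duration_to_seconds(m.group(2)),
            duration_to_seconds(m.group(3)))


if __name__ == "__main__":
    line = "Event 5200 ( 18m 40s elapsed / 5h 40m 26s left ) -> ETA: Wed Jun 22 18:59"
    print(parse_progress(line))  # (5200, 1120, 20426)
```

If the same event count comes back on every check while the process keeps burning CPU, that is the hang described in this thread.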
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=359882300 [runRivet] Wed Jul 6 14:17:04 UTC 2022 [boinc pp ttbar 7000 - - sherpa 2.1.1 default 10000 268] Output_Phase::Output_Phase(): Set output interval 1000000000 events. ---------------------------------------------------------- -- SHERPA generates events with the following structure -- ---------------------------------------------------------- Perturbative : Signal_Processes Perturbative : Hard_Decays Perturbative : Jet_Evolution:CSS Perturbative : Lepton_FS_QED_Corrections:Photons Perturbative : Multiple_Interactions:None Perturbative : Minimum_Bias:Off Hadronization : Beam_Remnants Hadronization : Hadronization:Ahadic Hadronization : Hadron_Decays Analysis : HepMC2 Maybe for a Cray or IBM summit ;-) |
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
"Maybe for a Cray or IBM summit ;-)" Event 900 ( 1d 9h 2m 16s elapsed / 13d 22h 2m 56s left ) -> ETA: Fri Jul 22 01:53 XS = 38.6281 pb +- ( 1.2874 pb = 3.33 % ) Is it possible for me to extend the end time of this task in BOINC from the current 8d to 13d? |
Send message Joined: 14 Jan 10 Posts: 1352 Credit: 9,085,133 RAC: 2,875 |
Yes, it is. Remove the line <job_duration>864000</job_duration> from the Theory_2019_10_01.xml file. Suspend all 'Ready to start' tasks and this Sherpa task, with 'Leave applications in memory' not selected. Wait until the task has been saved to disk, then resume the tasks. Do not worry that the server will send a resend to someone else and that your task will get the 'too late' status. |
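If you would rather script the edit than do it by hand, the removal step could be sketched like this. The sample XML in the demo is hypothetical (I am only assuming a simple layout with `job_duration` somewhere inside the root element), and the function name is my own. Back up the real Theory_2019_10_01.xml before touching it.

```python
import xml.etree.ElementTree as ET


def remove_job_duration(xml_text):
    """Return xml_text with every <job_duration> element removed.

    Note: ElementTree drops comments and the XML declaration when
    re-serialising, so keep a backup of the original file.
    """
    root = ET.fromstring(xml_text)
    # Take a snapshot of all elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "job_duration":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical sample, NOT the real contents of Theory_2019_10_01.xml.
    sample = """<vbox_job>
  <memory_size_mb>730</memory_size_mb>
  <job_duration>864000</job_duration>
</vbox_job>"""
    print(remove_job_duration(sample))
```

This only covers the file edit. The suspend, wait and resume steps still have to be done in the BOINC Manager as described above.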
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
Thanks Crystal, I will watch it over the next days. If it gets to day 9, I will make your changes. At the moment it is on day 2 (1,000 of 10,000 events). Is changing the 864,000 duration to a later time not possible because the BOINC server also has this 864,000 value? No easy correction for such a beast of a task :-). |
Send message Joined: 14 Jan 10 Posts: 1352 Credit: 9,085,133 RAC: 2,875 |
Make sure you have <dont_check_file_sizes>1</dont_check_file_sizes> in the options section of your cc_config.xml. |
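As far as I know, this flag tells the BOINC client to skip its file size checks, which is what you want after editing a project file locally. A quick way to confirm the flag is set could look like the sketch below. The function name is my own, and the sample strings are only illustrative, so point it at the text of your own cc_config.xml.

```python
import xml.etree.ElementTree as ET


def file_size_check_disabled(cc_config_text):
    """Return True if <options> contains <dont_check_file_sizes>1</...>."""
    root = ET.fromstring(cc_config_text)  # root is expected to be <cc_config>
    flag = root.find("options/dont_check_file_sizes")
    return flag is not None and (flag.text or "").strip() == "1"


if __name__ == "__main__":
    # Illustrative sample, not a complete cc_config.xml.
    sample = """<cc_config>
  <options>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>"""
    print(file_size_check_disabled(sample))  # True
```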
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
Now at 7d 19h, with 6200 of 10k events processed. I have followed your instructions. In 3 d, when the limit of 864k (10d) is reached, I will see whether it worked. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=191707788 BOINC says the task is running. Will the upload be done after success? |
Send message Joined: 14 Jan 10 Posts: 1352 Credit: 9,085,133 RAC: 2,875 |
"Will the upload be done after success?" The server doesn't know about the job duration setting of 864000 seconds. This Theory setting is there to prevent tasks from running forever, but if you think your task is still progressing and will finish successfully, it will upload the result file when finished. The deadline on the server is 1 day later than the deadline sent to BOINC clients. If your task is not finished before the server's deadline, a resend will be sent to another client (the maximum of 3 tasks has not yet been reached), but your task (if successful) will be finished before the resent one. |
Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 1 |
Is the number of events still increasing? If there has been no event progress since you last looked, it may have already died, even though BOINC says it is still running. When did it last write to its log? I currently have a Sherpa 1.4.1 on -dev with an ETA in the console of 3hrs ago, using all of its allocated core. It ran for 4hrs but seems stuck on 22400 events, with nothing written to the log in the last 6 hours. It's got until I finish my dinner to make some progress or else it's going to be terminated. |
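The "has it moved since I last looked?" check can be written down as a small sketch. It assumes you note the event count from time to time as (timestamp in seconds, events processed) pairs. The function names, the sample numbers and the 6-hour threshold are illustrative choices of mine, not anything the project prescribes.

```python
def seconds_without_progress(samples):
    """samples: list of (timestamp_seconds, events_processed), oldest first.

    Returns how long the most recent event count has stayed unchanged.
    """
    if not samples:
        return 0
    latest_time, latest_events = samples[-1]
    stalled_since = latest_time
    # Walk backwards until we find a sample with a different event count.
    for timestamp, events in reversed(samples):
        if events != latest_events:
            break
        stalled_since = timestamp
    return latest_time - stalled_since


def looks_stalled(samples, max_idle_seconds=6 * 3600):
    """True if the event count has not changed for at least max_idle_seconds."""
    return seconds_without_progress(samples) >= max_idle_seconds


if __name__ == "__main__":
    # Made-up observations: progress up to 22400 events, then nothing for 6 hours.
    observations = [(0, 22000), (3600, 22400), (7200, 22400), (25200, 22400)]
    print(seconds_without_progress(observations))  # 21600
    print(looks_stalled(observations))             # True
```

A stalled count together with a log that has not been written for hours is the pattern described above. CPU usage alone says nothing, since the hung tasks still use a full core.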
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
Now at 6,600 of 10k events (8d 06h). I also had some Sherpas in the last few days that stopped progressing (after 2 days or less). I killed them. |
Send message Joined: 14 Jan 10 Posts: 1352 Credit: 9,085,133 RAC: 2,875 |
You could maybe speed up the processing by using only half of your 12 threads. |
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
Thanks Crystal, BUT... the HP tower PC (Intel) is running other processes too, not only BOINC and VirtualBox. OK, the real turbo would be running it on the Threadripper 3995x (my personal Summit ;-)). 7,100 events (8d 17h) at the moment. |
Send message Joined: 13 Jul 05 Posts: 169 Credit: 14,982,010 RAC: 209 |
"7.100 events (8d 17h) atm." But isn't that going to take more like 14 days to complete, rather than 10? |
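A rough back-of-the-envelope check of that concern, assuming (naively) that the remaining events run at the same average rate as the ones already done. The function name is my own, and real Sherpa runs clearly do not keep a constant rate, so treat the result as a lower-effort estimate rather than a prediction.

```python
def projected_total_days(events_done, elapsed_hours, events_total):
    """Linear extrapolation of the total run time in days."""
    hours_per_event = elapsed_hours / events_done
    return hours_per_event * events_total / 24.0


if __name__ == "__main__":
    elapsed_hours = 8 * 24 + 17  # 8d 17h, as reported above
    total_days = projected_total_days(7100, elapsed_hours, 10000)
    print(round(total_days, 2))  # 12.27
```

Even this optimistic estimate lands at a bit over 12 days, so the task would run past the 10-day (864000 s) job duration either way.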
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
7300 events processed ZAlign::ZAlign(): p_a*p_b = 152713 vs. 305408, rel. diff. -0.49997 ZAlign::ZAlign(): Q = 579.699 vs. 576.594, rel. diff. 0.0053857 Now exactly 9 days. 10,000 is the end, so 2,700 are still to come. 22 MByte of these ZAlign messages so far. Peter Skands and his team are waiting for the result ;-). By the way, they have a lot of work. |
Send message Joined: 24 Oct 04 Posts: 1155 Credit: 51,757,973 RAC: 47,865 |
"Peter Skands and his Team are waiting for the result ;-). btw they have a lot of work." Peter just emailed me and said you have too many Threadripper cores for one person. (OK, I am real busy watching golf at St Andrews.) |
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
With these jokes you are a champion :-)). |
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
I deleted the task myself, sorry. |
Send message Joined: 2 May 07 Posts: 2181 Credit: 172,554,084 RAC: 49,909 |
Theory_2390-1143052-268 https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=191707788 Is it possible to get a restart, but on a faster CPU? |
©2024 CERN