Thread 'Sherpa tasks run okay for long time, then they fail'

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 162,084,071 RAC: 86,405	Message 45887 - Posted: 17 Dec 2021, 16:27:01 UTC
In the recent past I have experienced a lot of Sherpa tasks which run okay for up to several days: VM console 2 shows an increasing number of events processed, but at some point the number stays the same, although there is still CPU activity. All one can do then is abort the task manually. So many hours and, in most cases, even days of CPU time are wasted, which is annoying. Does anyone have any idea why this is happening lately?
ID: 45887

Peter Skands Joined: 31 Jan 11 Posts: 12 Credit: 3,557,813 RAC: 0	Message 46189 - Posted: 8 Feb 2022, 9:26:54 UTC - in response to Message 45887.
Hi Erich56
I agree it's frustrating and I don't actually understand what is happening with these runs. In the past, we had argued that, for some generators, we had to accept a small failure rate since we otherwise could not do comparisons to those generators at all. We had then hoped that updating them to the latest versions would gradually fix the issues we were seeing, but this has not really been the case. Having to operate with a non-negligible rate of jobs that fail is not nice, especially when this fraction does not seem to reduce with time.
I regret if we have been too slow to react, but at least now for 2022, we have come up with a plan to revitalize T4T. To start with, we are going to stop sending out jobs for the generators that are problematic, at least until we can sit down for a good proper debugging session with the authors of those codes, and fully iron their issues out so that they would be ready and steady for sending back out in T4T again.
During 2022, we plan to start by focusing our attention on getting (back) to the virtual equivalent of what the LHC machine people would call 'stable beams' for the most widely used generator, Pythia, setting a new baseline for future T4T operation. At least for that generator, our team has author-level in-house expertise, so we are confident we can do this, if we put in the hours. At the same time, we think this can allow us to try out some new and possibly even more useful tests, which I hope we will be able to also make some announcements of down the track.
So despite the issues you and others have been experiencing, I hope you will choose to stick with our project a little longer and see if things improve during 2022.
Best regards
Peter Skands
ID: 46189

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 46920 - Posted: 22 Jun 2022, 13:49:59 UTC
This Sherpa task processed only 5200 of its 100000 events and then hangs, with the Sherpa process using 99% CPU and making no further progress.
===> [runRivet] Wed Jun 22 12:25:35 UTC 2022 [boinc pp jets 13000 150,-,1860 - sherpa 1.4.1 default 100000 260]
. . .
Event 5200 ( 18m 40s elapsed / 5h 40m 26s left ) -> ETA: Wed Jun 22 18:59
5200 events processed
ID: 46920

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 46982 - Posted: 7 Jul 2022, 9:17:36 UTC - in response to Message 46920. Last modified: 7 Jul 2022, 9:44:12 UTC
https://lhcathome.cern.ch/lhcathome/result.php?resultid=359882300
[runRivet] Wed Jul 6 14:17:04 UTC 2022 [boinc pp ttbar 7000 - - sherpa 2.1.1 default 10000 268]
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative : Signal_Processes
Perturbative : Hard_Decays
Perturbative : Jet_Evolution:CSS
Perturbative : Lepton_FS_QED_Corrections:Photons
Perturbative : Multiple_Interactions:None
Perturbative : Minimum_Bias:Off
Hadronization : Beam_Remnants
Hadronization : Hadronization:Ahadic
Hadronization : Hadron_Decays
Analysis : HepMC2
Maybe for a Cray or IBM summit ;-)
ID: 46982

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 46989 - Posted: 8 Jul 2022, 4:07:54 UTC - in response to Message 46982.
"Maybe for a Cray or IBM summit ;-)"
Event 900 ( 1d 9h 2m 16s elapsed / 13d 22h 2m 56s left ) -> ETA: Fri Jul 22 01:53
XS = 38.6281 pb +- ( 1.2874 pb = 3.33 % )
Is it possible to change the end time for this task from now 8d to 13d in Boinc from my side?
ID: 46989

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 46990 - Posted: 8 Jul 2022, 7:46:06 UTC - in response to Message 46989. Last modified: 8 Jul 2022, 7:51:41 UTC
"Maybe for a Cray or IBM summit ;-)
Event 900 ( 1d 9h 2m 16s elapsed / 13d 22h 2m 56s left ) -> ETA: Fri Jul 22 01:53
XS = 38.6281 pb +- ( 1.2874 pb = 3.33 % )
Is it possible to change the end time for this task from now 8d to 13d in Boinc from my side?"
Yes, it is. Remove the line <job_duration>864000</job_duration> from the Theory_2019_10_01.xml file. Suspend all 'Ready to start' tasks and this Sherpa task, with 'Leave applications in memory' not selected. Wait until the task is saved to disk, then resume the tasks. Do not worry that the server will send a resend to someone else and that your task will get the 'too late' status.
ID: 46990
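The removal described above can also be scripted. What follows is only a minimal sketch, assuming the file is well-formed XML in which <job_duration> appears as an ordinary child element. The function name remove_job_duration and the sample content (including the vbox_job root and the os_name element) are illustrative assumptions, not the real contents of Theory_2019_10_01.xml. Keep a backup copy before touching the real file.

```python
import xml.etree.ElementTree as ET


def remove_job_duration(xml_text: str) -> str:
    """Return xml_text with every <job_duration> element removed."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so visit each element and
    # drop any direct child named job_duration. The element list is
    # materialised first so removal cannot disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "job_duration":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample; the real file holds more settings than this.
SAMPLE = """<vbox_job>
    <os_name>Linux26_64</os_name>
    <job_duration>864000</job_duration>
</vbox_job>"""

if __name__ == "__main__":
    print(remove_job_duration(SAMPLE))
```

The function works on text rather than on a path, so it can be tried on a copy of the file contents first.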

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 46991 - Posted: 8 Jul 2022, 8:14:38 UTC - in response to Message 46990.
Thanks Crystal, I will keep watching over the next days. If day 9 comes, I will make your changes. At the moment it is on day 2 (1,000 of 10,000 events). Is changing the 864,000 duration to a later time not possible because the Boinc server has this 864,000 value? No easy correction for such a beast of a task :-).
ID: 46991

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 46992 - Posted: 8 Jul 2022, 11:29:41 UTC - in response to Message 46991.
Make sure you have <dont_check_file_sizes>1</dont_check_file_sizes> in the options section of your cc_config.xml.
ID: 46992
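As far as I understand the BOINC client documentation, this flag makes the client skip its size check on project files, which is presumably why it matters once a project file has been edited locally. A quick way to verify the setting is sketched below, assuming the usual cc_config.xml layout with a cc_config root and an options section. The function name file_size_check_disabled and the sample text are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def file_size_check_disabled(cc_config_text: str) -> bool:
    """True if <options> holds <dont_check_file_sizes>1</dont_check_file_sizes>."""
    root = ET.fromstring(cc_config_text)
    # findtext returns the default when the element is missing.
    value = root.findtext("options/dont_check_file_sizes", default="0")
    return value.strip() == "1"


# Hypothetical minimal cc_config.xml; a real one may carry more options.
SAMPLE = """<cc_config>
  <options>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>"""

if __name__ == "__main__":
    print(file_size_check_disabled(SAMPLE))
```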

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 47017 - Posted: 14 Jul 2022, 9:49:39 UTC Last modified: 14 Jul 2022, 10:04:11 UTC
7d 19h at the moment, with 6200 of 10k events processed. I have followed your instructions. In 3 days, when the limit of 864k seconds (10d) is reached, I will see whether it worked.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=191707788
Boinc says the task is running. Will the upload be done after success?
ID: 47017

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 47019 - Posted: 14 Jul 2022, 13:18:38 UTC - in response to Message 47017.
"Will the upload be done after success?"
The server doesn't know about the job-duration setting of 864000 seconds. This Theory setting is there to prevent tasks from running forever, but if you think your task is still progressing and will finish successfully, it will upload the result file when finished. The deadline on the server is 1 day later than the deadline sent to BOINC clients. If your task is not finished before the server's deadline, a resend will be sent to another client (as long as the maximum of 3 tasks has not yet been reached), but your task (if successful) will be ready before the resent one.
ID: 47019

Ray Murray Volunteer moderator Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 47020 - Posted: 14 Jul 2022, 16:57:38 UTC - in response to Message 47017. Last modified: 14 Jul 2022, 17:05:18 UTC
Is the number of events still increasing? If there has been no event progress since you last looked, it may have already died, even though Boinc says it is still running. When did it last write to its log?
I currently have a sherpa 1.4.1 on -dev with an ETA in the console of 3hrs ago, using all of its allocated core. It ran for 4hrs but seems stuck on 22400 events, with nothing written to the log in the last 6 hours. It's got until I finish my dinner to make some progress or else it's going to be terminated.
ID: 47020
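The two checks asked about above (is the event count still rising, and when was the log last written) can be combined into a rough heuristic. This is only a sketch: the function names are made up for illustration, the 6-hour idle threshold is an assumption borrowed from the anecdote above, and it relies on the log containing lines of the form "N events processed" as seen earlier in this thread.

```python
import re

# Matches log lines such as "5200 events processed".
EVENTS_RE = re.compile(r"(\d+) events processed")


def last_event_count(log_text: str):
    """Return the most recent 'N events processed' value, or None."""
    matches = EVENTS_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def looks_stalled(previous_count, current_count, seconds_since_log_write,
                  max_idle_seconds=6 * 3600):
    """Heuristic: no new events AND no log output for max_idle_seconds."""
    no_progress = previous_count is not None and current_count == previous_count
    return no_progress and seconds_since_log_write >= max_idle_seconds


if __name__ == "__main__":
    log = "22300 events processed\n22400 events processed\n"
    count = last_event_count(log)
    # Same count as at the last look, and the log idle for 6 hours.
    print(count, looks_stalled(22400, count, 6 * 3600))
```

A slow task that still writes to its log now and then is not flagged, which matches the advice above to look at both signals before aborting.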

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 47021 - Posted: 14 Jul 2022, 20:46:08 UTC - in response to Message 47020.
Now at 6,600 of 10k events (8d 06h). I also had some Sherpa tasks in the last few days that stopped making progress (after 2 days or less). I killed them.
ID: 47021

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717	Message 47022 - Posted: 15 Jul 2022, 7:43:26 UTC - in response to Message 47021.
You could maybe speed up the processing by using only half of your 12 threads.
ID: 47022

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 47023 - Posted: 15 Jul 2022, 8:00:58 UTC - in response to Message 47022.
Thanks Crystal, BUT.... The HP tower PC (Intel) is running other processes too, not only Boinc and VirtualBox. OK, the real turbo would be running it on the Threadripper 3995x (my personal Summit ;-)). 7,100 events (8d 17h) atm.
ID: 47023

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 47024 - Posted: 15 Jul 2022, 13:41:39 UTC - in response to Message 47023.
"7,100 events (8d 17h) atm."
But isn't that going to take more like 14 days to complete, rather than 10?
ID: 47024
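The arithmetic behind that question can be sketched as a simple linear extrapolation. It assumes a constant event rate, which these Sherpa runs do not necessarily have (the console ETA quoted earlier in the thread, at event 900, implied a total of more than 15 days). On the figures quoted here, a constant rate gives roughly 12 days, which is still beyond the 10-day (864000 s) limit. The function name is illustrative.

```python
def projected_total_days(events_done, events_total, elapsed_hours):
    """Linear extrapolation of total run time in days (constant event rate)."""
    if events_done <= 0:
        raise ValueError("need at least one processed event")
    return elapsed_hours * events_total / events_done / 24


if __name__ == "__main__":
    elapsed = 8 * 24 + 17        # 8d 17h, as quoted in the post above
    limit_days = 864000 / 86400  # job_duration in days
    days = projected_total_days(7100, 10000, elapsed)
    print(f"{days:.1f} days projected vs {limit_days:.0f} days allowed")
```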

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 47025 - Posted: 15 Jul 2022, 14:17:16 UTC - in response to Message 47024.
7300 events processed
ZAlign::ZAlign(): p_a*p_b = 152713 vs. 305408, rel. diff. -0.49997
ZAlign::ZAlign(): Q = 579.699 vs. 576.594, rel. diff. 0.0053857
Now exactly 9 days. 10,000 is the end; 2,700 are still to come. 22 MByte of these ZAlign messages so far. Peter Skands and his Team are waiting for the result ;-). btw they have a lot of work.
ID: 47025

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1311 Credit: 97,581,050 RAC: 105,471	Message 47026 - Posted: 15 Jul 2022, 14:40:47 UTC - in response to Message 47025.
"Peter Skands and his Team are waiting for the result ;-). btw they have a lot of work."
Peter just emailed me and said you have too many Threadripper cores for one person. (Ok, I am real busy watching golf at St Andrews)
ID: 47026

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 47027 - Posted: 15 Jul 2022, 15:05:37 UTC - in response to Message 47026.
With these jokes you are a champion :-)).
ID: 47027

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 47028 - Posted: 15 Jul 2022, 21:14:10 UTC - in response to Message 47027.
Task deleted by myself, sorry.
ID: 47028

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 20,376	Message 47046 - Posted: 22 Jul 2022, 17:32:01 UTC - in response to Message 47028. Last modified: 22 Jul 2022, 17:33:40 UTC
Theory_2390-1143052-268
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=191707788
Is it possible to get a restart, but on a faster CPU?
ID: 47046