Message boards : Theory Application : Sherpa tasks run okay for long time, then they fail

Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,784,367
RAC: 122,288
Message 45887 - Posted: 17 Dec 2021, 16:27:01 UTC

Recently I have seen a lot of Sherpa tasks that run fine for up to several days: VM console 2 shows an increasing number of events processed, but at some point the number stops increasing, although there is still CPU activity.
All one can do then is abort the task manually.
So many hours and, in most cases, even days of CPU time are wasted, which is annoying.
Does anyone have any idea why this has been happening lately?
ID: 45887 · Report as offensive     Reply Quote
Peter Skands

Joined: 31 Jan 11
Posts: 12
Credit: 3,557,813
RAC: 0
Message 46189 - Posted: 8 Feb 2022, 9:26:54 UTC - in response to Message 45887.  

Hi Erich56

I agree it's frustrating and I don't actually understand what is happening with these runs. In the past, we had argued that, for some generators, we had to accept a small failure rate since we otherwise could not do comparisons to those generators at all. We had then hoped that updating them to the latest versions would gradually fix the issues we were seeing, but this has not really been the case. Having to operate with a non-negligible rate of jobs that fail is not nice, especially when this fraction does not seem to reduce with time.

I regret if we have been too slow to react, but at least now for 2022, we have come up with a plan to revitalize T4T. To start with, we are going to stop sending out jobs for the generators that are problematic, at least until we can sit down for a good proper debugging session with the authors of those codes, and fully iron their issues out so that they would be ready and steady for sending back out in T4T again.

During 2022, we plan to start by focusing our attention on getting (back) to the virtual equivalent of what the LHC machine people would call 'stable beams' for the most widely used generator, Pythia, setting a new baseline for future T4T operation. At least for that generator, our team has author-level in-house expertise, so we are confident we can do this, if we put in the hours.

At the same time, we think this can allow us to try out some new and possibly even more useful tests, which I hope we will be able to also make some announcements of down the track. So despite the issues you and others have been experiencing, I hope you will choose to stick with our project a little longer and see if things improve during 2022.

Best regards
Peter Skands
ID: 46189 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,652
RAC: 2,067
Message 46920 - Posted: 22 Jun 2022, 13:49:59 UTC

This Sherpa task processes only 5200 of its 100000 events and then hangs, with the Sherpa process using 99% CPU and making no further progress.



===> [runRivet] Wed Jun 22 12:25:35 UTC 2022 [boinc pp jets 13000 150,-,1860 - sherpa 1.4.1 default 100000 260]
.
.
.
Event 5200 ( 18m 40s elapsed / 5h 40m 26s left ) -> ETA: Wed Jun 22 18:59
5200 events processed
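For anyone who wants to track these numbers from a saved log rather than the console, here is a minimal sketch that parses the progress line quoted above. It assumes the line always has the "Event N ( ... elapsed / ... left )" shape seen in this thread; the function names are hypothetical and not part of Sherpa or BOINC.

```python
import re

# Matches the progress lines quoted in this thread, e.g.
# "Event 5200 ( 18m 40s elapsed / 5h 40m 26s left ) -> ETA: Wed Jun 22 18:59"
PROGRESS_RE = re.compile(
    r"Event\s+(\d+)\s+\(\s*(.+?)\s+elapsed\s*/\s*(.+?)\s+left\s*\)"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text):
    """Convert a string such as '1d 9h 2m 16s' to a number of seconds."""
    parts = re.findall(r"(\d+)\s*([dhms])", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def parse_progress(line):
    """Return (event, elapsed_seconds, left_seconds), or None if no match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return (
        int(match.group(1)),
        duration_to_seconds(match.group(2)),
        duration_to_seconds(match.group(3)),
    )
```

Lines that do not match (for example the "XS = ..." cross-section lines) simply return None, so the function can be applied to every line of a log.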
ID: 46920 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 46982 - Posted: 7 Jul 2022, 9:17:36 UTC - in response to Message 46920.  
Last modified: 7 Jul 2022, 9:44:12 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=359882300
[runRivet] Wed Jul 6 14:17:04 UTC 2022 [boinc pp ttbar 7000 - - sherpa 2.1.1 default 10000 268]
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative : Signal_Processes
Perturbative : Hard_Decays
Perturbative : Jet_Evolution:CSS
Perturbative : Lepton_FS_QED_Corrections:Photons
Perturbative : Multiple_Interactions:None
Perturbative : Minimum_Bias:Off
Hadronization : Beam_Remnants
Hadronization : Hadronization:Ahadic
Hadronization : Hadron_Decays
Analysis : HepMC2
Maybe for a Cray or IBM summit ;-)
ID: 46982 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 46989 - Posted: 8 Jul 2022, 4:07:54 UTC - in response to Message 46982.  

Maybe for a Cray or IBM summit ;-)

Event 900 ( 1d 9h 2m 16s elapsed / 13d 22h 2m 56s left ) -> ETA: Fri Jul 22 01:53
XS = 38.6281 pb +- ( 1.2874 pb = 3.33 % )
Is it possible to change the end time for this task from the current 8 days to 13 days in BOINC from my side?
ID: 46989 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,652
RAC: 2,067
Message 46990 - Posted: 8 Jul 2022, 7:46:06 UTC - in response to Message 46989.  
Last modified: 8 Jul 2022, 7:51:41 UTC

Maybe for a Cray or IBM summit ;-)

Event 900 ( 1d 9h 2m 16s elapsed / 13d 22h 2m 56s left ) -> ETA: Fri Jul 22 01:53
XS = 38.6281 pb +- ( 1.2874 pb = 3.33 % )
Is it possible to change the end time for this task from the current 8 days to 13 days in BOINC from my side?
Yes, it is.

Remove the line <job_duration>864000</job_duration> from the Theory_2019_10_01.xml file.
Suspend all 'Ready to start' tasks and this Sherpa task, with 'Leave applications in memory' not selected.
Wait until the task is saved to disk, then resume the tasks.
Do not worry that the server will send a resend to someone else and that your task will get the 'too late' status.
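The first step above can also be scripted. This is only a sketch: it assumes the <job_duration> element sits on a line of its own (as "remove the line" suggests), the function name is hypothetical, and you have to pass the path to Theory_2019_10_01.xml yourself because its location depends on your BOINC installation. It keeps a .bak copy next to the file.

```python
import re
import shutil
from pathlib import Path

# Assumption: the element occupies a whole line, so dropping the line is safe.
JOB_DURATION_RE = re.compile(r"<job_duration>\s*\d+\s*</job_duration>")


def remove_job_duration(xml_path):
    """Delete every <job_duration>...</job_duration> line from the file.

    A backup named '<file>.bak' is written first.
    Returns the number of lines that were removed.
    """
    path = Path(xml_path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    lines = path.read_text().splitlines(keepends=True)
    kept = [line for line in lines if not JOB_DURATION_RE.search(line)]
    path.write_text("".join(kept))
    return len(lines) - len(kept)
```

The suspend and resume steps still have to be done in the BOINC Manager as described; the script only edits the file.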
ID: 46990 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 46991 - Posted: 8 Jul 2022, 8:14:38 UTC - in response to Message 46990.  

Thanks Crystal,
I will watch it over the next days.
If it gets to day 9, I will make your changes.
At the moment it is on day 2 (1,000 of 10,000 events).
Is changing the 864,000 duration to a later value not possible because the BOINC server also holds this 864,000 value?
There is no easy correction for such a beast of a task :-).
ID: 46991 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,652
RAC: 2,067
Message 46992 - Posted: 8 Jul 2022, 11:29:41 UTC - in response to Message 46991.  

Make sure you have

<dont_check_file_sizes>1</dont_check_file_sizes>

in the options section of your cc_config.xml, since you are modifying a project file locally.
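If you are not sure whether the option is set, a quick check along these lines might help. It is a sketch that assumes the usual cc_config.xml layout with a <cc_config> root element and an <options> child; the function name is hypothetical, and the example string stands in for the contents of your own file.

```python
import xml.etree.ElementTree as ET


def dont_check_file_sizes_enabled(cc_config_text):
    """True if <options> contains <dont_check_file_sizes>1</dont_check_file_sizes>."""
    root = ET.fromstring(cc_config_text)
    value = root.findtext("options/dont_check_file_sizes")
    return value is not None and value.strip() == "1"


# Stand-in for the text of a real cc_config.xml (assumed layout).
example = """<cc_config>
  <options>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>"""

print(dont_check_file_sizes_enabled(example))
```

After editing cc_config.xml, the client has to re-read its config files (or be restarted) before the option takes effect.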
ID: 46992 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 47017 - Posted: 14 Jul 2022, 9:49:39 UTC
Last modified: 14 Jul 2022, 10:04:11 UTC

7d 19h at the moment, and 6200 of 10k events processed.
I have followed your instructions.
In about 3 days, when the limit of 864k seconds (10d) is reached, I will see whether it worked.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=191707788
BOINC says the task is running.
Will the upload be done after success?
ID: 47017 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,652
RAC: 2,067
Message 47019 - Posted: 14 Jul 2022, 13:18:38 UTC - in response to Message 47017.  

Will the upload be done after success?
The server doesn't know about the job duration setting of 864000 seconds.
This Theory setting is there to prevent tasks from running forever, but if you think your task is still progressing and will finish successfully, it will upload the result file when finished.
The deadline on the server is 1 day longer than the deadline sent to BOINC clients.
If your task is not finished before the server's deadline, a resend will be sent to another client (the maximum of 3 tasks has not yet been reached),
but your task (if successful) will be finished before the resent one.
ID: 47019 · Report as offensive     Reply Quote
Ray Murray
Volunteer moderator

Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 47020 - Posted: 14 Jul 2022, 16:57:38 UTC - in response to Message 47017.  
Last modified: 14 Jul 2022, 17:05:18 UTC

Is the number of events still increasing? If there has been no event progress since you last looked, it may have already died, even though BOINC says it is still running. When did it last write to its log?
I currently have a Sherpa 1.4.1 on -dev with an ETA in the console of 3 hrs ago, using all of its allocated core. It ran for 4 hrs but seems stuck on 22400 events, with nothing written to the log in the last 6 hours. It's got until I finish my dinner to make some progress or else it's going to be terminated.
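The two checks described here (has the event count moved, and when was the log last written) could be combined into a rough heuristic like the one below. It is a sketch under assumptions: the log text contains "N events processed" lines as quoted earlier in the thread, the caller supplies the time since the last log write (for example from the file's modification time), and the 6-hour patience value is only an illustration, not a project setting. The function names are hypothetical.

```python
import re

# "5200 events processed" style lines, one per line of the log.
EVENTS_RE = re.compile(r"^\s*(\d+) events processed\s*$", re.MULTILINE)


def last_event_count(log_text):
    """Return the last 'N events processed' value in the log, or None."""
    counts = EVENTS_RE.findall(log_text)
    return int(counts[-1]) if counts else None


def looks_stalled(previous_count, current_count,
                  seconds_since_last_log_write, patience_s=6 * 3600):
    """Heuristic: same event count as before AND a quiet log for patience_s."""
    no_progress = current_count is not None and current_count == previous_count
    return no_progress and seconds_since_last_log_write >= patience_s
```

A task that is merely slow still writes to its log now and then, so requiring both conditions avoids flagging the multi-day tasks discussed above just because the event count moves rarely.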
ID: 47020 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 47021 - Posted: 14 Jul 2022, 20:46:08 UTC - in response to Message 47020.  

Now 6,600 of 10k events (8d 06h).
I also had some Sherpa tasks in the last few days that stopped progressing (after 2 days or less). I killed them.
ID: 47021 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,652
RAC: 2,067
Message 47022 - Posted: 15 Jul 2022, 7:43:26 UTC - in response to Message 47021.  

You could maybe speed up the processing by using only half of your 12 threads.
ID: 47022 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 47023 - Posted: 15 Jul 2022, 8:00:58 UTC - in response to Message 47022.  

Thanks Crystal,
BUT... the HP tower PC (Intel) is running other processes too, not only BOINC and VirtualBox.
OK, the real turbo would be running it on the Threadripper 3995x (my personal Summit ;-)).
7,100 events (8d 17h) at the moment.
ID: 47023 · Report as offensive     Reply Quote
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 511
Message 47024 - Posted: 15 Jul 2022, 13:41:39 UTC - in response to Message 47023.  

7,100 events (8d 17h) atm.
But isn't that going to take more like 14 days to complete, rather than 10?
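As a rough cross-check of that estimate, here is a simple linear extrapolation from the numbers in this thread. It assumes a constant event rate, which Sherpa tasks clearly need not have (the console ETA quoted at event 900 was longer), so treat it as a sketch rather than a prediction. The function name is hypothetical.

```python
def projected_total_days(events_done, events_total, elapsed_hours):
    """Linear extrapolation of total run time in days at a constant event rate."""
    if events_done <= 0:
        raise ValueError("need at least one processed event")
    return elapsed_hours * events_total / events_done / 24.0


elapsed_hours = 8 * 24 + 17      # 8d 17h, from the post quoted above
total_days = projected_total_days(7100, 10000, elapsed_hours)
limit_days = 864000 / 86400      # the job_duration value of 864000 seconds

print(f"projected total: {total_days:.1f} days, limit: {limit_days:.0f} days")
```

Under that constant-rate assumption the total comes out at a little over 12 days, which is still beyond the 10-day job_duration limit either way.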
ID: 47024 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 47025 - Posted: 15 Jul 2022, 14:17:16 UTC - in response to Message 47024.  

7300 events processed
ZAlign::ZAlign(): p_a*p_b = 152713 vs. 305408, rel. diff. -0.49997
ZAlign::ZAlign(): Q = 579.699 vs. 576.594, rel. diff. 0.0053857

Now exactly 9 days.
10,000 is the target; 2,700 events are still to come.
22 MByte of output with these ZAlign messages.

Peter Skands and his Team are waiting for the result ;-). btw they have a lot of work.
ID: 47025 · Report as offensive     Reply Quote
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1127
Credit: 49,750,513
RAC: 9,551
Message 47026 - Posted: 15 Jul 2022, 14:40:47 UTC - in response to Message 47025.  

Peter Skands and his Team are waiting for the result ;-). btw they have a lot of work.

Peter just emailed me and said you have too many Threadripper cores for one person.
(OK, I am really busy watching golf at St Andrews.)
ID: 47026 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 47027 - Posted: 15 Jul 2022, 15:05:37 UTC - in response to Message 47026.  

With these jokes you are a champion :-)).
ID: 47027 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 47028 - Posted: 15 Jul 2022, 21:14:10 UTC - in response to Message 47027.  

I deleted the task myself, sorry.
ID: 47028 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,978
RAC: 139,751
Message 47046 - Posted: 22 Jul 2022, 17:32:01 UTC - in response to Message 47028.  
Last modified: 22 Jul 2022, 17:33:40 UTC

Theory_2390-1143052-268
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=191707788
Is it possible to get a restart, but on a faster CPU?
ID: 47046 · Report as offensive     Reply Quote


©2024 CERN