1) Message boards : Theory Application : "inconsist mother/daughter information in Phytia8 event" (Message 46192)
Posted 8 Feb 2022 by Peter Skands
Post:
Hi,

Generator updates and bug fixes are not always as fast as one might like, but I wanted to at least show that this one did get addressed, and to thank those who wrote to us about the error messages they were seeing concerning inconsistent mother/daughter pointers. Here is the relevant entry from the program's update notes at pythia.org:

8.306: 28 June 2021
- Fixed issue for HepMC output from Vincia, which would previously issue warnings about inconsistent mother/daughter relationships, caused by Vincia's antenna-style bookkeeping by which emitted partons have two mothers instead of one. For status codes 43, 51, and 53, the HepMC interface now ignores the second parent, always using just the first one to define the vertex structure. Minor modifications to Vincia's QCD shower to ensure that the first mother is the one that changed colour and hence would be identified with the "radiator" in a collinear context. Analogous modifications in the QED module so the most collinear parent is the first mother.
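
For those who like to poke at the event record themselves, below is a minimal stand-alone sketch of the kind of bookkeeping involved (my own illustration, not an official Pythia example). It assumes a Pythia 8.3 installation, where the Vincia shower is selected with PartonShowers:model = 2; the beam energy and hard process are just placeholders. It prints each shower parton carrying status 43, 51 or 53 together with both of its listed mothers, which is where Vincia's antenna-style bookkeeping shows up.

// Minimal illustrative main program (not an official Pythia example):
// generate one hard-QCD event with Vincia and print the shower partons
// with status 43, 51 or 53 together with both of their listed mothers.
#include "Pythia8/Pythia.h"
using namespace Pythia8;

int main() {
  Pythia pythia;
  pythia.readString("PartonShowers:model = 2");   // select the Vincia shower (Pythia 8.3)
  pythia.readString("Beams:eCM = 13000.");        // placeholder beam energy
  pythia.readString("HardQCD:all = on");          // placeholder hard process
  pythia.readString("PhaseSpace:pTHatMin = 20.");
  if (!pythia.init()) return 1;

  if (!pythia.next()) return 1;
  const Event& event = pythia.event;
  for (int i = 0; i < event.size(); ++i) {
    int st = event[i].statusAbs();
    if (st == 43 || st == 51 || st == 53)
      cout << " i = " << i << "  id = " << event[i].id()
           << "  mothers = (" << event[i].mother1()
           << ", " << event[i].mother2() << ")\n";
  }
  return 0;
}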

Again, thanks as always for reporting issues that you see in the runs, and apologies if we are sometimes (very) slow to react to them; we are not a big group of people but have to cover a lot of ground.

All the best,
Peter Skands, on behalf of the Pythia team.
2) Message boards : Theory Application : Tasks run 4 days and finish with error (Message 46191)
Posted 8 Feb 2022 by Peter Skands
Post:
Hi Greger & others on this thread,

I posted a message on a couple of similar threads just now, but repeat it here. I agree it's frustrating and I don't actually understand what is happening with these runs. In the past, we had argued that, for some generators, we had to accept a small failure rate since we otherwise could not do comparisons to those generators at all. We had then hoped that updating them to the latest versions would gradually fix the issues we were seeing, but this has not really been the case. Having to operate with a non-negligible rate of jobs that fail is not nice, especially when this fraction does not seem to reduce with time.

I regret if we have been too slow to react, but at least now for 2022, we have come up with a plan to revitalize T4T. To start with, we are going to stop sending out jobs for the generators that are problematic, at least until we can sit down for a good proper debugging session with the authors of those codes, and fully iron their issues out so that they would be ready and steady for sending back out in T4T again.

During 2022, we plan to start by focusing our attention on getting (back) to the virtual equivalent of what the LHC machine people would call 'stable beams' for the most widely used generator, Pythia, setting a new baseline for future T4T operation. At least for that generator, our team has author-level in-house expertise, so we are confident we can do this, if we put in the hours.

At the same time, we think this can allow us to try out some new and possibly even more useful tests, which I hope we will be able to also make some announcements of down the track. So despite the issues you and others have been experiencing, I hope you will choose to stick with our project a little longer and see if things improve during 2022.

Best regards
Peter Skands
3) Message boards : Theory Application : Theory Task having crazy time (Message 46190)
Posted 8 Feb 2022 by Peter Skands
Post:
Hi maeax and Crystal,

I posted a message on a similar thread to Erich56 just now, but repeat it here. I agree it's frustrating and I don't actually understand what is happening with these runs. In the past, we had argued that, for some generators, we had to accept a small failure rate since we otherwise could not do comparisons to those generators at all. We had then hoped that updating them to the latest versions would gradually fix the issues we were seeing, but this has not really been the case. Having to operate with a non-negligible rate of jobs that fail is not nice, especially when this fraction does not seem to reduce with time.

I regret if we have been too slow to react, but at least now for 2022, we have come up with a plan to revitalize T4T. To start with, we are going to stop sending out jobs for the generators that are problematic, at least until we can sit down for a good proper debugging session with the authors of those codes, and fully iron their issues out so that they would be ready and steady for sending back out in T4T again.

During 2022, we plan to start by focusing our attention on getting (back) to the virtual equivalent of what the LHC machine people would call 'stable beams' for the most widely used generator, Pythia, setting a new baseline for future T4T operation. At least for that generator, our team has author-level in-house expertise, so we are confident we can do this, if we put in the hours.

At the same time, we think this can allow us to try out some new and possibly even more useful tests, which I hope we will be able to also make some announcements of down the track. So despite the issues you and others have been experiencing, I hope you will choose to stick with our project a little longer and see if things improve during 2022.

Best regards
Peter Skands
4) Message boards : Theory Application : Sherpa tasks run okay for long time, then they fail (Message 46189)
Posted 8 Feb 2022 by Peter Skands
Post:
Hi Erich56,

I agree it's frustrating and I don't actually understand what is happening with these runs. In the past, we had argued that, for some generators, we had to accept a small failure rate since we otherwise could not do comparisons to those generators at all. We had then hoped that updating them to the latest versions would gradually fix the issues we were seeing, but this has not really been the case. Having to operate with a non-negligible rate of jobs that fail is not nice, especially when this fraction does not seem to reduce with time.

I regret if we have been too slow to react, but at least now for 2022, we have come up with a plan to revitalize T4T. To start with, we are going to stop sending out jobs for the generators that are problematic, at least until we can sit down for a good proper debugging session with the authors of those codes, and fully iron their issues out so that they would be ready and steady for sending back out in T4T again.

During 2022, we plan to start by focusing our attention on getting (back) to the virtual equivalent of what the LHC machine people would call 'stable beams' for the most widely used generator, Pythia, setting a new baseline for future T4T operation. At least for that generator, our team has author-level in-house expertise, so we are confident we can do this, if we put in the hours.

At the same time, we think this can allow us to try out some new and possibly even more useful tests, which I hope we will be able to also make some announcements of down the track. So despite the issues you and others have been experiencing, I hope you will choose to stick with our project a little longer and see if things improve during 2022.

Best regards
Peter Skands
5) Message boards : Theory Application : Pythia8 - inconsistent mother/daugher information (Message 42371)
Posted 3 May 2020 by Peter Skands
Post:
Thanks, computezrmle, for alerting us to this.

I've committed a temporary fix to silence these warnings, which should go out in production runs soon.

The problem arises because we introduced a new "bremsstrahlung shower" model into Pythia 8.301, called Vincia. (Previously distributed as a stand-alone plug-in to Pythia.) That model has different "mother-daughter" relationships - "mothers" here referring to particles that act as sources of bremsstrahlung, while "daughters" refer to the emitted quanta. (Vincia is based on a so-called "antenna" model of radiation, while Pythia's default model is based on something more like individual charges.) Internally, the new model is self-consistent, but the consistency check made in the output interface for event analysis has not yet been updated to "accept" these new alternative relationships. So, we think the events you are generating are actually OK - but the output interface doesn't realise it and issues a warning. We will correct that in an upcoming version, probably 8.303 or 8.304, but until then I have activated the flag mentioned in the message you saw, to silence this particular warning.
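
To give a feel for what such a consistency check amounts to, here is a rough sketch of a cross-check one could apply to a Pythia event record (my own illustration; this is not the actual code of the output interface): every mother that a particle points to should list that particle back among its daughters.

// Rough sketch of a mother/daughter cross-check on a Pythia event record
// (illustrative only; this is not the actual output-interface code).
#include "Pythia8/Pythia.h"
#include <algorithm>
using namespace Pythia8;

// True if every mother of particle i also lists i among its daughters.
bool mothersListMeBack(const Event& event, int i) {
  for (int iMot : event[i].motherList()) {
    vector<int> dtrs = event[iMot].daughterList();
    if (std::find(dtrs.begin(), dtrs.end(), i) == dtrs.end()) return false;
  }
  return true;
}

One would call something like this on each entry of pythia.event after pythia.next() and flag the indices where it returns false; with Vincia's antenna bookkeeping, the second mother of an emission is exactly where a naive version of such a check can get confused.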

Thanks again,
Peter Skands
6) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 40869)
Posted 9 Dec 2019 by Peter Skands
Post:
Hi Henry,

That's an interesting proposal: sequestering any jobs that are judged not completely stable (for whatever reason) in a dedicated queue that people can opt into, while the main queue would be reserved for more streamlined production runs.

I think that the actual name "Theory-Native" would not be a good one to retain for the 'development' sub-project: it would be a mis-labelling that would eventually confuse people. But I'm curious to ask other LHC@home developers whether it would have merit, and be possible, to set up something like a "Theory-Beta" sub-project where we could put those tasks that are problematic.

As I've written about elsewhere, Sherpa is a complex code with advanced capabilities, and unfortunately also sometimes advanced ways of failing. Well, an infinite loop is not a particularly advanced failure of course, but in its defence, the things it is typically doing when it enters the loops you see are things that the other codes are not even attempting. Moreover, since none of the Test4Theory/LHC@home developers are Sherpa authors, all we can usually do is the same as ordinary users: submit a bug report to the Sherpa authors about what you guys are seeing, and hope that some of those loops eventually get fixed in a future version. Meanwhile, we still want to run the existing version of the code, since it is interesting to compare it, for those cases where it succeeds, to the available data. On our side, we can try to find ways of running it as "correctly" as possible, and use tricks to abort jobs that we can somehow detect as loopers, but despite some improvements on that, there clearly are still issues remaining.

I'd welcome feedback from other Test4Theory/LHC@home developers (and volunteers) on this idea, and on whether we would be able to implement it.

All the best,
Peter.
7) Message boards : Theory Application : Sherpa: Inaccurate rotation (Message 38519)
Posted 3 Apr 2019 by Peter Skands
Post:
Dear all,

Thanks for all the help charting this issue. Since we do not have any Sherpa authors on the LHC@home team, I have written to the Sherpa authors to ask for their help in understanding the source of these problems. As mentioned in a reply on another thread, we would really like to be able to keep producing comparisons to Sherpa - as this is one of the state-of-the-art Monte Carlo event generators that is being used at LHC. But of course, at the moment that point is kind of moot, since almost all jobs (at least for ee) are failing or lost. I hope to be able to provide another update on this soon.

Peter.
8) Message boards : Theory Application : Native Theory- Sherpa upgrade (Message 38517)
Posted 3 Apr 2019 by Peter Skands
Post:
Hi Maeax,

Just a note that the blog you linked to is not for the Sherpa Monte Carlo event generator but for a Python modelling and fitting package of the same name. The homepage of the Sherpa MC we run is this one:
https://gitlab.com/sherpa-team/sherpa

The most recent release version of the Sherpa MC is 2.2.6. On LHC@home we currently run up to 2.2.5, so we are reasonably up to date. The Sherpa jobs do fail more often than others, and we are investigating what to do about it; cf. some of the other threads on the forum.

Peter
9) Message boards : Theory Application : 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come? (Message 38516)
Posted 3 Apr 2019 by Peter Skands
Post:
Dear all,

From the project side, this is a difficult problem. We'd like to run Sherpa jobs - it is being used in LHC studies, so it is useful for the LHC@home project to run it. But as many of you are only too well aware, it is also the source of most of the problematic jobs in the project. We are not ourselves authors of the Sherpa Monte Carlo, so we can only pass along the issues to the authors and hope that they will be addressed in future releases.

In the past, Sherpa jobs on LHC@home had significant issues with infinite loops causing jobs to run to the time limit without finishing. Since versions are never patched backwards (bug fixes only go into new releases; old ones are not patched, for reproducibility), we had hoped to address this by deprecating some of the older Sherpa versions and only running the newest releases. I think that changeover happened relatively recently - though Anton Karneyeu, who has been working on updating the project over the past few months, can probably say more precisely when and what was done. However, as is clear from this thread and related ones, the new versions have issues of their own.

I will try to collate the feedback you have been providing and contact the authors about what we are seeing. I am not myself able to suggest a "hot fix" that could be applied easily to at least detect and kill the problematic jobs. The only thing I could do easily would be to remove Sherpa jobs from the request system - but then we would lose the possibility to compare to Sherpa at all, which would also not be ideal.

First of all, I'd like to understand the magnitude of the problem: is this a relatively rare occurrence that happens only for the occasional Sherpa job, or does it happen most of the time? I saw a log posted on this thread that had only one successful Sherpa job for 13 or 14 failed ones. If that is typical, we might as well give up on running Sherpa, at least for the time being. Any feedback on how prevalent this issue is - whether we are still getting a good number of Sherpa jobs finishing or hardly any - would be appreciated.

With best regards, and apologies to those who are (understandably) frustrated about their CPU time going to jobs that are not producing science!

Peter
10) Message boards : Theory Application : Theory's endless looping (Message 35142)
Posted 3 May 2018 by Peter Skands
Post:
Hi all,

Thanks as always for your dedication and patience. As Crystal mentions, the looping issue only affects a few Sherpa jobs, and as I think I've stated elsewhere, we still want to run those Sherpa versions to be able to display comparisons with them - as long as they are being used actively by researchers in the community. In the major new update of the jobs we are planning to start sending out shortly, some of the oldest Sherpa versions, which we deem are no longer being used, will be deprecated, and those also appear to be the most frequent loopers. I cannot promise that there will not still be *some* looping jobs happening in the newer versions. At least a partial consolation is that, as far as I know, you are still getting credits for them, even though I totally understand that it is frustrating that your CPU is basically idling during those jobs and not contributing to science.

The way particle-physics software development works is that once a version is publicly released, it is not changed any more, not even to fix bugs, even serious ones. Bug fixes and patches are of course developed and applied (and the presence of a serious bug can cause us to 'withdraw' a version, or at least issue a strong recommendation not to use it), but patches and fixes go into future versions; past ones are never 'patched backwards'. The reason we do it that way is reproducibility, which is extremely important in science: a given public version always produces the same results (even when this includes losing some jobs). That means that not only you, but also researchers who run these versions on their own computers or on clusters, have no choice but to accept that the jobs have an 'efficiency' which is not 100% (though still rather close to 100%, as far as I know).

I keep being impressed and amazed at what the volunteer community is capable of; the script that some of you have been talking about, which automatically detects the looping run condition and gracefully shuts down the job, is extremely nice. Although we don't currently have a lot of manpower for upgrading the run software, I will try hard to see if we can incorporate such a trigger in our default job setups, so that this could be done automatically. This is really great work!
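
For anyone curious what such a trigger might look like in practice, here is a purely hypothetical sketch (not the volunteers' actual script, whose details I don't have in front of me): it watches the job's log file and asks the job to shut down if the file stops growing for an extended period. The poll interval, the stall threshold, and the use of log-file growth as the liveness signal are all assumptions for illustration.

// Hypothetical "looper" watchdog sketch, in the spirit of the volunteers'
// script described above (this is NOT that script): terminate a job whose
// log file has stopped growing for a long time.
#include <cstdio>
#include <cstdlib>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char** argv) {
  if (argc < 3) {
    std::fprintf(stderr, "usage: %s <pid> <logfile>\n", argv[0]);
    return 1;
  }
  pid_t pid = (pid_t) std::atol(argv[1]);
  const char* logFile = argv[2];
  const unsigned int pollSeconds = 600;  // check every 10 minutes (assumption)
  const int maxStalledPolls = 6;         // ~1 hour without output => assume a loop

  off_t lastSize = -1;
  int stalled = 0;
  while (kill(pid, 0) == 0) {            // loop while the job process still exists
    struct stat st;
    off_t size = (stat(logFile, &st) == 0) ? st.st_size : (off_t) -1;
    stalled = (size == lastSize) ? stalled + 1 : 0;
    lastSize = size;
    if (stalled >= maxStalledPolls) {
      std::fprintf(stderr, "Log unchanged for too long; sending SIGTERM to %ld\n", (long) pid);
      kill(pid, SIGTERM);                // ask the job to shut down gracefully
      break;
    }
    sleep(pollSeconds);
  }
  return 0;
}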

All the best,
Peter
11) Message boards : Theory Application : Theory's endless looping (Message 33273)
Posted 11 Dec 2017 by Peter Skands
Post:
For the looping Sherpa jobs, we are considering deprecating some of the older Sherpa versions which should improve things somewhat. Generally, patches are not retroactively added to older code versions, so to the extent the underlying problem is fixed by the Sherpa authors, it will appear as improved behaviour in the newer versions, and most of the posts I see here about looping Sherpa jobs cite the older ones.

Ray, thanks for letting us know about the looping of the Pythia job. In this particular run, the "default-MBR" label means that the generator is using an alternative model for so-called diffractive processes, the Minimum Bias Rockefeller (MBR) model. (Diffraction, in the particle-physics context, occurs when one or both of the colliding protons fluctuate to spit out a little "ball" of gluons (called a Pomeron) which coherently carries some fraction of the proton momentum, and the other beam particle hits that ball of glue rather than the original proton.) In order to pinpoint whether the looping is associated with that particular model, or with a rare occurrence in Pythia in general, I would be very glad to know if anyone sees another looping Pythia job, and if so, whether it was again with the default-MBR model or with a different setting.
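
For anyone who wants to reproduce the setting on their own machine, the sketch below shows roughly how the MBR option is selected in the Pythia 8.2 series; please treat the exact parameter names as an assumption to be checked against the manual for the version you run, since they have been renamed between version series.

// Illustrative settings sketch (check the manual for your Pythia version;
// the names below follow the 8.2-series conventions and are an assumption here):
// run soft QCD, including diffraction, with the MBR Pomeron-flux option.
#include "Pythia8/Pythia.h"
using namespace Pythia8;

int main() {
  Pythia pythia;
  pythia.readString("Beams:eCM = 13000.");
  pythia.readString("SoftQCD:all = on");         // all soft-QCD processes, incl. diffraction
  pythia.readString("Diffraction:PomFlux = 5");  // option 5 = MBR Pomeron flux (8.2 series)
  if (!pythia.init()) return 1;
  for (int iEvent = 0; iEvent < 100; ++iEvent) pythia.next();
  pythia.stat();
  return 0;
}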
12) Message boards : Theory Application : fail to run sherpa 2.2.0 (Message 33272)
Posted 11 Dec 2017 by Peter Skands
Post:
Hi Ben and Crystal,

I have not managed to understand the nature of this problem. I would be interested to hear whether it has been seen again, or whether it was a one-off.


