Thread 'ATLAS WU failed'

Author	Message
computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,095,947 RAC: 11,595	Message 31947 - Posted: 14 Aug 2017, 13:47:28 UTC I recently got a WU from the 11855278 batch that failed after 5.5 min: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153614343 It's annoying not only that the initial download size has risen to 184 MB (!) (do you really care about volunteers with lower bandwidth?) but also that my firewall dropped a connection to pandaserver.cern.ch port 25443, which is not listed in the LHC@home FAQ as a required server/port. It would be nice if the project team could quickly check whether there is a misconfigured batch and keep the downloads far smaller than they are now. It would also be nice to get a response here; it would help me decide whether to set NNT for a while. ID: 31947
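
For anyone who wants to check whether their own firewall blocks the host and port named in the post above, here is a minimal sketch in Python. The helper name `can_connect` and the 5-second timeout are illustrative choices, not part of any LHC@home or BOINC tooling. It only tests whether a plain TCP connection can be opened; it does not say whether the project actually requires that port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Calling `can_connect("pandaserver.cern.ch", 25443)` from the affected machine returns `False` if the connection is dropped or refused anywhere along the path, so a `False` result alone does not prove the local firewall is the cause.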

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,710,976 RAC: 55,924	Message 31952 - Posted: 15 Aug 2017, 6:05:16 UTC What I observed was that several WUs were aborted by the server yesterday. One example here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153613677 No idea why this happened. Maybe they found out that there was a batch of faulty WUs? ID: 31952

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 31953 - Posted: 15 Aug 2017, 10:28:11 UTC Indeed, this was a misconfigured batch of tasks. We got 7 new batches and one was not configured correctly. All of its WUs were aborted yesterday; sorry for the inconvenience. The connection to pandaserver.cern.ch was due to the misconfiguration. As mentioned previously, we cannot change the size of the download, but we can increase the number of events processed per WU so that at least there are fewer downloads. ID: 31953

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,095,947 RAC: 11,595	Message 31954 - Posted: 15 Aug 2017, 10:40:55 UTC - in response to Message 31953. Thanks David. Recent WUs seem to run better. ID: 31954

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1530 Credit: 10,025,154 RAC: 1,422	Message 31968 - Posted: 16 Aug 2017, 14:51:43 UTC - in response to Message 31953. David Cameron wrote: As mentioned previously, we cannot change the size of the download, but we can increase the number of events processed per WU so that at least there are fewer downloads. I prefer the current number of events and predictable runtimes. Longer run times combined with the shortened deadlines announced today are not a good idea. Btw: the upload files will grow proportionally when the number of events is increased. ID: 31968
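
The trade-off discussed in the two posts above can be put into numbers with a small Python sketch. It assumes that each WU triggers one fixed-size download of 184 MB (the figure from the first post); the events-per-WU values are hypothetical, since the thread does not state the real ones. The function name `download_mb_per_event` is likewise only illustrative.

```python
def download_mb_per_event(download_mb: float, events_per_wu: int) -> float:
    """Download volume attributable to each event when one WU needs one download."""
    if events_per_wu <= 0:
        raise ValueError("events_per_wu must be positive")
    return download_mb / events_per_wu


DOWNLOAD_MB = 184.0  # download size reported in the first post

# Hypothetical events-per-WU settings; the real values are not given in the thread.
for events in (50, 100, 200):
    per_event = download_mb_per_event(DOWNLOAD_MB, events)
    print(f"{events:>4} events/WU: {per_event:.2f} MB downloaded per event")
```

Under these assumptions, doubling the events per WU halves the download cost per event. The upload per event stays roughly constant, so the upload per WU and the run time per WU grow in proportion, which is the concern raised above.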

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 31970 - Posted: 16 Aug 2017, 15:24:48 UTC I doubt that this is really an opinion poll, but one week works for me. LHCb takes longer to run than ATLAS, and has a one-week deadline. I have no problems with it. I keep the default 0.10 + 0.50 buffer however; I don't like big buffers anyway. ID: 31970