ATLAS WU failed


Message boards : ATLAS application : ATLAS WU failed

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,399,908
RAC: 3,711
Message 31947 - Posted: 14 Aug 2017, 13:47:28 UTC

I recently got a WU from the 11855278 batch that failed after 5.5 min:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=153614343

It's annoying not only that the initial download size rose to 184 MB (!) (do you really care about volunteers with lower bandwidth?) but also that my firewall dropped a connection to pandaserver.cern.ch port 25443, which is not listed in the LHC@home FAQ as a required server/port.

It would be nice if the project team could quickly check whether there is a misconfigured batch, and keep the downloads far smaller than they are now.
It would also be nice to get some response here. That would help me decide whether to set NNT (No New Tasks) for a while.
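
For anyone who wants to check whether their own firewall blocks the server and port mentioned above, here is a minimal Python sketch. The helper name `can_connect` and the 5-second timeout are illustrative assumptions, not part of any project tooling. It only tests whether a TCP connection can be opened; a `True` result says nothing about whether the task itself would run correctly.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration. A firewall that silently drops
    packets typically shows up here as a timeout, which is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

Calling `can_connect("pandaserver.cern.ch", 25443)` would test the connection that was dropped in this case.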

Erich56
Joined: 18 Dec 15
Posts: 304
Credit: 3,437,579
RAC: 8,426
Message 31952 - Posted: 15 Aug 2017, 6:05:16 UTC

What I observed was that several WUs were aborted by the server yesterday.

One example here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=153613677

No idea why this happened. Maybe they found out that there was a bunch of faulty WUs?

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 124
Credit: 2,875,749
RAC: 10,318
Message 31953 - Posted: 15 Aug 2017, 10:28:11 UTC

Indeed, this was a misconfigured batch of tasks. We got 7 new batches and one was not configured correctly. All the WUs in that batch were aborted yesterday; sorry for the inconvenience.

The connection to pandaserver.cern.ch was due to the misconfiguration.

As mentioned previously, we cannot change the size of the download, but we can increase the number of events processed per WU so that at least there are fewer downloads.
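
The effect of more events per WU on download volume can be illustrated with a short Python sketch. It assumes each WU triggers one download of the 184 MB reported earlier in this thread; the event counts and the function names are hypothetical and chosen only to show the arithmetic.

```python
import math

# Download size per WU as reported in this thread (assumed to apply to every WU).
DOWNLOAD_MB_PER_WU = 184

def downloads_needed(total_events: int, events_per_wu: int) -> int:
    """Number of WUs (and therefore downloads) needed to process total_events."""
    return math.ceil(total_events / events_per_wu)

def total_download_mb(total_events: int, events_per_wu: int,
                      download_mb: int = DOWNLOAD_MB_PER_WU) -> int:
    """Total download volume in MB for processing total_events."""
    return downloads_needed(total_events, events_per_wu) * download_mb

# Hypothetical workload: 1000 events at three different WU sizes.
for events_per_wu in (50, 100, 200):
    print(events_per_wu, "events/WU ->",
          total_download_mb(1000, events_per_wu), "MB downloaded")
```

Under these assumptions, doubling the events per WU halves the total download volume for the same amount of work, while the size of each individual download stays the same.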

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,399,908
RAC: 3,711
Message 31954 - Posted: 15 Aug 2017, 10:40:55 UTC - in response to Message 31953.

David Cameron wrote:
Indeed, this was a misconfigured batch of tasks. We got 7 new batches and one was not configured correctly. All the WUs in that batch were aborted yesterday; sorry for the inconvenience.

The connection to pandaserver.cern.ch was due to the misconfiguration.

As mentioned previously, we cannot change the size of the download, but we can increase the number of events processed per WU so that at least there are fewer downloads.

Thanks David.

Recent WUs seem to run better.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 328
Credit: 2,772,160
RAC: 3,191
Message 31968 - Posted: 16 Aug 2017, 14:51:43 UTC - in response to Message 31953.

David Cameron wrote:
As mentioned previously, we cannot change the size of the download, but we can increase the number of events processed per WU so that at least there are fewer downloads.

I prefer the current number of events and predictable runtimes.
Combining longer run times with the shortened deadlines announced today is not a good idea.
Btw: the upload files will grow proportionally when the number of events is increased.

Jim1348
Joined: 15 Nov 14
Posts: 71
Credit: 3,033,536
RAC: 10,549
Message 31970 - Posted: 16 Aug 2017, 15:24:48 UTC

I doubt that this is really an opinion poll, but one week works for me. LHCb takes longer to run than ATLAS, and has a one-week deadline. I have no problems with it.
However, I keep the default 0.10 + 0.50 buffer; I don't like big buffers anyway.
