Download failures



Message boards : ATLAS application : Download failures

thomasroderick
Joined: 22 May 17
Posts: 10
Credit: 230,327
RAC: 2,637
Message 31711 - Posted: 30 Jul 2017, 0:32:04 UTC

Ever since the issue that followed the cleanup of old files, I have been experiencing the same problems. LHC was running Atlas fine for a long time: it would download 4 tasks and run them through without issue. What I am seeing now (and since the day of the server cleanup) is that my machine attempts to download files for tasks and gets stuck retrying a few of them for several hours.

7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.

I have updated, restarted, and removed and re-added the LHC project several times over several days. Suggestions? I tried 2 different networks (home and work), with the same result. Connections to other projects are no issue.
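
For anyone who wants to see how often each file fails and how long the client is backing off, here is a minimal sketch that tallies this from event-log lines like the ones above. It assumes the message wording shown in this post ("Temporarily failed download of ..." and "Backing off HH:MM:SS on download of ..."); the function name `summarize` is made up for this example and is not part of BOINC.

```python
import re
from datetime import timedelta

# Patterns based only on the message wording quoted in this thread (an assumption:
# other client versions or languages may word these messages differently).
FAILED = re.compile(r"Temporarily failed download of (\S+): (.+)$")
BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (\S+)")


def _new_entry():
    return {"failures": 0, "reasons": set(), "backoff": timedelta()}


def summarize(lines):
    """Tally failures, failure reasons and total backoff time per file name."""
    summary = {}
    for line in lines:
        line = line.strip()
        failed = FAILED.search(line)
        if failed:
            entry = summary.setdefault(failed.group(1), _new_entry())
            entry["failures"] += 1
            entry["reasons"].add(failed.group(2))
            continue
        backoff = BACKOFF.search(line)
        if backoff:
            hours, minutes, seconds, name = backoff.groups()
            entry = summary.setdefault(name, _new_entry())
            entry["backoff"] += timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
    return summary
```

To use it, pass the lines of a saved copy of the event log to `summarize` and print the resulting dictionary. Files that show many failures and a large total backoff are the ones the client is stuck on.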

Erich56
Joined: 18 Dec 15
Posts: 304
Credit: 3,437,579
RAC: 8,426
Message 31712 - Posted: 30 Jul 2017, 3:59:12 UTC - in response to Message 31711.



This is exactly what I am experiencing now.
So all of us who have this problem can at least be sure that it has nothing to do with our systems. It's rather up to CERN to fix this issue.
All we can do is hope that someone there reads this thread and takes some action.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 328
Credit: 2,772,160
RAC: 3,191
Message 31724 - Posted: 30 Jul 2017, 9:58:04 UTC
Last modified: 30 Jul 2017, 11:16:05 UTC

Opening this thread because I experienced download issues with the 120 MB ATLAS task files, as others did but reported in another thread (I will move those posts here).

<core_client_version>7.7.2</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_5e60912e104e160658713cac240e41fb</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>

I've also noticed, today and yesterday, that when I visit web pages on LHC@home and return to a previous page, I sometimes get the browser notice: network changed detected.
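
Regarding the -119 error above: one way to tell whether a downloaded file really is corrupt is to compute its MD5 digest locally and compare it with the digest the client expects. The following is a minimal sketch only. The function names are invented for this example, and where the expected digest comes from is left open here (it would have to be taken from wherever the client or server records it).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks (1 MiB by default)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare the local digest with an expected one, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for files in the 120 MB range. If the digests differ after a download that the client reported as complete, the file was damaged or truncated somewhere between the server and the disk.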

Erich56
Joined: 18 Dec 15
Posts: 304
Credit: 3,437,579
RAC: 8,426
Message 31725 - Posted: 30 Jul 2017, 10:39:45 UTC - in response to Message 31724.

Opening this thread ...

great idea, thanks!

What will the next steps be? In particular, how can and will this problem be brought to the attention of the people at CERN who are in charge of repairing such network issues?

XOX
Joined: 20 Feb 16
Posts: 3
Credit: 36,988
RAC: 958
Message 31726 - Posted: 30 Jul 2017, 10:43:47 UTC

Hey guys, I have the same problem. I hope the CERN team can fix it.

30.07.2017 12:29:34 | LHC@home | Started download of boinc_job_script.8sdJMf
30.07.2017 12:29:35 | LHC@home | Finished download of boinc_job_script.8sdJMf
30.07.2017 12:29:53 | LHC@home | Temporarily failed download of jf_122159ff524343058e02c7137926559d: connect() failed
30.07.2017 12:29:53 | LHC@home | Backing off 00:03:14 on download of jf_122159ff524343058e02c7137926559d
30.07.2017 12:30:04 | | Project communication failed: attempting access to reference site
30.07.2017 12:30:06 | | Internet access OK - project servers may be temporarily down.

Gunde
Joined: 9 Jan 15
Posts: 5
Credit: 40,049,389
RAC: 184,308
Message 31727 - Posted: 30 Jul 2017, 13:44:27 UTC

Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Temporarily failed download of jf_3536bf3e25f337041aca72316e5e0fec: transient HTTP error
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Backing off 00:25:41 on download of jf_3536bf3e25f337041aca72316e5e0fec
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Temporarily failed download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2: transient HTTP error
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Backing off 00:16:41 on download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2
Sun 30 Jul 2017 02:42:18 PM CEST | | Internet access OK - project servers may be temporarily down.

With debug:
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] HTTP_OP::init_get(): http://boincai04.cern.ch/Atlas-test/download/10d/vnPNDmwxevqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmhFLKDmFy3E7n_EVNT.11266146._002827.pool.root.1
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | Started download of jf_3536bf3e25f337041aca72316e5e0fec
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] HTTP_OP::init_get(): http://boincai04.cern.ch/Atlas-test/download/13c/GmYNDmcofvqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmuHLKDmJIpshn_EVNT.11266146._002831.pool.root.1
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | Started download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1548] Info: Connection 853 seems to be dead!
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1548] Info: Closing connection 853
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1549] Info: Found bundle for host boincai04.cern.ch: 0x559afaf3cfe0 [serially]
Sun 30 Jul 2017 03:37:54 PM CEST | | [network_status] status: online
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1548] Info: Trying 128.142.202.86...
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Hostname was found in DNS cache
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Trying 128.142.202.86...

Erich56
Joined: 18 Dec 15
Posts: 304
Credit: 3,437,579
RAC: 8,426
Message 31728 - Posted: 30 Jul 2017, 14:12:53 UTC - in response to Message 31727.
Last modified: 30 Jul 2017, 14:20:32 UTC

Info: Hostname was found in DNS cache
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Trying 128.142.202.86...

Pinging 128.142.202.86 yields "request timed out", which was to be expected :-(

With tracert, the last hop that responds is
e513-e-rbrxl-1-ne0.cern.ch [192.65.184.37]

After this, there is again a timeout.

djoser
Joined: 30 Aug 14
Posts: 15
Credit: 1,811,677
RAC: 1,564
Message 31729 - Posted: 30 Jul 2017, 17:31:32 UTC

I got the same download problems with one of my machines, which is dedicated to Atlas tasks.

I have the feeling that this situation is somehow related to Sixtrack. Whenever Sixtrack has thousands of workunits in the queue, Atlas seems to get "hiccups". I recall similar problems a few weeks ago, the last time Sixtrack had so many WUs to distribute.

Could the two be related?
____________
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! www.gridcoin.us

Jesse Viviano
Joined: 12 Feb 14
Posts: 59
Credit: 842,858
RAC: 1,844
Message 31732 - Posted: 30 Jul 2017, 23:01:30 UTC

Did someone move the ATLAS@home download server to another IP address? I noticed that my BOINC client cannot connect to the download server at all for ATLAS@home tasks, while it is able to download other tasks. If that is the case, the solution could be to wait for the old DNS entry to expire. However, if someone changed the DNS without moving the ATLAS@home server to the new IP address, then either the DNS server's entry for the ATLAS@home download server needs to be changed back or the ATLAS@home server needs to be moved to the new IP address.
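
A quick way to test the DNS theory is to ask the local resolver what the download host currently resolves to and compare that with the address the client has been trying. This is a rough sketch. The helper names are made up for this example, and it only shows what this machine's resolver returns, not what the authoritative CERN DNS servers hold.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def still_points_to(hostname, expected_ip):
    """True if expected_ip is among the addresses hostname resolves to right now."""
    try:
        return expected_ip in resolve_ipv4(hostname)
    except socket.gaierror:
        # The name did not resolve at all.
        return False
```

With the host name and address from the debug log earlier in this thread, the call would be `still_points_to("boincai04.cern.ch", "128.142.202.86")`. A result of False would suggest the DNS entry has changed or that a stale entry is cached somewhere. A result of True would point away from DNS and toward the server or the network path.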

HerveUAE
Joined: 18 Dec 16
Posts: 101
Credit: 5,269,439
RAC: 25,064
Message 31734 - Posted: 31 Jul 2017, 0:32:56 UTC
Last modified: 31 Jul 2017, 0:34:27 UTC

From what I saw, the problem only occurs when downloading the biggest file (110-120 MB); the other files of the task download without problems.
Also, the problem developed progressively: 2 days ago the download was still possible, though extremely slow and only after multiple retries. Now the download fails systematically, with the message "server backoff".
____________
We are the product of random evolution.

Harri Liljeroos
Joined: 28 Sep 04
Posts: 189
Credit: 6,003,668
RAC: 4,414
Message 31736 - Posted: 31 Jul 2017, 8:09:31 UTC

Atlas downloads are working again, I've got a couple of tasks this morning.
____________

MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 494
Credit: 14,295,892
RAC: 11,839
Message 31737 - Posted: 31 Jul 2017, 9:15:12 UTC

Atlas was down for the weekend but is trying to get back to work now.
____________
Volunteer Mad Scientist For Life

Erich56
Joined: 18 Dec 15
Posts: 304
Credit: 3,437,579
RAC: 8,426
Message 31738 - Posted: 31 Jul 2017, 10:31:00 UTC - in response to Message 31737.

Atlas was down for the weekend but is trying to get back to work now.

I just pinged 128.142.202.86 - this now worked (in contrast to the past few days);
however, when applying tracert to this IP, the last communication is with
e513-e-rbrxl-1-ne0.cern.ch [192.65.184.37]
after this, there is a timeout.

When pinging 192.65.184.37, there is a timeout as well.

So obviously, the problem still exists (to some extent)

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 281
Credit: 41,058,368
RAC: 50,570
Message 31739 - Posted: 31 Jul 2017, 11:39:29 UTC - in response to Message 31738.
Last modified: 31 Jul 2017, 11:39:49 UTC

So obviously, the problem still exists (to some extent)

My clients have successfully downloaded work and filled up their buffers again; that wouldn't have been possible if there were still a problem.

It is normal for most servers on the Internet that traceroute cannot trace the whole route to the target.
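
Since ping and traceroute can be filtered along the way, a more telling check is whether a TCP connection to the download server's HTTP port can actually be opened. A minimal sketch, assuming port 80 because the download URLs in the debug log earlier in this thread use plain http; the function name is made up for this example:

```python
import socket


def can_connect(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, `can_connect("boincai04.cern.ch")` tests the host named in that debug log. A True result only shows that the port accepts connections; it does not prove that a large file transfer will complete.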
____________


Supporting BOINC, a great concept !

Erich56
Joined: 18 Dec 15
Posts: 304
Credit: 3,437,579
RAC: 8,426
Message 31740 - Posted: 31 Jul 2017, 12:06:18 UTC - in response to Message 31739.

It is normal for most servers on the Internet that traceroute cannot trace the whole route to the target.

Okay, thanks for the information; I was not aware of that. So I'll try ATLAS later today.

thomasroderick
Joined: 22 May 17
Posts: 10
Credit: 230,327
RAC: 2,637
Message 31747 - Posted: 1 Aug 2017, 3:48:10 UTC

Coming full circle on the thread... I was able to successfully download Atlas files and tasks this evening. The downloads started off a little slow on the throughput, otherwise there were no issues. All is well again, thank you!

Erich56
Joined: 18 Dec 15
Posts: 304
Credit: 3,437,579
RAC: 8,426
Message 31748 - Posted: 1 Aug 2017, 5:16:20 UTC

I have been able to download several ATLAS tasks since yesterday, but just now a new ATLAS task download got stuck again on the 116 MB file (all the other, smaller files downloaded fine).

So, the recent problem seems to be back :-(((
What's going on at CERN?

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 837
Credit: 1,421,222
RAC: 1,127
Message 31749 - Posted: 1 Aug 2017, 5:34:14 UTC - in response to Message 31748.

Although I am a "sixtrack" man, I am following this, as I am sure my
colleagues are. My PERSONAL position is that there are serious network/server
overload problems and that errors are not being recovered, but that is just me..........Eric.



____________

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 328
Credit: 2,772,160
RAC: 3,191
Message 31750 - Posted: 1 Aug 2017, 5:42:41 UTC

It's obviously holiday time, and so it is for my machine whenever it wants to run ATLAS.

Again 2 tasks:

<core_client_version>7.7.2</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_7cd27135204b4d2716c62ba7aab9f41f</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>


and

<core_client_version>7.7.2</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_97f95c9e9dae64907e7b324f5bf84ba1</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,399,908
RAC: 3,711
Message 31751 - Posted: 1 Aug 2017, 6:09:28 UTC

I see the same errors on both of my hosts:
- Download of the large job file runs into transient HTTP errors several times.
- When the download finally succeeds and the job starts, the download of the smaller files is very slow, and most of them are downloaded from a spare server (ccfrontier.in2p3.fr, port 23128).
- After all downloads have finished, the job fails with error 65.
- Increasing the RAM setting for the VM does not solve the problem.
- It affects only ATLAS; other vbox projects from CERN run OK.

Altogether, it looks like a network or firewall problem at CERN or its partners.


Sad to say, since Erich56 pointed out the problem, nobody responsible for ATLAS has written a single word here on the message board.
Are you aware of it?
