Thread 'Download failures'

Author	Message
thomasroderick Joined: 22 May 17 Posts: 16 Credit: 1,257,383 RAC: 0	Message 31711 - Posted: 30 Jul 2017, 0:32:04 UTC
Ever since the day that there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running ATLAS fine for a long time, where it would download 4 tasks and run them through without issue. What I am seeing now (and since the day of the server cleanup) is that my machine will attempt to download files for tasks and get stuck retrying on a few for several hours.
7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.
I have Updated, Restarted, Removed and re-added the LHC project several times, over several days. Suggestions? Tried 2 different networks (home and work), same issues. Connections to other projects are no issue.

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,353,716 RAC: 65,686	Message 31712 - Posted: 30 Jul 2017, 3:59:12 UTC - in response to Message 31711.
This is exactly what I am experiencing now. So all of us who have this problem can at least be sure that it has nothing to do with our systems. It's rather up to CERN to fix this issue. All we can do is hope that someone there reads this thread and takes some action.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1553 Credit: 10,091,787 RAC: 1,758	Message 31724 - Posted: 30 Jul 2017, 9:58:04 UTC Last modified: 30 Jul 2017, 11:16:05 UTC
Opening this thread because I experienced download issues with the 120MB ATLAS task files, as others did but mentioned in another thread (I will move those posts here).
<core_client_version>7.7.2</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_5e60912e104e160658713cac240e41fb</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
I've also noticed today and yesterday that when visiting webpages on LHC@home and returning to a previous page, I sometimes get the browser notice: network changed detected.
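
The error text above says the MD5 checksum of the downloaded file did not match. For anyone who wants to check a partially or fully downloaded file by hand, here is a minimal sketch in Python. The helper names `md5_of_file` and `matches_expected` are made up for illustration and are not part of BOINC, and the expected digest has to come from the project side; where exactly the client gets it from is not shown in this thread.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that a 120MB input file does not
    have to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare the file's digest with an expected hex digest (hypothetical helper)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch on a file that took several retries to arrive would be consistent with a truncated or corrupted transfer, but the sketch cannot tell which of the two happened.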

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,353,716 RAC: 65,686	Message 31725 - Posted: 30 Jul 2017, 10:39:45 UTC - in response to Message 31724.
"Opening this thread" ... great idea, thanks! What will the next steps be? In particular, how can/will this problem be brought to the attention of the people at CERN who are in charge of repairing such network issues?

XOX Joined: 20 Feb 16 Posts: 3 Credit: 46,306 RAC: 0	Message 31726 - Posted: 30 Jul 2017, 10:43:47 UTC
Hey guys, I have the same problem. I hope the CERN team can fix it.
30.07.2017 12:29:34 | LHC@home | Started download of boinc_job_script.8sdJMf
30.07.2017 12:29:35 | LHC@home | Finished download of boinc_job_script.8sdJMf
30.07.2017 12:29:53 | LHC@home | Temporarily failed download of jf_122159ff524343058e02c7137926559d: connect() failed
30.07.2017 12:29:53 | LHC@home | Backing off 00:03:14 on download of jf_122159ff524343058e02c7137926559d
30.07.2017 12:30:04 | | Project communication failed: attempting access to reference site
30.07.2017 12:30:06 | | Internet access OK - project servers may be temporarily down.

Greger Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 31727 - Posted: 30 Jul 2017, 13:44:27 UTC
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Temporarily failed download of jf_3536bf3e25f337041aca72316e5e0fec: transient HTTP error
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Backing off 00:25:41 on download of jf_3536bf3e25f337041aca72316e5e0fec
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Temporarily failed download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2: transient HTTP error
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Backing off 00:16:41 on download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2
Sun 30 Jul 2017 02:42:18 PM CEST | | Internet access OK - project servers may be temporarily down.
With debug:
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] HTTP_OP::init_get(): http://boincai04.cern.ch/Atlas-test/download/10d/vnPNDmwxevqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmhFLKDmFy3E7n_EVNT.11266146._002827.pool.root.1
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | Started download of jf_3536bf3e25f337041aca72316e5e0fec
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] HTTP_OP::init_get(): http://boincai04.cern.ch/Atlas-test/download/13c/GmYNDmcofvqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmuHLKDmJIpshn_EVNT.11266146._002831.pool.root.1
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | Started download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1548] Info: Connection 853 seems to be dead!
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1548] Info: Closing connection 853
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1549] Info: Found bundle for host boincai04.cern.ch: 0x559afaf3cfe0 [serially]
Sun 30 Jul 2017 03:37:54 PM CEST | | [network_status] status: online
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1548] Info: Trying 128.142.202.86...
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Hostname was found in DNS cache
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Trying 128.142.202.86...

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,353,716 RAC: 65,686	Message 31728 - Posted: 30 Jul 2017, 14:12:53 UTC - in response to Message 31727. Last modified: 30 Jul 2017, 14:20:32 UTC
"Info: Hostname was found in DNS cache
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Trying 128.142.202.86..."
Pinging 128.142.202.86 yields "request timed out", which was to be expected :-(
With tracert, the last successful connection is with e513-e-rbrxl-1-ne0.cern.ch [192.65.184.37]; after this, again "timeout".

djoser Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0	Message 31729 - Posted: 30 Jul 2017, 17:31:32 UTC
I have the same download problems on one of my machines, which is dedicated to ATLAS tasks. I have the feeling that this situation is somehow related to Sixtrack. Whenever Sixtrack has thousands of workunits in the queue, ATLAS seems to get "hiccups". I recall similar problems a few weeks ago, the last time Sixtrack had so many WUs to distribute. Could the two be related?
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us

Jesse Viviano Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0	Message 31732 - Posted: 30 Jul 2017, 23:01:30 UTC
Did someone move the ATLAS@home download server to another IP address? I noticed that my BOINC client cannot connect to the download server at all in regards to the ATLAS@home tasks, while it is able to download other tasks. If that is the case, the solution could be to wait for the old DNS entry to expire. However, if someone changed the DNS without moving the ATLAS@home server to the new IP address, then either the DNS server's entry for the ATLAS@home download server needs to be changed back or the ATLAS@home server needs to be moved to the new IP address.
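
One way to look into the DNS question raised here is to see which IPv4 addresses the download host currently resolves to and compare them with the address in the debug log earlier in the thread (128.142.202.86 for boincai04.cern.ch). This is only a rough sketch in Python; `resolve_ipv4` is a made-up helper name, and the result reflects whatever your local resolver or cache returns, which may differ from what the BOINC client has cached.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses for `hostname`.

    Uses the system resolver, so the answer depends on local DNS
    settings and caches. Raises socket.gaierror if resolution fails.
    """
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})
```

For example, `resolve_ipv4("boincai04.cern.ch")` could be compared with the address from the log. If they match, a stale DNS entry on the client side becomes a less likely explanation, although that alone does not show whether the server itself was moved.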

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 31734 - Posted: 31 Jul 2017, 0:32:56 UTC Last modified: 31 Jul 2017, 0:34:27 UTC
From what I saw, the problem only occurs when downloading the biggest file (110 - 120 MB); the other files of the task download without problem. Also, the problem developed progressively: 2 days ago the download was still possible, but extremely slow and only after multiple retries. Now the download fails systematically, with the message "server backoff".
We are the product of random evolution.

Harri Liljeroos Joined: 28 Sep 04 Posts: 804 Credit: 65,818,891 RAC: 26,588	Message 31736 - Posted: 31 Jul 2017, 8:09:31 UTC
Atlas downloads are working again, I've got a couple of tasks this morning.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1308 Credit: 96,606,922 RAC: 69,927	Message 31737 - Posted: 31 Jul 2017, 9:15:12 UTC
Atlas was down for the weekend but is trying to get back to work now.
Volunteer Mad Scientist For Life
unbelievable are you trying to promote linux again?

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,353,716 RAC: 65,686	Message 31738 - Posted: 31 Jul 2017, 10:31:00 UTC - in response to Message 31737.
"Atlas was down for the weekend but is trying to get back to work now."
I just pinged 128.142.202.86 and this now worked (in contrast to the past few days). However, when applying tracert to this IP, the last communication is with e513-e-rbrxl-1-ne0.cern.ch [192.65.184.37]; after this, there is a timeout. When pinging 192.65.184.37, there is a timeout as well. So obviously, the problem still exists (to some extent).

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 5,948	Message 31739 - Posted: 31 Jul 2017, 11:39:29 UTC - in response to Message 31738. Last modified: 31 Jul 2017, 11:39:49 UTC
"So obviously, the problem still exists (to some extent)"
My clients have successfully downloaded work and filled up their buffers again; that wouldn't have been possible if there were still a problem. It is normal for most servers on the Internet that traceroute cannot trace the whole route to the target.
Supporting BOINC, a great concept !
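
Since ping and traceroute replies are often blocked along the way, a more direct check is to try the same thing the client does: open a TCP connection to the download host on the HTTP port (the debug log earlier in the thread shows plain http:// URLs, so port 80 is assumed here). A minimal sketch in Python; `can_connect` is a made-up helper name, not a BOINC function.

```python
import socket


def can_connect(host, port=80, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Any failure (DNS error, refused connection, timeout) is reported
    as False, which roughly corresponds to the "connect() failed"
    message in the client log.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("boincai04.cern.ch")` tests only whether the connection can be established. It says nothing about whether a large file transfer over that connection would complete, so a True result does not rule out the slow or interrupted downloads reported in this thread.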

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,353,716 RAC: 65,686	Message 31740 - Posted: 31 Jul 2017, 12:06:18 UTC - in response to Message 31739.
"It is normal for most servers on the Internet that traceroute cannot trace the whole route to the target"
Okay, thanks for the information; I was not aware of that. So I'll try ATLAS again today.

thomasroderick Joined: 22 May 17 Posts: 16 Credit: 1,257,383 RAC: 0	Message 31747 - Posted: 1 Aug 2017, 3:48:10 UTC
Coming full circle on the thread... I was able to successfully download Atlas files and tasks this evening. The downloads started off a little slow on the throughput, otherwise there were no issues. All is well again, thank you!

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,353,716 RAC: 65,686	Message 31748 - Posted: 1 Aug 2017, 5:16:20 UTC
After I was able to download several ATLAS tasks since yesterday, just now a new ATLAS task download got stuck again on the 116MB file (all other, smaller files downloaded well). So the recent problem seems to be back :-((( What's going on at CERN?

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 31749 - Posted: 1 Aug 2017, 5:34:14 UTC - in response to Message 31748.
Although I am a "sixtrack" man, I am following this, as, I am sure, are my colleagues. My PERSONAL position is that there are serious network/server overload problems and that errors are not being recovered, but that is just me..........Eric.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1553 Credit: 10,091,787 RAC: 1,758	Message 31750 - Posted: 1 Aug 2017, 5:42:41 UTC
It's obviously holiday time, and so it is for my machine when it wants to run ATLAS. Again 2 tasks:
<core_client_version>7.7.2</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_7cd27135204b4d2716c62ba7aab9f41f</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
and
<core_client_version>7.7.2</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_97f95c9e9dae64907e7b324f5bf84ba1</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,228,027 RAC: 95,263	Message 31751 - Posted: 1 Aug 2017, 6:09:28 UTC
I see the same errors on both of my hosts:
- Download of the large job file runs into transient HTTP errors several times.
- When the download finally succeeds and the job starts, the download of the smaller files is very slow and most of them are downloaded from a spare server (ccfrontier.in2p3.fr, port 23128).
- After all downloads are finished, the job fails with error 65.
- Increasing the RAM setting for the VM does not solve the problem.
- It affects only ATLAS; other vbox projects from CERN run OK.
Altogether it looks like a network or firewall problem at CERN or its partners. Sad to say that since Erich56 pointed out the problem, nobody responsible for ATLAS has written a single word here on the message board. Are you aware of it?