Download failures
Author | Message |
---|---|
Joined: 22 May 17 Posts: 15 Credit: 1,226,011 RAC: 569 |
Ever since the day that there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running ATLAS fine for a long time, where it would download 4 tasks and run them through without issue. What I am seeing now (and since the day of the server cleanup) is that my machine will attempt to download files for tasks and get stuck retrying on a few for several hours.
7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.
I have updated, restarted, and removed and re-added the LHC project several times over several days. Suggestions? I tried 2 different networks (home and work), with the same issues. Connections to other projects are no issue. |
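The back-off entries in a log like the one above can be tallied with a short script. This is only a minimal sketch, assuming the event log has been saved as text in the `timestamp | project | message` layout shown in the post; `parse_backoffs` and `BACKOFF_RE` are illustrative names, not part of BOINC.

```python
import re
from datetime import timedelta

# Matches messages such as:
#   "Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e"
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (\S+)")


def parse_backoffs(log_lines):
    """Return {file_name: longest back-off as a timedelta} for the given lines."""
    backoffs = {}
    for line in log_lines:
        # Assumed layout: "timestamp | project | message"; keep the last field.
        message = line.split("|")[-1].strip()
        match = BACKOFF_RE.search(message)
        if match is None:
            continue
        hours, minutes, seconds = (int(part) for part in match.groups()[:3])
        delay = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        name = match.group(4)
        # Keep the longest back-off seen for each file.
        backoffs[name] = max(delay, backoffs.get(name, timedelta(0)))
    return backoffs


sample = [
    "7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of "
    "jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed",
    "7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of "
    "jf_f3ff3ac08153d0ee04ea606f0dea9a0e",
]
result = parse_backoffs(sample)
for name, delay in result.items():
    print(name, delay)
```

Splitting on `|` rather than parsing the timestamp keeps the sketch usable with the other date formats that appear in this thread.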
Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
"Ever since the day that there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running ATLAS fine for a long time, where it would download 4 tasks and run them through without issue. What I am seeing now (and since the day of the server cleanup) is that my machine will attempt to download files for tasks and get stuck retrying on a few for several hours."
This is exactly what I am experiencing now. So all of us who have this problem can at least be sure that it has nothing to do with our systems. It's rather up to CERN to fix this issue. All we can do is hope that someone there reads this thread and takes some action. |
Joined: 14 Jan 10 Posts: 1411 Credit: 9,433,926 RAC: 11,615 |
I am opening this thread because, like others, I experienced download issues with the 120MB ATLAS task files; those reports were made in another thread (I will move those posts here).
<core_client_version>7.7.2</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_5e60912e104e160658713cac240e41fb</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
Today and yesterday I have also noticed that, when visiting webpages on LHC@home and returning to a previous page, I sometimes get the browser notice "network change detected". |
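Error -119 above reports that the MD5 checksum of a downloaded file did not match. A similar check can be done by hand on any file. This is a minimal sketch, assuming you already know the expected checksum (where to obtain it is not covered in this thread); `md5_of_file` and `matches_expected` are hypothetical helper names, and the demo uses a throwaway temporary file rather than a real ATLAS input file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """True if the file's MD5 equals the expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demo on a temporary file; a real check would point at the downloaded file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_digest = md5_of_file(demo_path)
demo_ok = matches_expected(demo_path, hashlib.md5(b"hello").hexdigest())
os.remove(demo_path)
print(demo_digest, demo_ok)
```

Reading in chunks matters for files of the size discussed here (over 100MB), since it avoids loading the whole file into memory.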
Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
"Opening this thread ..."
Great idea, thanks! What will the next steps be? In particular, I am asking how this problem can or will be brought to the attention of the people at CERN who are in charge of repairing such network issues. |
Joined: 20 Feb 16 Posts: 3 Credit: 46,306 RAC: 0 |
Hey guys, I have the same problem. I hope the CERN team can fix it.
30.07.2017 12:29:34 | LHC@home | Started download of boinc_job_script.8sdJMf
30.07.2017 12:29:35 | LHC@home | Finished download of boinc_job_script.8sdJMf
30.07.2017 12:29:53 | LHC@home | Temporarily failed download of jf_122159ff524343058e02c7137926559d: connect() failed
30.07.2017 12:29:53 | LHC@home | Backing off 00:03:14 on download of jf_122159ff524343058e02c7137926559d
30.07.2017 12:30:04 | | Project communication failed: attempting access to reference site
30.07.2017 12:30:06 | | Internet access OK - project servers may be temporarily down. |
Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Temporarily failed download of jf_3536bf3e25f337041aca72316e5e0fec: transient HTTP error
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Backing off 00:25:41 on download of jf_3536bf3e25f337041aca72316e5e0fec
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Temporarily failed download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2: transient HTTP error
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Backing off 00:16:41 on download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2
Sun 30 Jul 2017 02:42:18 PM CEST | | Internet access OK - project servers may be temporarily down.
With debug:
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] HTTP_OP::init_get(): http://boincai04.cern.ch/Atlas-test/download/10d/vnPNDmwxevqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmhFLKDmFy3E7n_EVNT.11266146._002827.pool.root.1
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | Started download of jf_3536bf3e25f337041aca72316e5e0fec
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] HTTP_OP::init_get(): http://boincai04.cern.ch/Atlas-test/download/13c/GmYNDmcofvqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmuHLKDmJIpshn_EVNT.11266146._002831.pool.root.1
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | Started download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1548] Info: Connection 853 seems to be dead!
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1548] Info: Closing connection 853
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1549] Info: Found bundle for host boincai04.cern.ch: 0x559afaf3cfe0 [serially]
Sun 30 Jul 2017 03:37:54 PM CEST | | [network_status] status: online
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1548] Info: Trying 128.142.202.86...
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Hostname was found in DNS cache
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Trying 128.142.202.86... |
Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
"Info: Hostname was found in DNS cache"
Pinging 128.142.202.86 yields "request timed out", which was to be expected :-(
With tracert, the last hop that responds is e513-e-rbrxl-1-ne0.cern.ch [192.65.184.37]; after this, again "timeout". |
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
I have the same download problems on one of my machines, which is dedicated to ATLAS tasks. I have the feeling that this situation is somehow related to Sixtrack. Whenever Sixtrack has thousands of workunits in the queue, ATLAS seems to get "hiccups". I recall similar problems a few weeks ago, the last time Sixtrack had so many WUs to distribute. Could the two be related?
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 12 Feb 14 Posts: 72 Credit: 4,639,155 RAC: 0 |
Did someone move the ATLAS@home download server to another IP address? I noticed that my BOINC client cannot connect to the download server at all for ATLAS@home tasks, while it is able to download other tasks. If the server was moved, the solution could be to wait for the old DNS entry to expire. However, if someone changed the DNS without moving the ATLAS@home server to the new IP address, then either the DNS entry for the ATLAS@home download server needs to be changed back or the ATLAS@home server needs to be moved to the new IP address. |
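One way to look into the DNS question is to compare what a hostname currently resolves to with the address the client reported trying. This is a minimal sketch, not a diagnosis: the host `boincai04.cern.ch` and the address `128.142.202.86` are taken from the debug log earlier in the thread, while `resolve_ipv4` and `dns_matches` are hypothetical helper names. The resolver is passed in as a parameter so the logic can be tried without network access.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses for a hostname."""
    infos = resolver(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


def dns_matches(hostname, expected_ip, resolver=socket.getaddrinfo):
    """True if expected_ip is among the addresses the hostname resolves to."""
    return expected_ip in resolve_ipv4(hostname, resolver)


def fake_resolver(host, port, family, socktype):
    """Stand-in resolver for an offline demo; returns one fixed record."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("128.142.202.86", port))]


print(dns_matches("boincai04.cern.ch", "128.142.202.86", resolver=fake_resolver))
```

Calling `dns_matches("boincai04.cern.ch", "128.142.202.86")` without the `resolver` argument would query the system resolver instead. Note that a match only shows that DNS and the client agree on the address; it says nothing about whether the server behind that address is reachable.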
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
From what I saw, the problem only occurs when downloading the biggest file (110 - 120 MB); the other files of the task download without problem. Also, the problem developed progressively: 2 days ago the download was still possible, but extremely slow and only after multiple retries. Now the download fails systematically, with the message "server backoff".
We are the product of random evolution. |
Joined: 28 Sep 04 Posts: 722 Credit: 48,342,058 RAC: 29,814 |
ATLAS downloads are working again; I got a couple of tasks this morning. |
Joined: 24 Oct 04 Posts: 1169 Credit: 54,079,358 RAC: 51,688 |
ATLAS was down for the weekend but is trying to get back to work now.
Volunteer Mad Scientist For Life |
Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
"ATLAS was down for the weekend but is trying to get back to work now."
I just pinged 128.142.202.86, and this now worked (in contrast to the past few days). However, when I run tracert against this IP, the last hop that responds is e513-e-rbrxl-1-ne0.cern.ch [192.65.184.37]; after this, there is a timeout. Pinging 192.65.184.37 times out as well. So obviously, the problem still exists (to some extent). |
Joined: 2 Sep 04 Posts: 455 Credit: 200,206,925 RAC: 46,843 |
"So obviously, the problem still exists (to some extent)"
My clients have successfully downloaded work and filled up their buffers again; that wouldn't have been possible if there were still a problem. It is normal for most servers on the Internet that traceroute cannot trace the whole route to the target.
Supporting BOINC, a great concept! |
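Since ping and traceroute replies are often filtered along the way, a TCP connection attempt to the port the client actually uses is usually a more telling check. This is a minimal sketch, assuming the download server is reached over plain HTTP on port 80 (the debug log earlier in the thread shows `http://` URLs); `tcp_port_open` is a hypothetical helper name, and the demo connects to a local listener rather than to CERN.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Offline demo: open a listener on an ephemeral local port and probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
local_port = listener.getsockname()[1]
reachable = tcp_port_open("127.0.0.1", local_port, timeout=2.0)
listener.close()
print(reachable)
```

For the server discussed here, the equivalent call would be `tcp_port_open("boincai04.cern.ch", 80)`. A successful connect only shows the port is reachable; it does not guarantee that a large download will complete.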
Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
"It is normal for most servers on the Internet that traceroute cannot trace the whole route to the target."
Okay, thanks for the information; I was not aware of that. So I'll try ATLAS again later today. |
Joined: 22 May 17 Posts: 15 Credit: 1,226,011 RAC: 569 |
Coming full circle on the thread... I was able to successfully download ATLAS files and tasks this evening. The downloads started off with somewhat slow throughput; otherwise there were no issues. All is well again, thank you! |
Joined: 18 Dec 15 Posts: 1785 Credit: 117,278,447 RAC: 71,589 |
I have been able to download several ATLAS tasks since yesterday, but just now a new ATLAS task download got stuck again on the 116MB file (all the other, smaller files downloaded fine). So the recent problem seems to be back :-((( What's going on at CERN? |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
"I have been able to download several ATLAS tasks since yesterday, but just now a new ATLAS task download got stuck again on the 116MB file (all the other, smaller files downloaded fine)."
Although I am a "sixtrack" man, I am following this, as I am sure my colleagues are. My PERSONAL position is that there are serious network/server overload problems and that errors are not being recovered from, but that is just me. Eric. |
Joined: 14 Jan 10 Posts: 1411 Credit: 9,433,926 RAC: 11,615 |
It's obviously holiday time, and so it is for my machine whenever it wants to run ATLAS. Again, 2 tasks:
<core_client_version>7.7.2</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_7cd27135204b4d2716c62ba7aab9f41f</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
and
<core_client_version>7.7.2</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_97f95c9e9dae64907e7b324f5bf84ba1</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error> |
Joined: 15 Jun 08 Posts: 2520 Credit: 251,915,653 RAC: 128,265 |
I see the same errors on both of my hosts:
- Download of the large job file runs into transient HTTP errors several times.
- When the download finally succeeds and the job starts, the download of the smaller files is very slow, and most of them are downloaded from a spare server (ccfrontier.in2p3.fr, port 23128).
- After all downloads are finished, the job fails with error 65.
- Increasing the RAM setting for the VM does not solve the problem.
- It affects only ATLAS; other vbox projects from CERN run OK.
Altogether, it looks like a network or firewall problem at CERN or its partners. Sad to say, since Erich56 pointed out the problem, nobody responsible for ATLAS has written a single word here on the message board. Are you aware of it? |
©2024 CERN