Message boards : ATLAS application : Download failures

thomasroderick
Joined: 22 May 17
Posts: 15
Credit: 1,226,011
RAC: 467
Message 31711 - Posted: 30 Jul 2017, 0:32:04 UTC

Ever since the day there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running ATLAS fine for a long time: it would download 4 tasks and run them through without issue. What I am seeing now (and since the day of the server cleanup) is that my machine will attempt to download files for tasks and get stuck retrying a few of them for several hours.

7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.

I have updated, restarted, and removed and re-added the LHC project several times over several days. Suggestions? I tried 2 different networks (home and work), with the same issues. Connections to other projects are not a problem.
ID: 31711
Erich56
Joined: 18 Dec 15
Posts: 1788
Credit: 117,620,059
RAC: 80,849
Message 31712 - Posted: 30 Jul 2017, 3:59:12 UTC - in response to Message 31711.  

Ever since the day there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running ATLAS fine for a long time: it would download 4 tasks and run them through without issue. What I am seeing now (and since the day of the server cleanup) is that my machine will attempt to download files for tasks and get stuck retrying a few of them for several hours.

7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.

I have updated, restarted, and removed and re-added the LHC project several times over several days. Suggestions? I tried 2 different networks (home and work), with the same issues. Connections to other projects are not a problem.


This is exactly what I am experiencing now.
So all of us who have this problem can at least be sure that it has nothing to do with our systems. It's rather up to CERN to fix this issue.
All we can do is hope that someone there reads this thread and takes some action.
ID: 31712
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1413
Credit: 9,434,983
RAC: 9,630
Message 31724 - Posted: 30 Jul 2017, 9:58:04 UTC
Last modified: 30 Jul 2017, 11:16:05 UTC

Opening this thread because I experienced download issues with the 120MB ATLAS task files, as others did, but mentioned them in another thread (I will move those posts here).

<core_client_version>7.7.2</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_5e60912e104e160658713cac240e41fb</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>

I've also noticed today and yesterday when visiting webpages on LHC@home and returning to a previous page, I sometimes get the browser notice: network changed detected.
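
For anyone who wants to check a downloaded file by hand: error -119 above says the MD5 of the received file did not match the expected value. Below is a minimal Python sketch of such a check, not BOINC's own code. The function names `file_md5` and `md5_matches` are made up for illustration, and the expected digest is assumed to be something you already have from elsewhere.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so a 120MB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_hex):
    """Compare the file's MD5 with an expected hex digest (case-insensitive)."""
    return file_md5(path) == expected_hex.strip().lower()
```

A mismatch on a file that finished downloading would point at corruption or truncation in transit rather than at the local machine's settings.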
ID: 31724
Erich56
Joined: 18 Dec 15
Posts: 1788
Credit: 117,620,059
RAC: 80,849
Message 31725 - Posted: 30 Jul 2017, 10:39:45 UTC - in response to Message 31724.  

Opening this thread ...

great idea, thanks!

What will the next steps be? In particular, how can this problem be brought to the attention of the people at CERN who are in charge of fixing such network issues?
ID: 31725
XOX
Joined: 20 Feb 16
Posts: 3
Credit: 46,306
RAC: 0
Message 31726 - Posted: 30 Jul 2017, 10:43:47 UTC

Hey guys, I have the same problem. I hope the CERN team can fix it.

30.07.2017 12:29:34 | LHC@home | Started download of boinc_job_script.8sdJMf
30.07.2017 12:29:35 | LHC@home | Finished download of boinc_job_script.8sdJMf
30.07.2017 12:29:53 | LHC@home | Temporarily failed download of jf_122159ff524343058e02c7137926559d: connect() failed
30.07.2017 12:29:53 | LHC@home | Backing off 00:03:14 on download of jf_122159ff524343058e02c7137926559d
30.07.2017 12:30:04 | | Project communication failed: attempting access to reference site
30.07.2017 12:30:06 | | Internet access OK - project servers may be temporarily down.
ID: 31726
Greger
Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 31727 - Posted: 30 Jul 2017, 13:44:27 UTC

Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Temporarily failed download of jf_3536bf3e25f337041aca72316e5e0fec: transient HTTP error
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Backing off 00:25:41 on download of jf_3536bf3e25f337041aca72316e5e0fec
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Temporarily failed download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2: transient HTTP error
Sun 30 Jul 2017 02:42:16 PM CEST | LHC@home | Backing off 00:16:41 on download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2
Sun 30 Jul 2017 02:42:18 PM CEST | | Internet access OK - project servers may be temporarily down.

With debug:
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] HTTP_OP::init_get(): http://boincai04.cern.ch/Atlas-test/download/10d/vnPNDmwxevqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmhFLKDmFy3E7n_EVNT.11266146._002827.pool.root.1
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | Started download of jf_3536bf3e25f337041aca72316e5e0fec
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] HTTP_OP::init_get(): http://boincai04.cern.ch/Atlas-test/download/13c/GmYNDmcofvqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmuHLKDmJIpshn_EVNT.11266146._002831.pool.root.1
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | Started download of jf_d4b6ce59cac0e54eb4bddb1b2e4b43e2
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1548] Info: Connection 853 seems to be dead!
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1548] Info: Closing connection 853
Sun 30 Jul 2017 03:37:54 PM CEST | LHC@home | [http] [ID#1549] Info: Found bundle for host boincai04.cern.ch: 0x559afaf3cfe0 [serially]
Sun 30 Jul 2017 03:37:54 PM CEST | | [network_status] status: online
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1548] Info: Trying 128.142.202.86...
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Hostname was found in DNS cache
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Trying 128.142.202.86...
ID: 31727
Erich56
Joined: 18 Dec 15
Posts: 1788
Credit: 117,620,059
RAC: 80,849
Message 31728 - Posted: 30 Jul 2017, 14:12:53 UTC - in response to Message 31727.  
Last modified: 30 Jul 2017, 14:20:32 UTC

Info: Hostname was found in DNS cache
Sun 30 Jul 2017 03:37:55 PM CEST | LHC@home | [http] [ID#1549] Info: Trying 128.142.202.86...

Pinging 128.142.202.86 yields "request timed out", which was to be expected :-(

with tracert, the last successful connection is with
e513-e-rbrxl-1-ne0.cern.ch [192.65.184.37]

after this, again "timeout"
ID: 31728
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 31729 - Posted: 30 Jul 2017, 17:31:32 UTC

I got the same download problems with one of my machines, which is dedicated to Atlas tasks.

I have the feeling that this situation is somehow related to Sixtrack. Whenever Sixtrack has thousands of workunits in the queue, ATLAS seems to get "hiccups". I recall similar problems a few weeks ago, the last time Sixtrack had so many WUs to distribute.

Could the two be related?
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 31729
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 31732 - Posted: 30 Jul 2017, 23:01:30 UTC

Did someone move the ATLAS@home download server to another IP address? I noticed that my BOINC client cannot connect to the download server at all in regards to the ATLAS@home tasks, while it is able to download other tasks. If that is the case, the solution could be to wait for the old DNS entry to expire. However, if someone changed the DNS without moving the ATLAS@home server to the new IP address, then either the DNS server's entry for the ATLAS@home download server needs to be changed back or the ATLAS@home server needs to be moved to the new IP address.
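
The DNS part of this guess can be checked from the volunteer side. Here is a minimal Python sketch, assuming the hostname boincai04.cern.ch and the address 128.142.202.86 from Greger's debug log earlier in this thread are the relevant ones. `resolve_ipv4` and `resolves_to` are hypothetical helper names, and the result only reflects what your local resolver returns.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the local resolver gives."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def resolves_to(hostname, expected_ip):
    """True if `expected_ip` is among the addresses `hostname` resolves to."""
    try:
        return expected_ip in resolve_ipv4(hostname)
    except socket.gaierror:
        # The name could not be resolved at all.
        return False
```

For example, `resolves_to("boincai04.cern.ch", "128.142.202.86")` would tell you whether your resolver still hands out the address seen in that log. It says nothing about whether the server behind that address is reachable.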
ID: 31732
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 31734 - Posted: 31 Jul 2017, 0:32:56 UTC
Last modified: 31 Jul 2017, 0:34:27 UTC

From what I saw, the problem only occurs when downloading the biggest file (110 - 120 MB); the other files of the task download without problems.
Also, the problem developed gradually: 2 days ago the download was still possible, though extremely slow and only after multiple retries. Now the download fails systematically, with the message "server backoff".
We are the product of random evolution.
ID: 31734
Harri Liljeroos
Joined: 28 Sep 04
Posts: 722
Credit: 48,413,457
RAC: 27,593
Message 31736 - Posted: 31 Jul 2017, 8:09:31 UTC

ATLAS downloads are working again; I got a couple of tasks this morning.
ID: 31736
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1169
Credit: 54,342,323
RAC: 59,768
Message 31737 - Posted: 31 Jul 2017, 9:15:12 UTC

Atlas was down for the weekend but is trying to get back to work now.
Volunteer Mad Scientist For Life
ID: 31737
Erich56
Joined: 18 Dec 15
Posts: 1788
Credit: 117,620,059
RAC: 80,849
Message 31738 - Posted: 31 Jul 2017, 10:31:00 UTC - in response to Message 31737.  

Atlas was down for the weekend but is trying to get back to work now.

I just pinged 128.142.202.86 - this now worked (in contrast to the past few days);
however, when applying tracert to this IP, the last communication is with
e513-e-rbrxl-1-ne0.cern.ch [192.65.184.37]
after this, there is a timeout.

When pinging 192.65.184.37, there is a timeout as well.

So obviously, the problem still exists (to some extent).
ID: 31738
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 200,408,411
RAC: 51,496
Message 31739 - Posted: 31 Jul 2017, 11:39:29 UTC - in response to Message 31738.  
Last modified: 31 Jul 2017, 11:39:49 UTC

So obviously, the problem still exists (to some extent)

My clients have successfully downloaded work and filled up their buffers again; that wouldn't have been possible if there were still a problem.

It is normal for most servers on the Internet that traceroute cannot trace the whole route to the target.
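
Since ping and traceroute can be filtered along the way, a more direct test is whether a TCP connection to the download server can be opened. Here is a minimal Python sketch, assuming plain HTTP on port 80 (the download URLs in the debug log earlier in this thread start with http://). `tcp_port_open` is a hypothetical helper name.

```python
import socket


def tcp_port_open(host, port=80, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened in time."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `tcp_port_open("boincai04.cern.ch")` checks the same kind of connection the BOINC client needs for a download. It succeeding while ping times out would fit the explanation above.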


Supporting BOINC, a great concept !
ID: 31739
Erich56
Joined: 18 Dec 15
Posts: 1788
Credit: 117,620,059
RAC: 80,849
Message 31740 - Posted: 31 Jul 2017, 12:06:18 UTC - in response to Message 31739.  

It is normal for most servers on the Internet that traceroute cannot trace the whole route to the target

Okay, thanks for the information; I was not aware of that. So I'll try ATLAS again later today.
ID: 31740
thomasroderick
Joined: 22 May 17
Posts: 15
Credit: 1,226,011
RAC: 467
Message 31747 - Posted: 1 Aug 2017, 3:48:10 UTC

Coming full circle on the thread... I was able to successfully download ATLAS files and tasks this evening. The downloads started off a little slow on throughput; otherwise there were no issues. All is well again, thank you!
ID: 31747
Erich56
Joined: 18 Dec 15
Posts: 1788
Credit: 117,620,059
RAC: 80,849
Message 31748 - Posted: 1 Aug 2017, 5:16:20 UTC

After I was able to download several ATLAS tasks since yesterday, just now a new ATLAS task download got stuck again on the 116MB file (all the other, smaller files downloaded fine).

So, the recent problem seems to be back :-(((
What's going on at CERN?
ID: 31748
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31749 - Posted: 1 Aug 2017, 5:34:14 UTC - in response to Message 31748.  

Although I am a "sixtrack" man, I am following this, as, I am sure, are my colleagues. My PERSONAL position is that there are serious network/server overload problems and that errors are not being recovered, but that is just me. Eric.


After I was able to download several ATLAS tasks since yesterday, just now a new ATLAS task download got stuck again on the 116MB file (all the other, smaller files downloaded fine).

So, the recent problem seems to be back :-(((
What's going on at CERN?

ID: 31749
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1413
Credit: 9,434,983
RAC: 9,630
Message 31750 - Posted: 1 Aug 2017, 5:42:41 UTC

It's obviously holiday time, and so it is for my machine whenever it wants to run ATLAS.

Again 2 tasks:

<core_client_version>7.7.2</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_7cd27135204b4d2716c62ba7aab9f41f</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>


and

<core_client_version>7.7.2</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>jf_97f95c9e9dae64907e7b324f5bf84ba1</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
ID: 31750
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2520
Credit: 252,419,410
RAC: 135,840
Message 31751 - Posted: 1 Aug 2017, 6:09:28 UTC

I see the same errors on both of my hosts:
- Download of the large job file runs into transient HTTP errors several times
- When the download finally succeeds and the job starts, the download of the smaller files is very slow and most of them are downloaded from a spare server (ccfrontier.in2p3.fr, port 23128)
- After all downloads are finished, the job fails with error 65
- Increasing the RAM setting for the VM does not solve the problem
- It affects only ATLAS; other vbox projects from CERN run OK.

Altogether it looks like a network or firewall problem at CERN or its partners.


Sad to say, since Erich56 pointed out the problem, nobody responsible for ATLAS has written a single word here on the message board.
Are you aware of it?
ID: 31751


©2024 CERN