Thread 'Download failures'

Author	Message
Nils Volunteer moderator Project administrator Project developer Project tester Send message Joined: 15 Jul 05 Posts: 257 Credit: 6,001,083 RAC: 0	Message 31752 - Posted: 1 Aug 2017, 6:37:03 UTC From the BOINC point of view it seems ok, my client downloads and uploads ATLAS files:
01-Aug-2017 08:33:12 [LHC@home] Started download of 5W5KDmokewqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmnFMKDm0GAUfn_input.tar.gz
01-Aug-2017 08:33:12 [LHC@home] Finished upload of w-c8_n20_lhc2016_40_MD-105-16-476-2.5-0.9157__24__s__64.31_59.32__1_2__6__13.5_1_sixvf_boinc5378_1_0
01-Aug-2017 08:33:13 [LHC@home] Finished download of 5W5KDmokewqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmnFMKDm0GAUfn_input.tar.gz
01-Aug-2017 08:33:13 [LHC@home] Started download of rte_5W5KDmokewqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmnFMKDm0GAUfn.tar.gz
01-Aug-2017 08:33:14 [LHC@home] Finished download of rte_5W5KDmokewqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmnFMKDm0GAUfn.tar.gz
01-Aug-2017 08:33:14 [LHC@home] Started download of boinc_job_script.xp0zEy
01-Aug-2017 08:33:15 [LHC@home] Finished download of boinc_job_script.xp0zEy
We've notified our ATLAS colleagues about the possible application/Frontier problem. ID: 31752 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 306,476,684 RAC: 143,962	Message 31753 - Posted: 1 Aug 2017, 6:53:35 UTC - in response to Message 31752. Thank you Nils. ID: 31753 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1993 Credit: 163,981,727 RAC: 115,250	Message 31754 - Posted: 1 Aug 2017, 6:59:39 UTC - in response to Message 31752. Last modified: 1 Aug 2017, 7:11:38 UTC [quote=Nils]From the BOINC point of view it seems ok, my client downloads and uploads ATLAS files[/quote] This seems to be the strange thing: some people do NOT have this problem, others do (here, also ping yields a timeout, again). [quote=Nils]We've notified our ATLAS colleagues about the possible application/Frontier problem.[/quote] So let's wait and see what they can/will find out. ID: 31754 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 31755 - Posted: 1 Aug 2017, 7:26:37 UTC I now run ATLAS on two CPU cores at a time with no problems for either downloading work units or errors running them. I think CERN is sort of ignoring the single-CPU users, to encourage the multi-core version. It is supposed to save on bandwidth, etc. I liked the single-core version for efficiency, but two cores will work for me. ID: 31755 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 3,515	Message 31756 - Posted: 1 Aug 2017, 7:38:21 UTC CERN IT found the solution for this problem yesterday morning, and they will find it today as well. So take a break. ID: 31756 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 306,476,684 RAC: 143,962	Message 31757 - Posted: 1 Aug 2017, 7:55:02 UTC - in response to Message 31754. [quote=Erich56]... also ping yields a timeout, again ...[/quote] Hi Erich, As already stated by Yeti, a failing ping is not a criterion to see if the server is running or not. Ping uses the ICMP protocol and may be dropped/rejected by any of the hops between your host and the target system. The file transfer done by the projects is (mostly) done via HTTP. To check the availability of a service on a distinct server the VMs typically use a command like [pre]nc -z -v -w 5 lhchomeproxy.cern.ch 80[/pre] and an answer like [pre]Connection to lhchomeproxy.cern.ch 80 port [tcp/http] succeeded![/pre] shows that the service is up and also that the route to the server is not blocked. Unfortunately, this covers only the network part of a connection. Some services are protected by special credentials at a higher communication level, and these sometimes also cause failures. ID: 31757 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,118,170 RAC: 1,330	Message 31758 - Posted: 1 Aug 2017, 8:04:42 UTC Last modified: 1 Aug 2017, 8:06:27 UTC It seems that the 120MB files should be downloaded from the boincai04 server, and that server is not reachable for me. Maybe it is reachable for Nils because he's on CERN's LAN. ID: 31758 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1320 Credit: 99,770,266 RAC: 137,814	Message 31759 - Posted: 1 Aug 2017, 8:24:43 UTC - in response to Message 31758. Yes the boincai04 server is down once again so you probably should just try one of the other project tasks for now. Volunteer Mad Scientist For Life unbelievable are you trying to promote linux again? ID: 31759 · Reply Quote

AGLT2 Volunteer moderator Project administrator Project developer Project tester Volunteer developer Project scientist Send message Joined: 23 Jun 14 Posts: 12 Credit: 7,356,125,547 RAC: 1,701,180	Message 31762 - Posted: 1 Aug 2017, 8:37:28 UTC - in response to Message 31711. Hi, The server which is hosting this file was down, that is why there was a download error. Now we have brought back the machine, and the file should be available. Cheers!
[quote]Ever since the day that there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running Atlas fine for a long time, where it would download 4 tasks and run them through without issue. What I am seeing now (and since the day of the server cleanup) is my machine will attempt to download files for tasks, and get stuck retrying on a few for several hours.
7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.
I have Updated, Restarted, Removed and re-added the LHC project several times, over several days. Suggestions? Tried 2 different networks (home and work), same issues. Connection to other projects are no issue.[/quote]
ID: 31762 · Reply Quote
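The event log excerpt in the post above shows the client backing off for 01:05:41 after a failed download. As a rough illustration (not something posted in the thread), the following Python sketch pulls such backoff intervals out of pasted log text. It assumes the message wording seen in the excerpt, "Backing off HH:MM:SS on download of <file>"; the function name `parse_backoffs` is hypothetical and not part of BOINC.

```python
import re
from datetime import timedelta
from typing import Dict

# Assumed wording, taken from the log excerpt in the thread.
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (\S+)")


def parse_backoffs(log_text: str) -> Dict[str, timedelta]:
    """Map each file name to the most recent backoff interval found in log_text."""
    backoffs: Dict[str, timedelta] = {}
    for line in log_text.splitlines():
        match = BACKOFF_RE.search(line)
        if match:
            hours, minutes, seconds, name = match.groups()
            backoffs[name] = timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
    return backoffs


# Two lines copied from the excerpt above.
SAMPLE = (
    "7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of "
    "jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed\n"
    "7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of "
    "jf_f3ff3ac08153d0ee04ea606f0dea9a0e\n"
)

for name, delay in parse_backoffs(SAMPLE).items():
    print(name, "->", delay)
```

A long backoff like this one means the client will not retry that file on its own for over an hour, which matches the "stuck retrying for several hours" description.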

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1320 Credit: 99,770,266 RAC: 137,814	Message 31763 - Posted: 1 Aug 2017, 8:40:42 UTC - in response to Message 31762. Last modified: 1 Aug 2017, 9:11:59 UTC [quote=AGLT2]Hi, The server which is hosting this file was down, that is why there was a download error.. Now we have brought back the machine, and the file should be available. Cheers![/quote]
Server error: feeder not running
Server can't open log file (../log_boincai04/scheduler.log)
(did you happen to check your pm over at TEST lately? ) ID: 31763 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 306,476,684 RAC: 143,962	Message 31765 - Posted: 1 Aug 2017, 9:21:50 UTC Problems with boincai04 affect the download of the EVNT files. It's good to hear that they are solved. However, there may be additional problems independent of boincai04, as the heavy use of the spare server ccfrontier.in2p3.fr and the missing output at console 2 suggest. Has this also been checked/solved? ID: 31765 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1993 Credit: 163,981,727 RAC: 115,250	Message 31766 - Posted: 1 Aug 2017, 9:26:34 UTC - in response to Message 31757. Last modified: 1 Aug 2017, 9:29:52 UTC [quote=computezrmle]As already stated by Yeti, a failing ping is not a criterion to see if the server is running or not. Ping uses the ICMP protocol and may be dropped/rejected by any of the hops between your host and the target system.[/quote] Well, my experience over the past days was that ping worked fine when the server was alive, and that ping showed a timeout when the server was down. Maybe it was just a coincidence that it worked that way. I personally am NOT a network specialist at all. [quote=computezrmle]To check the availability of a service on a distinct server the VMs typically use a command like [pre]nc -z -v -w 5 lhchomeproxy.cern.ch 80[/pre][/quote] I now tried this, but I got the message that "nc" is a wrong command. ID: 31766 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 306,476,684 RAC: 143,962	Message 31767 - Posted: 1 Aug 2017, 9:46:17 UTC - in response to Message 31766. Erich56 wrote: I now tried this, but I got the message that "nc" is a wrong command. Well, it's a command that was originally written for Unix (short form of netcat) and therefore isn't available by default on Windows. ID: 31767 · Reply Quote
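For hosts where nc is not installed, a comparable TCP-level check can be sketched in Python. This is only an illustration and assumes Python 3 is available on the machine; the helper name `tcp_port_open` is made up here. Like the nc test quoted in the thread, it only shows that the port accepts connections and that the route is not blocked, not that the service behind it is healthy.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example call (needs network access), mirroring the nc command from the thread:
#   tcp_port_open("lhchomeproxy.cern.ch", 80, timeout=5.0)
```

The 5-second default mirrors the `-w 5` option of the nc command; a `False` result can mean either that the server is down or that something on the route blocks the port.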

Erich56 Send message Joined: 18 Dec 15 Posts: 1993 Credit: 163,981,727 RAC: 115,250	Message 31768 - Posted: 1 Aug 2017, 10:11:03 UTC - in response to Message 31767. netcat "netcat" - sounds sweet :-) ID: 31768 · Reply Quote

AGLT2 Volunteer moderator Project administrator Project developer Project tester Volunteer developer Project scientist Send message Joined: 23 Jun 14 Posts: 12 Credit: 7,356,125,547 RAC: 1,701,180	Message 31781 - Posted: 2 Aug 2017, 3:53:54 UTC - in response to Message 31763. Yes, there was a permission issue with the scheduler file on boincai04; it is fixed now. Cheers!
[quote=Magic Quantum Mechanic]Server error: feeder not running
Server can't open log file (../log_boincai04/scheduler.log)[/quote]
ID: 31781 · Reply Quote

AGLT2 Volunteer moderator Project administrator Project developer Project tester Volunteer developer Project scientist Send message Joined: 23 Jun 14 Posts: 12 Credit: 7,356,125,547 RAC: 1,701,180	Message 31783 - Posted: 2 Aug 2017, 3:59:35 UTC - in response to Message 31782. Just to summarize the cause of the failure for some files with ATLAS@home:
1. Some of the input files are stored on a test server, boincai04, and it was down yesterday due to heavy load. We modified the job submission script so that all the input files are stored on more powerful and reliable servers, which should prevent this from happening again.
2. For people who still attach their hosts to the Atlas-test project, the server (boincai04) got stuck a few times in the past few days due to the heavy workload on such a tiny machine. Now we split the workload across different machines, and the boincai04 machine still dispatches a small number of test jobs.
Cheers! ID: 31783 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1993 Credit: 163,981,727 RAC: 115,250	Message 31784 - Posted: 2 Aug 2017, 5:12:49 UTC - in response to Message 31783. Thanks, Wenjing, for the update; have a nice day :-) ID: 31784 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 814 Credit: 66,468,943 RAC: 29,423	Message 32593 - Posted: 2 Oct 2017, 19:53:04 UTC Just got a download error for an Atlas task: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=76058390 like my wingman did. ID: 32593 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 3,515	Message 32724 - Posted: 9 Oct 2017, 17:47:53 UTC For about a week, the download of the Atlas file (200 MByte) has been very slow. The network counter starts at, for example, 100 kps and drops to zero. It takes about 1 hour instead of the usual one minute. ID: 32724 · Reply Quote
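To put the reported slowdown in numbers: taking the 200 MByte file size and the approximate times from the post above (about one minute normally, about one hour now), the average rates can be estimated as below. The helper name `average_rate_kb_per_s` and the decimal units (1 MB = 1000 kB) are assumptions made for this illustration.

```python
def average_rate_kb_per_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate in kB/s for size_mb megabytes moved in `seconds` (1 MB = 1000 kB)."""
    return size_mb * 1000.0 / seconds


normal = average_rate_kb_per_s(200, 60)    # about one minute
slow = average_rate_kb_per_s(200, 3600)    # about one hour

print(f"normal: {normal:.0f} kB/s, slow: {slow:.1f} kB/s, factor: {normal / slow:.0f}x")
```

Under these assumptions the usual download averages roughly 3.3 MB/s, while the slow one averages under 60 kB/s, a factor of about 60.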

Erich56 Send message Joined: 18 Dec 15 Posts: 1993 Credit: 163,981,727 RAC: 115,250	Message 32726 - Posted: 9 Oct 2017, 17:59:39 UTC - in response to Message 32724. [quote=maeax]For about a week, the download of the Atlas file (200 MByte) has been very slow. The network counter starts at, for example, 100 kps and drops to zero. It takes about 1 hour instead of the usual one minute.[/quote] Hm, that's strange. Here, all downloads run at the same fast speed as ever (i.e. in about one minute). ID: 32726 · Reply Quote