Message boards : ATLAS application : Download failures

Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 249
Credit: 5,974,599
RAC: 0
Message 31752 - Posted: 1 Aug 2017, 6:37:03 UTC

From the BOINC point of view it seems ok, my client downloads and uploads ATLAS files:

01-Aug-2017 08:33:12 [LHC@home] Started download of 5W5KDmokewqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmnFMKDm0GAUfn_input.tar.gz
01-Aug-2017 08:33:12 [LHC@home] Finished upload of w-c8_n20_lhc2016_40_MD-105-16-476-2.5-0.9157__24__s__64.31_59.32__1_2__6__13.5_1_sixvf_boinc5378_1_0
01-Aug-2017 08:33:13 [LHC@home] Finished download of 5W5KDmokewqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmnFMKDm0GAUfn_input.tar.gz
01-Aug-2017 08:33:13 [LHC@home] Started download of rte_5W5KDmokewqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmnFMKDm0GAUfn.tar.gz
01-Aug-2017 08:33:14 [LHC@home] Finished download of rte_5W5KDmokewqnSu7Ccp2YYBZmABFKDmABFKDmXNGKDmnFMKDm0GAUfn.tar.gz
01-Aug-2017 08:33:14 [LHC@home] Started download of boinc_job_script.xp0zEy
01-Aug-2017 08:33:15 [LHC@home] Finished download of boinc_job_script.xp0zEy


We've notified our ATLAS colleagues about the possible application/Frontier problem.
ID: 31752
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 23,290
Message 31753 - Posted: 1 Aug 2017, 6:53:35 UTC - in response to Message 31752.  

Thank you Nils.
ID: 31753
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,656
RAC: 18,514
Message 31754 - Posted: 1 Aug 2017, 6:59:39 UTC - in response to Message 31752.  
Last modified: 1 Aug 2017, 7:11:38 UTC

Nils Høimyr wrote:
From the BOINC point of view it seems ok, my client downloads and uploads ATLAS files

This seems to be the strange thing: some people do NOT have this problem, while others do (here, also ping yields a timeout, again).


Nils Høimyr wrote:
We've notified our ATLAS colleagues about the possible application/Frontier problem.

So let's wait and see what they can/will find out.
ID: 31754
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31755 - Posted: 1 Aug 2017, 7:26:37 UTC

I now run ATLAS on two CPU cores at a time with no problems for either downloading work units or errors running them.

I think CERN is sort of ignoring the single-CPU users, to encourage the multi-core version. It is supposed to save on bandwidth, etc. I liked the single-core version for efficiency, but two cores will work for me.
ID: 31755
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 307
Message 31756 - Posted: 1 Aug 2017, 7:38:21 UTC

CERN IT found the solution to this problem yesterday morning, and they will find it today as well. So take a break.
ID: 31756
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 23,290
Message 31757 - Posted: 1 Aug 2017, 7:55:02 UTC - in response to Message 31754.  

Erich56 wrote:
... also ping yields a timeout, again ...

Hi Erich,

As already stated by Yeti, a failing ping is not a criterion to see if the server is running or not.
Ping uses the ICMP protocol and may be dropped/rejected by any of the hops between your host and the target system.

The file transfer done by the projects is (mostly) done via HTTP.
To check the availability of a service on a distinct server the VMs typically use a command like
nc -z -v -w 5 lhchomeproxy.cern.ch 80

and an answer like
Connection to lhchomeproxy.cern.ch 80 port [tcp/http] succeeded!

shows that the service is up and also that the route to the server is not blocked.

Unfortunately this is only the network part of a connection.
Some services are protected by special credentials at a higher communication level, and these can sometimes cause failures as well.
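
For hosts where nc is not installed, roughly the same TCP-level check can be sketched in Python using only the standard library. This is a minimal sketch, not what the VMs actually run; the function name tcp_port_open is made up for illustration, and the 5-second default timeout mirrors the -w 5 option of the nc command above.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it is not part of BOINC or the VMs.
    """
    try:
        # create_connection resolves the host name and completes the TCP
        # handshake; the socket is closed again as soon as the block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling tcp_port_open("lhchomeproxy.cern.ch", 80) would correspond to the nc command above. Like nc, it only shows that the port accepts TCP connections, not that the service behind it will accept your request.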
ID: 31757
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 852
Message 31758 - Posted: 1 Aug 2017, 8:04:42 UTC
Last modified: 1 Aug 2017, 8:06:27 UTC

It seems that the 120 MB files should be downloaded from the boincai04 server, and that server is not reachable for me.
Maybe it is reachable for Nils because he's on CERN's LAN.
ID: 31758
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1177
Credit: 54,887,670
RAC: 3,877
Message 31759 - Posted: 1 Aug 2017, 8:24:43 UTC - in response to Message 31758.  

Yes, the boincai04 server is down once again, so you should probably just try one of the other project tasks for now.
Volunteer Mad Scientist For Life
ID: 31759
AGLT2
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Project scientist

Joined: 23 Jun 14
Posts: 12
Credit: 6,352,776,785
RAC: 860,337
Message 31762 - Posted: 1 Aug 2017, 8:37:28 UTC - in response to Message 31711.  

Hi,
The server hosting this file was down, which is why there was a download error. We have now brought the machine back, and the file should be available.

Cheers!

Ever since the day that there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running Atlas fine for a long time, where it would download 4 tasks and run them through without issue.. What I am seeing now (and since the day of the server cleanup) is my machine will attempt to download files for tasks, and get stuck retrying on a few for several hours.

7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.

I have Updated, Restarted, Removed and re-added the LHC project several times, over several days. Suggestions? Tried 2 different networks (home and work), same issues. Connection to other projects are no issue.
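
To see how long the client intends to wait before retrying, the back-off lines in a log like this can be pulled out with a short Python script. This is only a sketch: the line layout is assumed from the excerpt quoted above, and BACKOFF_RE and download_backoffs are hypothetical names, not part of BOINC.

```python
import re

# Assumed layout, taken from the excerpt above:
# 7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of <file>
BACKOFF_RE = re.compile(
    r"\|\s*(?P<project>[^|]*?)\s*\|\s*Backing off "
    r"(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2}) on download of (?P<file>\S+)"
)


def download_backoffs(lines):
    """Yield (project, file name, back-off in seconds) per back-off line."""
    for line in lines:
        match = BACKOFF_RE.search(line)
        if match:
            # Convert the HH:MM:SS back-off interval into seconds.
            seconds = (int(match["h"]) * 3600
                       + int(match["m"]) * 60
                       + int(match["s"]))
            yield match["project"], match["file"], seconds
```

For the excerpt above this yields one entry for jf_f3ff3ac08153d0ee04ea606f0dea9a0e with a back-off of 3941 seconds (01:05:41).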
ID: 31762
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1177
Credit: 54,887,670
RAC: 3,877
Message 31763 - Posted: 1 Aug 2017, 8:40:42 UTC - in response to Message 31762.  
Last modified: 1 Aug 2017, 9:11:59 UTC

Hi,
The server which is hosting this file was down, that is why there was a download error.. Now we have brought back the machine, and the file should be available.

Cheers!

Ever since the day that there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running Atlas fine for a long time, where it would download 4 tasks and run them through without issue.. What I am seeing now (and since the day of the server cleanup) is my machine will attempt to download files for tasks, and get stuck retrying on a few for several hours.

7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.

I have Updated, Restarted, Removed and re-added the LHC project several times, over several days. Suggestions? Tried 2 different networks (home and work), same issues. Connection to other projects are no issue.


Server error: feeder not running

Server can't open log file (../log_boincai04/scheduler.log)


(did you happen to check your pm over at TEST lately? )
ID: 31763
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 23,290
Message 31765 - Posted: 1 Aug 2017, 9:21:50 UTC

Problems with boincai04 affect the download of the EVNT files.
It's good to hear that they are solved.

However, there may be additional problems independent of boincai04, as the heavy use of the spare server ccfrontier.in2p3.fr and the missing output at console 2 suggest.
Has this also been checked/solved?
ID: 31765
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,656
RAC: 18,514
Message 31766 - Posted: 1 Aug 2017, 9:26:34 UTC - in response to Message 31757.  
Last modified: 1 Aug 2017, 9:29:52 UTC

computezrmle wrote:
As already stated by Yeti, a failing ping is not a criterion to see if the server is running or not.
Ping uses the ICMP protocol and may be dropped/rejected by any of the hops between your host and the target system.

Well, my experience over the past few days was that ping worked fine when the server was alive and showed a timeout when the server was down. Maybe it was just a coincidence that it worked that way. I personally am NOT a network specialist at all.


computezrmle wrote:
To check the availability of a service on a distinct server the VMs typically use a command like
nc -z -v -w 5 lhchomeproxy.cern.ch 80

I now tried this, but I got the message that "nc" is a wrong command.
ID: 31766
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 23,290
Message 31767 - Posted: 1 Aug 2017, 9:46:17 UTC - in response to Message 31766.  

Erich56 wrote:
I now tried this, but I got the message that "nc" is a wrong command.

Well, it's a command that was originally written for Unix (nc is the short form of netcat) and therefore isn't available by default on Windows.
ID: 31767
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,656
RAC: 18,514
Message 31768 - Posted: 1 Aug 2017, 10:11:03 UTC - in response to Message 31767.  

netcat

"netcat" - sounds sweet :-)
ID: 31768
AGLT2
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Project scientist

Joined: 23 Jun 14
Posts: 12
Credit: 6,352,776,785
RAC: 860,337
Message 31781 - Posted: 2 Aug 2017, 3:53:54 UTC - in response to Message 31763.  

Yes, there was a permission issue with the scheduler file on boincai04; it is fixed now.

Cheers!

Hi,
The server which is hosting this file was down, that is why there was a download error.. Now we have brought back the machine, and the file should be available.

Cheers!

Ever since the day that there was the issue after the cleanup of old files, I have been experiencing the same issues. LHC was running Atlas fine for a long time, where it would download 4 tasks and run them through without issue.. What I am seeing now (and since the day of the server cleanup) is my machine will attempt to download files for tasks, and get stuck retrying on a few for several hours.

7/29/2017 7:22:43 PM | LHC@home | Started download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:05 PM | | Project communication failed: attempting access to reference site
7/29/2017 7:23:05 PM | LHC@home | Temporarily failed download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e: connect() failed
7/29/2017 7:23:05 PM | LHC@home | Backing off 01:05:41 on download of jf_f3ff3ac08153d0ee04ea606f0dea9a0e
7/29/2017 7:23:06 PM | | Internet access OK - project servers may be temporarily down.

I have Updated, Restarted, Removed and re-added the LHC project several times, over several days. Suggestions? Tried 2 different networks (home and work), same issues. Connection to other projects are no issue.


Server error: feeder not running

Server can't open log file (../log_boincai04/scheduler.log)


(did you happen to check your pm over at TEST lately? )
ID: 31781
AGLT2
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Project scientist

Joined: 23 Jun 14
Posts: 12
Credit: 6,352,776,785
RAC: 860,337
Message 31783 - Posted: 2 Aug 2017, 3:59:35 UTC - in response to Message 31782.  

Just to summarize the cause of the failure for some files with ATLAS@home:
1. Some of the input files were stored on a test server, boincai04, which was down yesterday due to heavy load. We have modified the job submission script so that all input files are stored on more powerful and reliable servers, which should prevent this from happening again.
2. For people who still have hosts attached to the Atlas-test project: the server (boincai04), a tiny machine, got stuck a few times in the past few days due to the heavy workload on it. We have now split the workload across different machines, and boincai04 still dispatches a small number of test jobs.

Cheers!
ID: 31783
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,656
RAC: 18,514
Message 31784 - Posted: 2 Aug 2017, 5:12:49 UTC - in response to Message 31783.  

Thanks, Wenjing, for the update; have a nice day :-)
ID: 31784
Harri Liljeroos
Joined: 28 Sep 04
Posts: 732
Credit: 49,373,095
RAC: 13,741
Message 32593 - Posted: 2 Oct 2017, 19:53:04 UTC

Just got a download error for an Atlas task: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=76058390 like my wingman did.
ID: 32593
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 307
Message 32724 - Posted: 9 Oct 2017, 17:47:53 UTC

For about a week now, the download of the ATLAS file (200 MByte) has been very slow.
The network counter starts at, for example, 100 kps and then drops to zero.
It takes about 1 hour instead of the usual one minute.
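
For a rough sense of scale, the average rates implied by these figures can be worked out as below. This is a back-of-the-envelope sketch assuming decimal units (1 MByte = 1000 kB) and a constant transfer rate; the function name is made up for illustration.

```python
def average_rate_kb_per_s(size_mbyte: float, seconds: float) -> float:
    """Average transfer rate in kB/s, assuming 1 MByte = 1000 kB."""
    return size_mbyte * 1000.0 / seconds


# 200 MByte in about one minute (the usual case).
normal = average_rate_kb_per_s(200, 60)
# 200 MByte in about one hour (the slow case reported here).
slow = average_rate_kb_per_s(200, 3600)
```

That gives roughly 3333 kB/s for the usual one-minute download and about 56 kB/s when it takes an hour.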
ID: 32724
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,656
RAC: 18,514
Message 32726 - Posted: 9 Oct 2017, 17:59:39 UTC - in response to Message 32724.  

maeax wrote:
For about a week now, the download of the ATLAS file (200 MByte) has been very slow.
The network counter starts at, for example, 100 kps and then drops to zero.
It takes about 1 hour instead of the usual one minute.

Hm, that's strange. Here, all downloads run at the same fast speed as ever (i.e. in about one minute).
ID: 32726
©2024 CERN