Message boards : Number crunching : Downloads have stalled

David Lambert
Joined: 2 Dec 05
Posts: 11
Credit: 2,607,594
RAC: 0
Message 35767 - Posted: 4 Jul 2018, 16:07:37 UTC - in response to Message 35766.  

My downloads are better, but some still get stuck, and apparently the "Retry" option doesn't actually work: it just throws an "error while downloading". Uploads are starting to get through but are still taking hours.

ID: 35767

David Lambert
Joined: 2 Dec 05
Posts: 11
Credit: 2,607,594
RAC: 0
Message 35768 - Posted: 4 Jul 2018, 16:09:55 UTC - in response to Message 35767.  

'Retry' only works on failed downloads. If the download is active, 'Retry' doesn't do anything.
ID: 35768

BelgianEnthousiast
Joined: 5 Apr 15
Posts: 18
Credit: 5,910,849
RAC: 0
Message 35775 - Posted: 5 Jul 2018, 10:34:34 UTC - in response to Message 35768.  

To the LHC administrators:

On my machine (Win 10 Pro - see my previous post) the behaviour I observe is the following:

1. Only LHC (ATLAS) WU downloads stall; Rosetta, WorldComGrid, ClimatePrediction and GPUGrid download just fine.
2. The downloads (usually the big ones > 200 MB) start well at 5-7 Mbps but gradually degrade, and at around 50-80 % of the total file size the download speed drops to zero.
3. At that point, all other downloads from LHC are blocked.
4. When I deactivate the network and reactivate it, the downloads resume. If the big download had progressed far enough to get the whole file in, the smaller files then download as well. However, if too much of the file was left to download, I observe exactly the same behaviour again: the download speed decreases over time to zero and the download stalls (again).
5. I also saw the same thing after suspending all WUs, exiting BOINC, restarting it and then resuming all WUs.

Can you please investigate whether this has to do with the BOINC manager, the Windows TCP/IP stack or the (T)FTP protocol, or with the file transfer software on your end?

I lost a whole night of crunching because of this once again... (4th or 5th time in a week)

Many thanks in advance !

B.
ID: 35775

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,095,661
RAC: 103,406
Message 35777 - Posted: 5 Jul 2018, 11:54:29 UTC

Saw the same last night on one PC, with 7 hours of downloading.
After I paused and reactivated network activity in BOINC this morning, it finished.
Buffer overflow in the network?
Network too busy?
More than 300k BOINC tasks in ATLAS on the 3rd of July?
An investigation is needed, but it is not easy (very dynamic traffic).
ID: 35777

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 35779 - Posted: 5 Jul 2018, 12:22:37 UTC - in response to Message 35777.  

We are investigating the problem along with the server admins. It may be due to higher load or a load-balancing issue on the servers but we don't know yet. We are planning to ask some of our power users to switch to the dev project to see if this helps.
ID: 35779

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 35780 - Posted: 5 Jul 2018, 12:25:46 UTC - in response to Message 35779.  
Last modified: 5 Jul 2018, 12:26:08 UTC

We are investigating the problem along with the server admins. It may be due to higher load or a load-balancing issue on the servers but we don't know yet. We are planning to ask some of our power users to switch to the dev project to see if this helps.

I did this when the problem came up, but the download problem also happened on dev.


Supporting BOINC, a great concept !
ID: 35780

Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 35786 - Posted: 6 Jul 2018, 13:01:48 UTC - in response to Message 35780.  

Did reducing the load help or is the issue still there?
ID: 35786

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,095,661
RAC: 103,406
Message 35789 - Posted: 6 Jul 2018, 13:32:50 UTC - in response to Message 35786.  

At the moment one Windows PC has a stalling transfer.
From yesterday evening until now there had been no problems.
ID: 35789

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 35792 - Posted: 6 Jul 2018, 14:41:24 UTC - in response to Message 35786.  

Did reducing the load help or is the issue still there?

The issue is still here, look:



Please check the switch(es) on the way out of CERN. I had a similar problem, and we searched for several months for the cause. Finally we discovered that a switch had a faulty port; since we replaced the switch, all our problems are gone.


Supporting BOINC, a great concept !
ID: 35792

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,095,661
RAC: 103,406
Message 35794 - Posted: 6 Jul 2018, 16:47:19 UTC
Last modified: 6 Jul 2018, 16:51:05 UTC

Two ATLAS tasks are downloading at the same time on one Windows PC at the moment.
One ran at the maximum speed of 7.5 Mbit/s until it finished.
The second was at 3.0 Mbit/s after the first finished; it did not climb to 7.5 Mbit/s, but stalled.
Pausing and resuming BOINC network activity let the second one finish.
Otherwise it would have ended up like the picture Yeti showed in his last message.
Edit: I will test limiting the download speed in the BOINC preferences this weekend.
ID: 35794

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 35795 - Posted: 6 Jul 2018, 16:50:55 UTC

Hours later, you can see that the client is still trying to download the same files as in my earlier post:




Supporting BOINC, a great concept !
ID: 35795

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 35796 - Posted: 6 Jul 2018, 17:26:34 UTC

The problem is still there, only for ATLAS downloads, but not every new task is affected.
When I download the *.root.1 file via my browser there is no problem.
Stopping BOINC, copying the manually downloaded file to the project directory and restarting BOINC helps.
The stalled download disappears and the task starts.
ID: 35796

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,925,441
RAC: 137,691
Message 35797 - Posted: 6 Jul 2018, 17:34:24 UTC

If Yeti's guess is right, all the workarounds mentioned in various threads will sooner or later run into the same problems as the BOINC client.
ID: 35797

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 35798 - Posted: 6 Jul 2018, 18:59:08 UTC - in response to Message 35792.  

Please check the switch(es) on the way out of CERN. I had a similar problem, and we searched for several months for the cause. Finally we discovered that a switch had a faulty port; since we replaced the switch, all our problems are gone.

By the way: I don't know CERN's configuration, so maybe it is only a faulty network card or network cable.


Supporting BOINC, a great concept !
ID: 35798

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,095,661
RAC: 103,406
Message 35800 - Posted: 7 Jul 2018, 5:49:15 UTC

This morning a Windows PC had been downloading TWO ATLAS tasks in parallel for 4 hours.
After I paused and reactivated BOINC network activity, it finished in 1 minute.
No download speed limit is set in BOINC.
ID: 35800

bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35801 - Posted: 7 Jul 2018, 6:46:27 UTC

It might be a long while before they find the cause of this problem so I've decided to script a kludge and run the kludge as a cron job every 10 minutes. The script uses boinccmd to suspend network activity on my 2 hosts running ATLAS tasks. It sleeps for 10 seconds then resumes network activity then exits. The script is saved to /home/bronco/bin/atlas_dl_kludge.sh

boinccmd --host localhost --passwd <gui_rpc_password> --set_network_mode never
boinccmd --host lappy     --passwd <gui_rpc_password> --set_network_mode never
sleep 10
boinccmd --host localhost --passwd <gui_rpc_password> --set_network_mode always
boinccmd --host lappy     --passwd <gui_rpc_password> --set_network_mode always


The crontab entry:

*/10 * * * * bash /home/bronco/bin/atlas_dl_kludge.sh
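
The same suspend/sleep/resume sequence could also be written as a small parameterised script. This is only a sketch under assumptions: the variable names HOSTS, PAUSE_SECONDS, DRY_RUN and BOINC_PASSWD and the helper functions are made up for illustration and are not part of boinccmd. Only `--host`, `--passwd` and `--set_network_mode` are taken from the commands above. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the network-toggle kludge with configurable hosts and pause.
# HOSTS, PAUSE_SECONDS, DRY_RUN and BOINC_PASSWD are illustrative names.

HOSTS="${HOSTS:-localhost lappy}"        # space-separated list of BOINC hosts
PAUSE_SECONDS="${PAUSE_SECONDS:-10}"     # how long network activity stays suspended
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands only, 0 = execute them

# Execute a command, or just print it when DRY_RUN=1.
# Note: the printed line includes the password argument.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply one network mode (never or always) to every host in HOSTS.
set_network_mode() {
    mode="$1"
    for host in $HOSTS; do
        run boinccmd --host "$host" \
            --passwd "${BOINC_PASSWD:-<gui_rpc_password>}" \
            --set_network_mode "$mode"
    done
}

# Suspend network activity, wait, then resume it.
toggle_network() {
    set_network_mode never
    run sleep "$PAUSE_SECONDS"
    set_network_mode always
}

toggle_network
```

To actually toggle the network from cron, DRY_RUN=0 and BOINC_PASSWD would have to be set in the environment of the crontab command.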
ID: 35801

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,925,441
RAC: 137,691
Message 35802 - Posted: 7 Jul 2018, 7:21:08 UTC - in response to Message 35801.  

Depending on the project mix that runs on the host, this workaround may have unwanted side effects.

As soon as the network activity is suspended, the BOINC client will start other tasks.
The LHC VMs will typically stay in RAM and - if there is not enough RAM - higher swapping activity may be encountered.
The latter is suspected of causing timing problems once the VMs resume (especially if they resume concurrently) and may result in a watchdog error.

In short:
Volunteers should be aware that other errors may occur.
ID: 35802

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,095,661
RAC: 103,406
Message 35803 - Posted: 7 Jul 2018, 7:27:45 UTC - in response to Message 35802.  

+1
ID: 35803

Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,492
RAC: 15,942
Message 35804 - Posted: 7 Jul 2018, 8:15:38 UTC - in response to Message 35802.  

Depending on the project mix that runs on the host, this workaround may have unwanted side effects.

As soon as the network activity is suspended, the BOINC client will start other tasks.
The LHC VMs will typically stay in RAM and - if there is not enough RAM - higher swapping activity may be encountered.
The latter is suspected of causing timing problems once the VMs resume (especially if they resume concurrently) and may result in a watchdog error.

In short:
Volunteers should be aware that other errors may occur.

Why do you think this is happening? If I manually stop network activity on a Windows machine, task swapping does not happen. And I don't think that the BOINC network status has any effect on the network status inside the VM. It may affect native tasks though.
ID: 35804

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,925,441
RAC: 137,691
Message 35805 - Posted: 7 Jul 2018, 8:41:36 UTC - in response to Message 35804.  

Why do you think this is happening? ...

VMs are configured to always need an active network. If you suspend the network, the VMs will also be suspended. Depending on the project mix, the BOINC client will start/resume tasks from other projects (or SixTrack, which doesn't require a permanent network connection).
Those tasks need more or less RAM and, depending on the local preferences, this may cause suspended tasks to be swapped out.

A lot of "ifs", I know. But volunteers should be aware of it.
Therefore:
In short:
Volunteers should be aware that other errors may occur.


ATLAS native behaves differently, e.g. it ignores suspend/resume.
This is not 100% BOINC client compatible.
ID: 35805


©2024 CERN