Downloads have stalled

Author	Message
David Lambert Joined: 2 Dec 05 Posts: 11 Credit: 2,607,594 RAC: 0	Message 35767 - Posted: 4 Jul 2018, 16:07:37 UTC - in response to Message 35766. My downloads are better. Some still get stuck, and apparently the "retry" option doesn't actually work: it just throws an "error while downloading". Uploads are starting to get through but are still taking hours.

David Lambert Joined: 2 Dec 05 Posts: 11 Credit: 2,607,594 RAC: 0	Message 35768 - Posted: 4 Jul 2018, 16:09:55 UTC - in response to Message 35767. 'Retry' only works on failed downloads. If the download is active, 'Retry' doesn't do anything.

BelgianEnthousiast Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0	Message 35775 - Posted: 5 Jul 2018, 10:34:34 UTC - in response to Message 35768. To the LHC administrators: On my machine (Win 10 Pro, see my previous post) I observe the following behaviour:
1. Only downloads of LHC (ATLAS) WUs stall; Rosetta, WorldComGrid, ClimatePrediction and GPUGrid download just fine.
2. The downloads (usually the big ones, > 200 MB) start well at 5-7 Mbps but gradually degrade, and at around 50-80% of the total file size the download speed drops to zero.
3. At that point, all other downloads from LHC are blocked.
4. When I deactivate the network and reactivate it, the downloads resume, and if the big file had progressed far enough to complete, the client goes on to download the smaller files as well. However, if too much of the file was left to download, I observe exactly the same behaviour again: the download speed decreases over time to zero and the download stalls.
5. I also saw the same thing after suspending all WUs, exiting BOINC, restarting it and then resuming all WUs.
Can you please investigate whether this has to do with the BOINC Manager, the Windows TCP/IP stack or the (T)FTP protocol, or with the file transfer software on your end? I lost a whole night of crunching because of this once again (the 4th or 5th time in a week). Many thanks in advance! B.

maeax Joined: 2 May 07 Posts: 2267 Credit: 175,671,719 RAC: 37	Message 35777 - Posted: 5 Jul 2018, 11:54:29 UTC I saw the same last night on one PC, with a download running for 7 hours. After I paused and reactivated network activity in BOINC this morning, it finished. A buffer overflow in the network? Is the network too busy? More than 300k BOINC tasks for ATLAS on the 3rd of July? An investigation is needed, but it is not easy (very dynamic traffic).

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 35779 - Posted: 5 Jul 2018, 12:22:37 UTC - in response to Message 35777. We are investigating the problem along with the server admins. It may be due to higher load or a load-balancing issue on the servers but we don't know yet. We are planning to ask some of our power users to switch to the dev project to see if this helps.

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 455 Credit: 213,668,191 RAC: 4,674	Message 35780 - Posted: 5 Jul 2018, 12:25:46 UTC - in response to Message 35779. Last modified: 5 Jul 2018, 12:26:08 UTC "We are investigating the problem along with the server admins. It may be due to higher load or a load-balancing issue on the servers but we don't know yet. We are planning to ask some of our power users to switch to the dev project to see if this helps." I did this when the problem came up, but the download problem also occurred with dev. Supporting BOINC, a great concept!

Laurence Project administrator Project developer Joined: 20 Jun 14 Posts: 401 Credit: 238,712 RAC: 0	Message 35786 - Posted: 6 Jul 2018, 13:01:48 UTC - in response to Message 35780. Did reducing the load help or is the issue still there?

maeax Joined: 2 May 07 Posts: 2267 Credit: 175,671,719 RAC: 37	Message 35789 - Posted: 6 Jul 2018, 13:32:50 UTC - in response to Message 35786. At the moment I have one Windows PC with a stalling transfer. From yesterday evening until now there had been no problems.

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 455 Credit: 213,668,191 RAC: 4,674	Message 35792 - Posted: 6 Jul 2018, 14:41:24 UTC - in response to Message 35786. "Did reducing the load help or is the issue still there?" The issue is still here, look: [screenshot] Please check the switch(es) on the way out of CERN. I had a similar problem and we searched for the cause for several months. Finally we discovered that a switch had a faulty port, and since we replaced the switch all our problems are gone.

maeax Joined: 2 May 07 Posts: 2267 Credit: 175,671,719 RAC: 37	Message 35794 - Posted: 6 Jul 2018, 16:47:19 UTC Last modified: 6 Jul 2018, 16:51:05 UTC Two ATLAS tasks are downloading together at the moment on one Windows PC. One ran at the maximum speed of 7.5 Mbit/s until it finished. The second ran at 3.0 Mbit/s after the first finished and did not speed up to 7.5 Mbit/s, but stalled. Pausing and re-enabling BOINC network activity let the second one finish. Otherwise it would have ended up like the picture Yeti showed in his last message. Edit: I will test limiting the download speed in the BOINC preferences this weekend.
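For anyone who wants to try the same download limit on a Linux host from the command line, here is a rough sketch. It assumes the client reads a global_prefs_override.xml file from its data directory and that the limit is given in bytes per second through max_bytes_sec_down. The function names and the data directory you pass in are made up for illustration, and the sketch overwrites any existing override file, so back that up first.

```shell
#!/usr/bin/env bash
# Sketch: cap BOINC's download rate through a local preferences override.
# Assumptions (not from the thread): the function names are hypothetical and
# the data directory is whatever your installation uses.

# Convert a rate in Mbit/s to bytes per second (1 Mbit = 1,000,000 bits).
mbit_to_bytes_per_sec() {
    awk -v mbit="$1" 'BEGIN { printf "%d\n", mbit * 1000000 / 8 }'
}

# Write an override file limiting downloads to the given rate in Mbit/s.
# WARNING: this replaces any existing global_prefs_override.xml.
write_download_limit() {
    local data_dir="$1" mbit="$2"
    cat > "$data_dir/global_prefs_override.xml" <<EOF
<global_preferences>
   <max_bytes_sec_down>$(mbit_to_bytes_per_sec "$mbit")</max_bytes_sec_down>
</global_preferences>
EOF
}
```

After writing the file, `boinccmd --read_global_prefs_override` should make a running client pick up the change. On Windows the same setting is easier to reach through the BOINC Manager preferences.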

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 455 Credit: 213,668,191 RAC: 4,674	Message 35795 - Posted: 6 Jul 2018, 16:50:55 UTC Hours later, you can see that the client is still trying to download the same files as in my earlier post: [screenshot]

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1450 Credit: 9,747,164 RAC: 667	Message 35796 - Posted: 6 Jul 2018, 17:26:34 UTC The problem is still there, but only for ATLAS downloads, and not every new task is affected. When I download the *.root.1 file via my browser there is no problem. Stopping BOINC, copying the manually downloaded file to the project directory and restarting BOINC helps: the stalled download disappears and the task starts.
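The copy step of that manual workaround could be scripted roughly like this. It is only a sketch: the function name is hypothetical, both paths are supplied by the caller because the thread does not give them, and stopping and restarting the client is deliberately left out since that depends on the installation.

```shell
#!/usr/bin/env bash
# Sketch: place a browser-downloaded input file into the BOINC project
# directory while the client is stopped.
# Assumptions (not from the thread): the function name and both paths are
# hypothetical; stop the client before calling this and restart it afterwards.

# Copy a manually downloaded file into the project directory.
# Refuses an empty or missing source file, since a truncated download
# would not help the stalled task.
place_manual_download() {
    local src="$1" project_dir="$2"
    if [ ! -s "$src" ]; then
        echo "source file missing or empty: $src" >&2
        return 1
    fi
    if [ ! -d "$project_dir" ]; then
        echo "project directory not found: $project_dir" >&2
        return 1
    fi
    cp -- "$src" "$project_dir/$(basename -- "$src")"
}
```

A typical sequence would be to stop the client (for example with `boinccmd --quit`), call `place_manual_download` with the downloaded *.root.1 file and the LHC@home project directory, then start the client again in whatever way your system normally does.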

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 276,716,033 RAC: 147,299	Message 35797 - Posted: 6 Jul 2018, 17:34:24 UTC If Yeti's guess is right, all the workarounds mentioned in various threads will sooner or later run into the same problems as the BOINC client.

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 455 Credit: 213,668,191 RAC: 4,674	Message 35798 - Posted: 6 Jul 2018, 18:59:08 UTC - in response to Message 35792. "Please check the switch(es) on the way out of CERN. I had a similar problem and we searched for the cause for several months. Finally we discovered that a switch had a faulty port, and since we replaced the switch all our problems are gone." By the way: I don't know CERN's configuration, so maybe it is only a faulty network card or network cable.

maeax Joined: 2 May 07 Posts: 2267 Credit: 175,671,719 RAC: 37	Message 35800 - Posted: 7 Jul 2018, 5:49:15 UTC This morning a Windows PC had been downloading two ATLAS tasks in parallel for 4 hours. After BOINC network activity was paused and reactivated, they finished in 1 minute. No download speed limit was set in BOINC.

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 35801 - Posted: 7 Jul 2018, 6:46:27 UTC It might be a long while before they find the cause of this problem, so I've decided to script a kludge and run it as a cron job every 10 minutes. The script uses boinccmd to suspend network activity on my 2 hosts running ATLAS tasks, sleeps for 10 seconds, resumes network activity, then exits. The script is saved to /home/bronco/bin/atlas_dl_kludge.sh:

    boinccmd --host localhost --passwd <gui_rpc_password> --set_network_mode never
    boinccmd --host lappy --passwd <gui_rpc_password> --set_network_mode never
    sleep 10
    boinccmd --host localhost --passwd <gui_rpc_password> --set_network_mode always
    boinccmd --host lappy --passwd <gui_rpc_password> --set_network_mode always

The crontab entry:

    */10 * * * * bash /home/bronco/bin/atlas_dl_kludge.sh
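The same kludge can be wrapped in a small function so that the host list is not repeated. This is only a sketch: the function names and the DRY_RUN, PAUSE_SECONDS and BOINC_PASSWD variables are assumptions made for illustration and are not part of bronco's script. The boinccmd options are the ones used in the post.

```shell
#!/usr/bin/env bash
# Sketch: briefly toggle BOINC network activity on each host to un-stick
# stalled transfers.
# Assumptions (not from the thread): function names, DRY_RUN, PAUSE_SECONDS
# and BOINC_PASSWD are hypothetical helpers for this example.

# Run boinccmd, or only print the command when DRY_RUN=1.
run_boinccmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "boinccmd $*"
    else
        boinccmd "$@"
    fi
}

# Suspend network activity on every host given as an argument,
# wait PAUSE_SECONDS (default 10), then resume it on every host.
toggle_network() {
    local host
    for host in "$@"; do
        run_boinccmd --host "$host" --passwd "${BOINC_PASSWD:-}" --set_network_mode never
    done
    sleep "${PAUSE_SECONDS:-10}"
    for host in "$@"; do
        run_boinccmd --host "$host" --passwd "${BOINC_PASSWD:-}" --set_network_mode always
    done
}
```

Calling `toggle_network localhost lappy` at the end of the cron script would issue the same four commands as the original. Setting DRY_RUN=1 first prints them without touching the client, which is handy for checking the host list. Note that resuming with `always` replaces whatever network mode was set before, exactly as in the original script.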

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 276,716,033 RAC: 147,299	Message 35802 - Posted: 7 Jul 2018, 7:21:08 UTC - in response to Message 35801. Depending on the project mix that runs on the host, this workaround may have unwanted side effects. As soon as network activity is suspended, the BOINC client will start other tasks. The LHC VMs will typically stay in RAM and, if there is not enough RAM, increased swapping may occur. The latter is suspected of causing timing problems once the VMs resume (especially if they resume concurrently) and may result in a watchdog error. In short: volunteers should be aware that other errors may occur.

maeax Joined: 2 May 07 Posts: 2267 Credit: 175,671,719 RAC: 37	Message 35803 - Posted: 7 Jul 2018, 7:27:45 UTC - in response to Message 35802. +1

Harri Liljeroos Joined: 28 Sep 04 Posts: 765 Credit: 56,850,401 RAC: 26,919	Message 35804 - Posted: 7 Jul 2018, 8:15:38 UTC - in response to Message 35802. "Depending on the project mix that runs on the host, this workaround may have unwanted side effects. As soon as network activity is suspended, the BOINC client will start other tasks. The LHC VMs will typically stay in RAM and, if there is not enough RAM, increased swapping may occur. The latter is suspected of causing timing problems once the VMs resume (especially if they resume concurrently) and may result in a watchdog error. In short: volunteers should be aware that other errors may occur." Why do you think this is happening? If I manually stop network activity on a Windows machine, no task swapping happens. And I don't think that the BOINC network status has any effect on the network status inside a VM. It may affect native tasks, though.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 276,716,033 RAC: 147,299	Message 35805 - Posted: 7 Jul 2018, 8:41:36 UTC - in response to Message 35804. "Why do you think this is happening? ..." VMs are configured to always need an active network. If you suspend the network, the VMs will also be suspended. Depending on the project mix, the BOINC client will start or resume tasks from other projects (or SixTrack, which doesn't require a permanent network connection). Those tasks need more or less RAM and, depending on the local preferences, this may cause suspended tasks to be swapped out. A lot of "ifs", I know. But volunteers should be aware of it. Therefore: "In short: volunteers should be aware that other errors may occur." ATLAS native behaves differently, e.g. it ignores suspend/resume. This is not 100% BOINC client compatible.

LHC@home