Downloads have stalled
Author | Message |
---|---|
Send message Joined: 2 Dec 05 Posts: 11 Credit: 2,607,594 RAC: 0 |
My downloads are better... some still get stuck, and apparently the "retry" option doesn't actually work... it just throws an "error while downloading". Uploads are starting to get through but are still taking hours. |
Send message Joined: 2 Dec 05 Posts: 11 Credit: 2,607,594 RAC: 0 |
'Retry' only works on failed downloads. If the download is active, 'Retry' doesn't do anything. |
Send message Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0 |
To the LHC administrators : On my machine (Win 10 Pro - see my previous post) the behaviour I observe is the following : 1. Only LHC (Atlas) WU's downloads stall, Rosetta, WorldComGrid, ClimatePrediction, GPUGrid download just fine. 2. The downloads (usually the big ones > 200 MB) start well at 5-7 Mbps but gradually degrade and at around 50-80 % of the total filesize, the download speed decreases to zero. 3. At that point, all other downloads from LHC are blocked. 4. When de-activating network and re-activating it again, the downloads resume and if it had progressed far enough to get the whole file in, it continues to download the smaller files as well. However, if too much of the file was left to download, I observe exactly the same behaviour again : the download speed decreases over time to zero and stalls the download (again). 5. I also saw the same thing when actually suspending all WU's crunching, exiting BOINC and restarting it again, then resuming all WU's afterwards. Can you investigate whether this has to do with the BOINC manager/Windows TCP/IP stack/(T)FTP protocol or with the file transfer software on your end please ? I lost a whole night of crunching because of this once again... (4th or 5th time in a week) Many thanks in advance ! B. |
Send message Joined: 2 May 07 Posts: 2257 Credit: 174,366,760 RAC: 20,027 |
Saw the same last night on one PC, with 7 hours of downloading. After I paused and reactivated network activity in BOINC this morning, it finished. Buffer overflow in the network? Network too busy? More than 300k BOINC tasks in ATLAS on the 3rd of July? An investigation is needed, but it is not easy (very dynamic traffic). |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
We are investigating the problem along with the server admins. It may be due to higher load or a load-balancing issue on the servers but we don't know yet. We are planning to ask some of our power users to switch to the dev project to see if this helps. |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 204,238,209 RAC: 130,216 |
"We are investigating the problem along with the server admins. It may be due to higher load or a load-balancing issue on the servers but we don't know yet. We are planning to ask some of our power users to switch to the dev project to see if this helps." I did this when the problem came up, but the download problem also happened from dev. Supporting BOINC, a great concept ! |
Send message Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0 |
Did reducing the load help or is the issue still there? |
Send message Joined: 2 May 07 Posts: 2257 Credit: 174,366,760 RAC: 20,027 |
At the moment I have one Windows PC with a stalling transfer. From yesterday evening until now there had been no problems. |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 204,238,209 RAC: 130,216 |
"Did reducing the load help or is the issue still there?" The issue is still here, look: Please check the switch(es) on the way out of CERN. I had a similar problem and we searched for several months for the reason. Finally we discovered that a switch had a faulty port, and since we changed the switch all our problems are gone. Supporting BOINC, a great concept ! |
Send message Joined: 2 May 07 Posts: 2257 Credit: 174,366,760 RAC: 20,027 |
Two ATLAS tasks are downloading together at the moment on one Windows PC. One held the maximum speed of 7.5 Mbits until it finished. The second was at 3.0 Mbits after the first finished; it did not climb to 7.5 Mbits but stalled. Pausing and reactivating BOINC network activity let the second one finish. Otherwise it would have ended up like the picture Yeti showed in his last message. Edit: I will test limiting the download speed in the BOINC preferences this weekend. |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 204,238,209 RAC: 130,216 |
|
Send message Joined: 14 Jan 10 Posts: 1439 Credit: 9,624,852 RAC: 2,528 |
The problem is still there only for ATLAS-downloads, but not every new task is suffering. When I download the *.root.1 file via my browser there is no problem. Stopping BOINC, copying the manual downloaded one to the project-dir and restarting BOINC helps. The stalled download disappears and the task is starting. |
Send message Joined: 15 Jun 08 Posts: 2568 Credit: 258,728,987 RAC: 119,300 |
If Yeti's guess is right, all the workarounds mentioned in various threads will sooner or later run into the same problems as the BOINC client. |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 204,238,209 RAC: 130,216 |
"Please, check the switch(es) on the way out of CERN, I had a similar problem and we have been searching several months for the reason. Finally we discovered that a switch had a faulty port and since we changed the switch all our problems are gone." By the way: I don't know CERN's configuration, so maybe it is only a faulty network card or network cable. Supporting BOINC, a great concept ! |
Send message Joined: 2 May 07 Posts: 2257 Credit: 174,366,760 RAC: 20,027 |
This morning a Windows PC had been downloading two ATLAS tasks in parallel for 4 hours. After BOINC network activity was paused and reactivated, they finished in 1 minute. No download-speed limit was set in BOINC. |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
It might be a long while before they find the cause of this problem so I've decided to script a kludge and run the kludge as a cron job every 10 minutes. The script uses boinccmd to suspend network activity on my 2 hosts running ATLAS tasks. It sleeps for 10 seconds then resumes network activity then exits. The script is saved to /home/bronco/bin/atlas_dl_kludge.sh

    boinccmd --host localhost --passwd <gui_rpc_password> --set_network_mode never
    boinccmd --host lappy --passwd <gui_rpc_password> --set_network_mode never
    sleep 10
    boinccmd --host localhost --passwd <gui_rpc_password> --set_network_mode always
    boinccmd --host lappy --passwd <gui_rpc_password> --set_network_mode always

The crontab entry:

    */10 * * * * bash /home/bronco/bin/atlas_dl_kludge.sh

|
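The kludge in the post above can also be written as a small function that loops over a host list, which may be easier to adapt to a different number of machines. This is only a sketch of the same suspend, sleep, resume sequence, not the poster's actual script. The names `BOINCCMD`, `PAUSE_SECONDS`, `GUI_RPC_PASSWORD`, `set_mode` and `cycle_network` are hypothetical and introduced here for illustration; the `boinccmd` options are the ones used in the post.

```shell
#!/usr/bin/env bash
# Sketch: briefly suspend and then resume BOINC network activity on each host.
# Assumptions (not from the original post): the variable and function names,
# and passing the GUI RPC password via the GUI_RPC_PASSWORD environment variable.

BOINCCMD="${BOINCCMD:-boinccmd}"        # command used to talk to the BOINC client
PAUSE_SECONDS="${PAUSE_SECONDS:-10}"    # how long network activity stays suspended

# set_mode <host> <mode>: set the network mode (never/always) on one host.
set_mode() {
    "$BOINCCMD" --host "$1" --passwd "$GUI_RPC_PASSWORD" --set_network_mode "$2"
}

# cycle_network <host>...: suspend all hosts, wait, then resume all hosts.
cycle_network() {
    local host
    for host in "$@"; do
        set_mode "$host" never
    done
    sleep "$PAUSE_SECONDS"
    for host in "$@"; do
        set_mode "$host" always
    done
}

# Run only when hosts are given on the command line.
if [ "$#" -gt 0 ]; then
    cycle_network "$@"
fi
```

Assuming the file were saved under the same path as in the post, it could be called from cron as `bash /home/bronco/bin/atlas_dl_kludge.sh localhost lappy`. Note that resuming with `always` overrides any network preferences on the host; `auto` is the mode that hands control back to the configured preferences, if that is what is wanted.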
Send message Joined: 15 Jun 08 Posts: 2568 Credit: 258,728,987 RAC: 119,300 |
Depending on the project mix that runs on the host, this workaround may have unwanted side effects. As soon as the network activity is suspended, the BOINC client will start other tasks. The LHC VMs will typically stay in RAM and, if there is not enough RAM, higher swapping activity may be encountered. The latter is suspected of causing timing problems once the VMs resume (especially if they resume concurrently) and may result in a watchdog error. In short: volunteers should be aware that other errors may occur. |
Send message Joined: 2 May 07 Posts: 2257 Credit: 174,366,760 RAC: 20,027 |
+1 |
Send message Joined: 28 Sep 04 Posts: 739 Credit: 50,747,550 RAC: 37,244 |
"Depending on the project mix that runs on the host, this workaround may have unwanted side effects." Why do you think this is happening? If I manually stop network activity on a Windows machine, task swapping does not happen. And I don't think that the BOINC network status has any effect on the network status inside the VM. It may affect native tasks, though. |
Send message Joined: 15 Jun 08 Posts: 2568 Credit: 258,728,987 RAC: 119,300 |
"Why do you think this is happening? ..." VMs are configured to always need an active network. If you suspend the network, the VMs will also be suspended. Depending on the project mix, the BOINC client will start/resume tasks from other projects (or SixTrack, which doesn't require a permanent network connection). Those tasks need more or less RAM and, depending on the local preferences, this may cause suspended tasks to be swapped out. A lot of "ifs", I know. But volunteers should be aware of it. Therefore: In short: ATLAS native behaves differently, e.g. it ignores suspend/resume. This is not 100% BOINC client compatible. |
©2025 CERN