Transfer issues

Alessio Mereghetti Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0	Message 33963 - Posted: 20 Jan 2018, 14:40:21 UTC - in response to Message 33926. I think that the IT guys have run some cleaning scripts - I will trigger them again. Thanks for monitoring! Cheers, A. ID: 33963

AuxRx Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 33965 - Posted: 20 Jan 2018, 14:45:10 UTC - in response to Message 33963. Not sure if this is related, but I was able to upload two stuck results: one yesterday and one today. Thank you! ID: 33965

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1203 Credit: 69,424,159 RAC: 67,459	Message 33974 - Posted: 20 Jan 2018, 15:58:54 UTC - in response to Message 33965. I am still having this problem on several of my 8-core machines. The one I am on has 3 tasks that have been stuck on the return page for several days. Another task just finished and was sent right up, but these 3 won't leave: one has been there so long that it is now past due, and the other two are due on the 25th. I have been trying to send them manually over and over every time I check each computer. Volunteer Mad Scientist For Life ID: 33974

Gunnar Hjern Joined: 14 Jul 17 Posts: 7 Credit: 260,936 RAC: 0	Message 33977 - Posted: 20 Jan 2018, 18:01:44 UTC - in response to Message 33974. Hi! I currently have 6 or 7 tasks stuck uploading on some of my computers. On the one I'm using right now there are these three:
Sat 20 Jan 2018 06:43:13 PM CET | LHC@home | Started upload of w-c1_job.B1inj_c1.2158__50__s__64.28_59.31__17.1_18.1__6__10.5_1_sixvf_boinc52931_0_r399750021_0
Sat 20 Jan 2018 06:43:28 PM CET | LHC@home | [error] Error reported by file upload server: [w-c1_job.B1inj_c1.2158__50__s__64.28_59.31__17.1_18.1__6__10.5_1_sixvf_boinc52931_0_r399750021_0] locked by file_upload_handler PID=-1
Sat 20 Jan 2018 06:43:28 PM CET | LHC@home | Temporarily failed upload of w-c1_job.B1inj_c1.2158__50__s__64.28_59.31__17.1_18.1__6__10.5_1_sixvf_boinc52931_0_r399750021_0: transient upload error
Sat 20 Jan 2018 06:43:28 PM CET | LHC@home | Backing off 04:49:27 on upload of w-c1_job.B1inj_c1.2158__50__s__64.28_59.31__17.1_18.1__6__10.5_1_sixvf_boinc52931_0_r399750021_0
Sat 20 Jan 2018 06:43:29 PM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__15__s__62.31_60.32__3.1_3.2__5__3_1_sixvf_boinc50211_0_r725332866_0
Sat 20 Jan 2018 06:43:46 PM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__15__s__62.31_60.32__3.1_3.2__5__15_1_sixvf_boinc50219_0_r867841013_0
Sat 20 Jan 2018 06:43:53 PM CET | LHC@home | [error] Error reported by file upload server: [LHC_2015_LHC_2015_234_BOINC_errors__15__s__62.31_60.32__3.1_3.2__5__3_1_sixvf_boinc50211_0_r725332866_0] locked by file_upload_handler PID=-1
Sat 20 Jan 2018 06:43:53 PM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__15__s__62.31_60.32__3.1_3.2__5__3_1_sixvf_boinc50211_0_r725332866_0: transient upload error
Sat 20 Jan 2018 06:43:53 PM CET | LHC@home | Backing off 03:09:52 on upload of LHC_2015_LHC_2015_234_BOINC_errors__15__s__62.31_60.32__3.1_3.2__5__3_1_sixvf_boinc50211_0_r725332866_0
Sat 20 Jan 2018 06:44:04 PM CET | LHC@home | [error] Error reported by file upload server: [LHC_2015_LHC_2015_234_BOINC_errors__15__s__62.31_60.32__3.1_3.2__5__15_1_sixvf_boinc50219_0_r867841013_0] locked by file_upload_handler PID=-1
Sat 20 Jan 2018 06:44:04 PM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__15__s__62.31_60.32__3.1_3.2__5__15_1_sixvf_boinc50219_0_r867841013_0: transient upload error
Sat 20 Jan 2018 06:44:04 PM CET | LHC@home | Backing off 05:53:38 on upload of LHC_2015_LHC_2015_234_BOINC_errors__15__s__62.31_60.32__3.1_3.2__5__15_1_sixvf_boinc50219_0_r867841013_0
I've heard something about file fragments on the server. If that is the case, maybe they can be removed with a command like:
find /correct/path/ -type f -name "*sixvf*" -size +220c -size -250c -mtime -20 -exec rm -f {} \;
(assuming the files on the server get the same names as on my client computer; otherwise the -name pattern needs to be changed). This command would delete files larger than 220 and smaller than 250 bytes, which seems to be their actual size, judging from how much was reported as uploaded before each transfer stopped (between 0.46 % and 0.51 % of files of approx. 44 kB). Hope that you can resolve the issue before the tasks run out of time. *Have a nice day!!!* Kindest regards, Gunnar ID: 33977
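Before deleting anything on a server, it can help to see what such a cleanup would actually touch. Below is a minimal Python sketch of a dry run of that find command: it lists regular files whose names contain "sixvf" anywhere (the glob `*sixvf*`; a bare `-name "sixvf"` would only match a file named exactly `sixvf`), whose size is strictly between 220 and 250 bytes, and that were modified less than 20 days ago. It deletes nothing. The function name `find_fragment_candidates` and its parameters are hypothetical, and whether the server-side fragments really carry the client-side names is, as Gunnar says, an assumption.

```python
import fnmatch
import os
import stat
import time


def find_fragment_candidates(root, pattern="*sixvf*", min_size=220,
                             max_size=250, max_age_days=20):
    """Dry-run analogue of:
        find ROOT -type f -name PATTERN -size +220c -size -250c -mtime -20
    Returns the matching paths, sorted. Nothing is deleted."""
    # find's "-mtime -20" means "modified less than 20 x 24 hours ago".
    cutoff = time.time() - max_age_days * 24 * 3600
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-sensitive glob match, like find's -name.
            if not fnmatch.fnmatchcase(name, pattern):
                continue
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # do not follow symlinks, like -type f
            if not stat.S_ISREG(info.st_mode):
                continue
            # "-size +220c -size -250c": more than 220 and fewer than 250 bytes.
            if min_size < info.st_size < max_size and info.st_mtime > cutoff:
                matches.append(path)
    return sorted(matches)
```

Calling `find_fragment_candidates("/path/to/upload/dir")` (a placeholder path) on the real upload directory and reviewing the list first would show whether the size window catches only partial uploads and not complete result files.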

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1203 Credit: 69,424,159 RAC: 67,459	Message 34002 - Posted: 21 Jan 2018, 9:32:47 UTC Last modified: 21 Jan 2018, 9:51:08 UTC I have had 3 tasks stuck on one of my machines for days, so I checked this PC's stats page under the Error list. One of them is listed there as "Timed out - no response", which means a finished task that was ready to send a couple of days before the due date is now invalid, because the server refused to take it back (while it was taking other finished tasks). So I am just going to abort this one and keep trying to send in the other 2 finished tasks whenever I am on this machine, since they still have 4 days left before their due date. With 50 cores running these, I sure spend a lot of time checking them all and trying to make sure most of them get returned. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10451775 Task 172348631, work unit 83568818, sent 12 Jan 2018, 2:43:21 UTC, time reported or deadline 19 Jan 2018, 18:15:35 UTC, status: Timed out - no response. After I aborted it just now, it disappeared from the Error list and is now gone completely from my account under ALL settings. Edit: OK, now checking the next 8-core here, I see I have 6 of those AVX tasks that refuse to hit the road, and they are the 10-second to 1-minute versions. One thing that would help with this is if someone on the server side would come here and say that we should just abort them and send them back that way, so we don't end up with an even longer list of unreturned tasks (at least the past-due ones that in reality were finished on time). (OK, time to check the rest of them.) ID: 34002

AuxRx Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 34003 - Posted: 21 Jan 2018, 10:26:34 UTC - in response to Message 34002. "One thing that would help with this is if someone at the server would come here and say that we should just abort them and send them back that way so we don't end up with an even longer list of unreturned tasks. (at least with maybe the past due ones that in reality are finished on time)" +1 ID: 34003

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 274,287,915 RAC: 103,463	Message 34099 - Posted: 26 Jan 2018, 14:24:41 UTC
Fr 26 Jan 2018 15:17:40 CET | LHC@home | Temporarily failed download of w-c3_job.B1inj_c3.2158__52__s__64.28_59.31__5.1_6.1__6__78_1_sixvf_boinc54488.zip: transient HTTP error
I see a couple of those messages in the log this afternoon. Maybe the servers need some friendly words. ID: 34099
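The event-log lines quoted in this thread follow a regular `date | project | message` layout, so stuck transfers and their back-off times can be pulled out mechanically instead of by eye. A minimal Python sketch, assuming exactly the message wording seen in these posts ("Temporarily failed upload/download of NAME: REASON" and "Backing off H:MM:SS on upload/download of NAME"); other BOINC client versions may word them differently, and `summarize_transfers` is a hypothetical helper, not part of BOINC.

```python
import re
from datetime import timedelta

# Message formats as they appear in the logs quoted in this thread
# (assumption: other client versions may word them differently).
FAILED = re.compile(
    r"\|\s*[^|]+?\s*\|\s*Temporarily failed "
    r"(?P<direction>upload|download) of (?P<name>\S+?): (?P<reason>.+)$")
BACKOFF = re.compile(
    r"\|\s*[^|]+?\s*\|\s*Backing off "
    r"(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2}) on "
    r"(?P<direction>upload|download) of (?P<name>\S+)$")


def summarize_transfers(lines):
    """Map (direction, file name) -> {"reason": str or None, "backoff": timedelta or None}."""
    stuck = {}
    for raw in lines:
        line = raw.strip()
        failed = FAILED.search(line)
        if failed:
            entry = stuck.setdefault((failed["direction"], failed["name"]),
                                     {"reason": None, "backoff": None})
            entry["reason"] = failed["reason"]
            continue
        backoff = BACKOFF.search(line)
        if backoff:
            entry = stuck.setdefault((backoff["direction"], backoff["name"]),
                                     {"reason": None, "backoff": None})
            entry["backoff"] = timedelta(hours=int(backoff["h"]),
                                         minutes=int(backoff["m"]),
                                         seconds=int(backoff["s"]))
    return stuck
```

Lines that match neither pattern (such as the "[error] Error reported by file upload server" lines) are simply skipped, so a whole event log can be fed in as-is.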

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,457,835 RAC: 50,062	Message 34111 - Posted: 27 Jan 2018, 9:18:46 UTC Although about 770,000 "unsent" tasks are shown on the Project Status Page, I have been trying unsuccessfully to download tasks for hours now. It always says "No tasks are available for Sixtrack" :-( Will this problem ever be solved? ID: 34111

AuxRx Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 34112 - Posted: 27 Jan 2018, 9:44:41 UTC I've run out of work this morning. My system has been assigned new tasks, but the download is delayed again and again. Not worth it, shutting down for the weekend. ID: 34112

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,457,835 RAC: 50,062	Message 34113 - Posted: 27 Jan 2018, 9:51:12 UTC - in response to Message 34112. "I've run out of work this morning. My system has been assigned new tasks, but the download is delayed again and again." And again, as already happened many times with ATLAS, I question what sense it makes to pump several hundred thousand new tasks into the mill when not even their downloads work, not to mention the upload problems. ID: 34113

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0	Message 34114 - Posted: 27 Jan 2018, 14:12:42 UTC - in response to Message 34111. "Although about 770,000 "unsent" tasks are shown on the Project Status Page, I have been unsuccessfully trying to download tasks for hours now. It always says "No tasks are available for Sixtrack" :-( Will this problem ever be solved?" OK, so it's not just me. Good to know. All the Theory WUs use so much internet bandwidth that I can't even watch a video. I knew there were no SixTrack tasks coming down because I couldn't watch Democracy Now! this morning. ID: 34114

T.J. Joined: 17 Feb 07 Posts: 86 Credit: 968,855 RAC: 0	Message 34116 - Posted: 27 Jan 2018, 16:59:47 UTC Uploading and downloading of WUs is still not going smoothly. Sometimes it takes a few hours for a single one to upload while the others go quickly. But with patience, all is going well. Greetings from, TJ ID: 34116

Ray Murray Volunteer moderator Joined: 29 Sep 04 Posts: 281 Credit: 11,866,449 RAC: 0	Message 34117 - Posted: 27 Jan 2018, 18:55:58 UTC Last modified: 27 Jan 2018, 19:04:38 UTC While there is still, clearly, some issue with the servers, would it perhaps be sensible to (briefly) suspend the issue of NEW work to allow the servers to catch up with resends? I have 5 tasks that have been Validation Pending for over a week but have yet to be sent to a second or third wingman. As these seem to go to the back of the queue, they take up vital space on the server for longer than necessary, rather than being sent out quickly and allowed to clear. Surely allowing the resend backlog to clear would free up space and then allow a clearer run for the new work. [I've said "clearly", "clear" and "clearer" a few more times than I had intended, but hopefully I've made my point clear 8¬) ] ID: 34117

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1203 Credit: 69,424,159 RAC: 67,459	Message 34118 - Posted: 27 Jan 2018, 19:40:56 UTC Clearest? It has been a bit slow, but at least this time, if I retry the transfer several times and do a BOINC update, it goes through instead of sitting there past the due date... Of course, I could also just be getting lucky. ID: 34118

AuxRx Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 34119 - Posted: 27 Jan 2018, 21:23:46 UTC - in response to Message 34117. Long validation times don't bother me personally. But it would be prudent to control job creation by staggering it. AFAIK over-filling the queue with large batches does not speed up production. In fact, at work we try to achieve the opposite, in accordance with best-practice Lean Management principles. Can job creation be triggered automatically and staggered in smaller batches? Could job creation be rescheduled so that staff are available to deal with hiccups? ID: 34119
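The staggering idea can be sketched as a simple top-up rule: each cycle, release new jobs only until the unsent queue reaches a target size, and never more than a fixed batch at once. This is a hypothetical Python illustration; `plan_release`, `target_unsent`, `batch_size` and the fixed per-cycle dispatch rate are made-up names and assumptions, and it says nothing about how LHC@home actually creates SixTrack work.

```python
def plan_release(pending, unsent, target_unsent, batch_size):
    """Number of new jobs to release this cycle (hypothetical policy).

    Tops the unsent queue up towards target_unsent, but never releases
    more than batch_size at once, or more than is still waiting."""
    room = max(0, target_unsent - unsent)
    return min(room, batch_size, pending)


def simulate(total_jobs, target_unsent, batch_size, dispatched_per_cycle, cycles):
    """Toy model: returns the unsent-queue length at the end of each cycle."""
    pending, unsent = total_jobs, 0
    history = []
    for _ in range(cycles):
        release = plan_release(pending, unsent, target_unsent, batch_size)
        pending -= release
        unsent += release
        # Hosts fetch work; assume a fixed dispatch capacity per cycle.
        unsent -= min(unsent, dispatched_per_cycle)
        history.append(unsent)
    return history
```

With a million jobs waiting, a target of 100 unsent and batches of 40, `simulate(1_000_000, 100, 40, 30, 20)` never lets the unsent queue exceed 100, however large the backlog behind it is; the rest stays uncreated until there is room.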

Alessio Mereghetti Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0	Message 34124 - Posted: 28 Jan 2018, 10:54:21 UTC - in response to Message 34119. Hi, As for the upload/download issues, we are in the hands of the IT experts. I fear that the large result files of ATLAS tasks play a role in that... As for the big delays in validation, these are due to the large number of queued tasks; I am wondering how complicated it is to re-issue validation-failing tasks with higher priority. As for managing the load of SixTrack tasks, I agree that having large batches does not speed up production. The only real drawbacks are:
- long times in validating results, when tasks need to be re-issued;
- long times in releasing new versions of exes.
Thanks for the feedback - I will discuss this with IT. ID: 34124
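To illustrate what "re-issue with higher priority" would change compared with resends going to the back of the queue, here is a toy Python dispatch queue in which re-issued tasks are always handed out before fresh ones, with first-in, first-out order within each class. It is a hypothetical sketch built on `heapq`, not BOINC's scheduler; BOINC configures its real mechanism through server-side project options.

```python
import heapq
import itertools


class DispatchQueue:
    """Toy dispatch queue: re-issued tasks go out before fresh work;
    within each class, tasks leave in arrival order."""

    RETRY, FRESH = 0, 1  # lower value = served first

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, task_id, is_retry=False):
        cls = self.RETRY if is_retry else self.FRESH
        heapq.heappush(self._heap, (cls, next(self._seq), task_id))

    def pop(self):
        # Raises IndexError when the queue is empty, like heapq.heappop.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A resend pushed after a million fresh tasks would be popped first, so a result waiting for a second or third wingman would not sit behind the whole new batch. The trade-off is that a steady stream of retries could starve fresh work, which a real scheduler would need to bound.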

Erich56 Joined: 18 Dec 15 Posts: 1874 Credit: 137,457,835 RAC: 50,062	Message 34125 - Posted: 28 Jan 2018, 11:38:32 UTC - in response to Message 34117. "While there is still, clearly, some issue with the servers, would it perhaps be sensible to (briefly) suspend the issue of NEW work to allow the servers to catch up with resends? I have 5 that are Validation Pending for over a week but have yet to be sent to a second or third wingman but as these seem to go to the back of the queue, they are taking up vital space on the server for longer than necessary, rather than being sent out quickly and allowed to clear. Surely allowing the resend backlog to clear would free up space and then allow a clearer run for the new work." This is exactly what I have been saying in several postings on the various message boards here, ever since the whole mess started with far too many ATLAS tasks in mid-December. I simply don't understand what sense it makes to pump even more tasks into the mill while neither the servers nor the network can handle these huge amounts of data. So I'm once more rather surprised to see that this morning the number of unsent SixTrack tasks has passed one million (and is still growing). But perhaps my thinking is totally wrong, and one of the mods or someone else in charge could enlighten me. ID: 34125

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 869 Credit: 725,289,794 RAC: 168,810	Message 34128 - Posted: 28 Jan 2018, 14:34:27 UTC Hi Alessio, There is an option in BOINC to prioritise re-issues, called "Accelerating retries": https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Acceleratingretries Maybe you can check the settings with IT? ID: 34128

LHC@home