File upload issues

Author	Message
computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2710 Credit: 291,971,137 RAC: 145,162	Message 33899 - Posted: 17 Jan 2018, 20:06:23 UTC - in response to Message 33895.
...Those "partly uploaded file", are they on my machine or on the server?
It's the result file on your computer and (only partly) on the server. Your computer will automatically retry the upload until it's finished completely.
...Do I need to take any actions, or is the problem going to solve itself when the servers are less busy?
No action required on your side, as Erich56 already stated.
...I currently have half a dozen or so tasks that are stuck in uploading state, and they represent together several days of hard computing so I'd hate to have to abort them! :-(
If you abort it, the work will be lost. So ...
...As I can see on the server stat page, there are several thousands of items in the tasks and WU's "waiting for deletion" queues, and a whopping 768973 tasks to send!! :-O
Nothing to worry about. It is just information for the server admins.
...Will this issue be solved by itself once they are crunched and validated? (hopefully before the deadlines expires)
Violating the deadline is the only critical point.
ID: 33899 · Reply Quote

Gunnar Hjern Send message Joined: 14 Jul 17 Posts: 7 Credit: 260,936 RAC: 0	Message 33902 - Posted: 17 Jan 2018, 21:16:49 UTC - in response to Message 33899. Thanks for your explanations! Have a nice day!! /Gunnar ID: 33902 · Reply Quote

AuxRx Send message Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 33903 - Posted: 17 Jan 2018, 22:57:02 UTC - in response to Message 33899. Violating the deadline is the only critical point. I'd like to know where this information is coming from. If literally nothing changes, how is this mark critical? ID: 33903 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2710 Credit: 291,971,137 RAC: 145,162	Message 33906 - Posted: 18 Jan 2018, 10:21:49 UTC - in response to Message 33903.
Violating the deadline is the only critical point.
I'd like to know where this information is coming from. If literally nothing changes, how is this mark critical?
My comment only makes sense related to Gunnar's posts. A more precise explanation can be found in the BOINC documentation, e.g.:
https://boinc.berkeley.edu/trac/wiki/JobReplication
https://boinc.berkeley.edu/trac/wiki/ProjectOptions
Be aware that the JobReplication page explains it using "min_quorum = 2" and "target_nresults = 3" while LHC projects use different values. Results that are cancelled or reported after "client_deadline + grace_period" will never be rewarded. This can be seen in the project database as long as the records are available.
normal WU -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=83734849
WU with aborted task -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=83608902
WU with deadline violation -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82393750
ID: 33906 · Reply Quote

AuxRx Send message Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 33910 - Posted: 18 Jan 2018, 13:57:58 UTC - in response to Message 33906.
Results that are cancelled or reported after "client_deadline + grace_period" will never be rewarded.
This precisely is in question. Where do you get this?
<report_grace_period>x</report_grace_period>
<grace_period_hours>x</grace_period_hours>
A "grace period" (in seconds or hours respectively) for task reporting. A task is considered time-out (and a new replica generated) if it is not reported by client_deadline + x.
... does not suggest the initially missing result is precluded from validation if the task is replicated for a third wingman. I guess it depends on what the grace period is, but I'm pretty sure I have seen WUs finished by the timed-out result, not by the re-created one. That would suggest LHC is set up like most projects, which is to accept and validate the first results to be returned.
If what you say is true, why wouldn't LHC abort/cancel the WU? I can only think of advantages to cancelling a WU that the project will not consider for validation. For one, the volunteer would stop clogging the servers, could get new work, would free disk space, and so on.
ID: 33910 · Reply Quote
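As a rough illustration of the rule quoted above from the ProjectOptions page, the following Python sketch computes the moment after which a task would count as timed out. It only mirrors the quoted sentence (timed out if not reported by client_deadline + x). The function names, the example deadline, and the six-hour grace period are hypothetical, and the sketch says nothing about whether a late result can still validate, which is the open question in this thread.

```python
from datetime import datetime, timedelta, timezone


def timeout_time(client_deadline, grace_seconds=0, grace_hours=0):
    """Return client_deadline + x, where x is given in seconds and/or hours."""
    return client_deadline + timedelta(seconds=grace_seconds, hours=grace_hours)


def is_timed_out(reported_at, client_deadline, grace_seconds=0, grace_hours=0):
    """True if the report arrives after client_deadline + x (quoted rule only)."""
    return reported_at > timeout_time(client_deadline, grace_seconds, grace_hours)


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the rule.
    deadline = datetime(2018, 1, 5, 12, 0, tzinfo=timezone.utc)
    late_report = deadline + timedelta(hours=7)
    print(is_timed_out(late_report, deadline, grace_hours=6))
```

With a six-hour grace period, a report seven hours after the deadline falls outside the window, while one five hours after it does not.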

maeax Send message Joined: 2 May 07 Posts: 2278 Credit: 178,775,457 RAC: 1,891	Message 33911 - Posted: 18 Jan 2018, 16:40:29 UTC My computer was down for ten days, so the following sixtrack task went past its deadline on 18/1/5. In the LHC stats this task was marked as not finished, with the date 18/1/5. Once the computer was back, I thought it best to delete the running task. But after I deleted this running task in BOINC, you can see that the deadline was refreshed to the time when I deleted the task. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82397136 It seems that LHC has accepted the later time for finishing? ID: 33911 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 215,197,406 RAC: 1,612	Message 33912 - Posted: 18 Jan 2018, 16:44:51 UTC - in response to Message 33911.
My computer was down for ten days, so the following sixtrack task went past its deadline on 18/1/5. In the LHC stats this task was marked as not finished, with the date 18/1/5. Once the computer was back, I thought it best to delete the running task. But after I deleted this running task in BOINC, you can see that the deadline was refreshed to the time when I deleted the task. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82397136 It seems that LHC has accepted the later time for finishing?
Nope. The field first contains the deadline as long as the task has not come back, and then changes to the date when the client reports what has happened with the WU.
Supporting BOINC, a great concept ! ID: 33912 · Reply Quote

Gunnar Hjern Send message Joined: 14 Jul 17 Posts: 7 Credit: 260,936 RAC: 0	Message 33975 - Posted: 20 Jan 2018, 16:00:26 UTC - in response to Message 33874.
Hi! A lot of upload-stuck tasks will soon hit their deadline and many hours of computer work will be wasted! :-( Would it be possible for some sysadmin to manually erase those faulty file fragments on the server? For example with some command like:
> find /correct/path/ -size +220c -size -250c -mtime -20 -exec rm -f {} \;
(Afaik they are about 220 to 250 bytes, and they should be younger than 20 days. If some common substring "sub" of the file names is known, you can of course add -name "*sub*" to the params for find.)
Not only would it save the work done by us clients, but I think it would lessen the workload of the servers too, as far fewer client computers would then frequently retry uploading the stuck files.
Have a nice day!!! Kindest regards, Gunnar Hjern ID: 33975 · Reply Quote
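Before running anything with rm, it can help to list what the filter would match. The Python sketch below is a dry-run approximation of the size and age tests in the find command above: strictly more than 220 and strictly fewer than 250 bytes, and fewer than 20 whole days since last modification. The function names are hypothetical, the path is whatever the admin passes in, and nothing is deleted.

```python
import os
import time

SECONDS_PER_DAY = 86400


def matches(size_bytes, age_seconds, min_size=220, max_size=250, max_age_days=20):
    """Approximate `-size +220c -size -250c -mtime -20` for one file."""
    whole_days_old = age_seconds // SECONDS_PER_DAY  # find ignores fractional days
    return min_size < size_bytes < max_size and whole_days_old < max_age_days


def list_candidates(root, now=None):
    """Return sorted paths under root that pass the filter (dry run only)."""
    now = time.time() if now is None else now
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # do not follow symlinks
            if matches(info.st_size, now - info.st_mtime):
                hits.append(path)
    return sorted(hits)
```

Calling `list_candidates("/correct/path/")` and reviewing the result is a cautious first step; whether such fragments are safe to remove is for the project admins to judge.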

grumpy Send message Joined: 1 Sep 04 Posts: 57 Credit: 2,835,005 RAC: 0	Message 33993 - Posted: 21 Jan 2018, 4:57:26 UTC
2018-01-20 10:51:51 PM | LHC@home | Started upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 10:55:54 PM | LHC@home | [error] Error reported by file upload server: can't write file h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: Disk quota exceeded
2018-01-20 10:55:54 PM | LHC@home | Temporarily failed upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: transient upload error
2018-01-20 10:55:54 PM | LHC@home | Backing off 00:54:55 on upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 11:47:49 PM | LHC@home | Started upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 11:47:51 PM | LHC@home | [error] Error reported by file upload server: Server is out of disk space
2018-01-20 11:47:51 PM | LHC@home | Temporarily failed upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: transient upload error
2018-01-20 11:47:51 PM | LHC@home | Backing off 01:31:27 on upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
I don't think it really takes weeks to solve this kind of problem. Wake up!!
ID: 33993 · Reply Quote
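For anyone who wants to track how long the client waits between retries, the "Backing off" entries in a log like the one above can be pulled out with a short script. This is a minimal Python sketch that assumes the message format shown in this thread ("Backing off HH:MM:SS on upload of NAME"); the function name is hypothetical and the pattern may need adjusting for other client versions.

```python
import re
from datetime import timedelta

# Matches the "Backing off HH:MM:SS on upload of <file>" messages seen above.
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def parse_backoffs(log_text):
    """Return (file name, delay) pairs in the order they appear in the log."""
    pairs = []
    for match in BACKOFF_RE.finditer(log_text):
        hours, minutes, seconds, name = match.groups()
        delay = timedelta(hours=int(hours), minutes=int(minutes), seconds=int(seconds))
        pairs.append((name, delay))
    return pairs
```

Applied to the log above, this yields two entries for the same ATLAS result file, with the delay growing from 00:54:55 to 01:31:27.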

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 790 Credit: 61,751,415 RAC: 50,771	Message 34004 - Posted: 21 Jan 2018, 10:32:01 UTC I am also getting 'Server disk full' error on sixtrack and Atlas tasks. ID: 34004 · Reply Quote

Nils Volunteer moderator Project administrator Project developer Project tester Send message Joined: 15 Jul 05 Posts: 251 Credit: 6,001,083 RAC: 30	Message 34022 - Posted: 22 Jan 2018, 9:34:47 UTC Our disk server again has problems cleaning up files behind the scenes. My own tasks uploaded correctly during the night; once any remaining half-uploaded files have been deleted, your tasks should finally upload too. We are sorry about these enduring problems. Please just be patient with transfers until we migrate to a new storage back-end. ID: 34022 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1923 Credit: 149,414,659 RAC: 145,247	Message 34024 - Posted: 22 Jan 2018, 10:15:42 UTC - in response to Message 34022.
...please just be patient with transfers until we migrate to a new storage back-end.
When will this take place? Wouldn't it be wise to reduce the number of distributed tasks until then, in order to alleviate the burden on the servers?
ID: 34024 · Reply Quote

Gunnar Hjern Send message Joined: 14 Jul 17 Posts: 7 Credit: 260,936 RAC: 0	Message 34025 - Posted: 22 Jan 2018, 10:49:31 UTC - in response to Message 34024.
It seems that the servers are down for the moment. I got the following in my log:
Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0
Mon 22 Jan 2018 11:11:34 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:11:34 AM CET | LHC@home | Temporarily failed upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0: transient HTTP error
Mon 22 Jan 2018 11:11:34 AM CET | LHC@home | Backing off 03:27:00 on upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
Mon 22 Jan 2018 11:11:36 AM CET | | Internet access OK - project servers may be temporarily down.
Mon 22 Jan 2018 11:11:50 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:11:50 AM CET | LHC@home | Temporarily failed upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0: transient HTTP error
Mon 22 Jan 2018 11:11:50 AM CET | LHC@home | Backing off 03:47:25 on upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0
Mon 22 Jan 2018 11:11:52 AM CET | | Internet access OK - project servers may be temporarily down.
Mon 22 Jan 2018 11:30:47 AM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0
Mon 22 Jan 2018 11:30:47 AM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:47 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0: transient HTTP error
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 05:40:47 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0: transient HTTP error
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 03:51:20 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:49 AM CET | | Internet access OK - project servers may be temporarily down.
//Gunnar ID: 34025 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2278 Credit: 178,775,457 RAC: 1,891	Message 34028 - Posted: 22 Jan 2018, 12:12:24 UTC Nils told us on 18/1/12 that the upgrade will be done in one or two months: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4572&postid=33809#33809 The team is doing its best to help us volunteers. ID: 34028 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2710 Credit: 291,971,137 RAC: 145,162	Message 34030 - Posted: 22 Jan 2018, 12:28:02 UTC @Nils and the team Are you sure it's not just a lack of available network sockets your servers are suffering from? ID: 34030 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1923 Credit: 149,414,659 RAC: 145,247	Message 34031 - Posted: 22 Jan 2018, 12:37:58 UTC - in response to Message 34028. Last modified: 22 Jan 2018, 12:38:23 UTC
Nils told us on 18/1/12 that the upgrade will be done in one or two months: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4572&postid=33809#33809 The team is doing its best to help us volunteers.
Well, while Nils was saying "Regarding I/O, our NFS server is now in better shape, so transfer problems should be mostly ironed out", this has unfortunately turned out not to be the case :-( The transfer problems are mostly the same as before.
ID: 34031 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1119 Credit: 10,387,066 RAC: 19,596	Message 34034 - Posted: 22 Jan 2018, 14:39:24 UTC - in response to Message 34025.
It seems that the servers are down for the moment. I got the following in my log:
Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
...
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 03:51:20 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:49 AM CET | | Internet access OK - project servers may be temporarily down.
//Gunnar
There's a rolling campaign of hypervisor, etc., upgrades/reboots at CERN at the moment (Meltdown/Spectre/what-have-you). The CMS@home WMAgent was affected briefly this morning, so maybe this was as well.
ID: 34034 · Reply Quote

Nils Volunteer moderator Project administrator Project developer Project tester Send message Joined: 15 Jul 05 Posts: 251 Credit: 6,001,083 RAC: 30	Message 34038 - Posted: 22 Jan 2018, 15:39:12 UTC Last modified: 22 Jan 2018, 15:45:59 UTC As part of our cleanup campaign, the BOINC antique_file_deleter made our NFS server hit its limit on the maximum number of open files. Now the NFS server should accept connections again. We are trying to debug this intermittent file upload issue. During our debugging, we will stop uploads for short periods. Files will eventually upload; please remain patient, and sorry for this. We will also have more server reboots over the next days, as Ivan mentions. ID: 34038 · Reply Quote

Nils Volunteer moderator Project administrator Project developer Project tester Send message Joined: 15 Jul 05 Posts: 251 Credit: 6,001,083 RAC: 30	Message 34045 - Posted: 23 Jan 2018, 9:25:12 UTC The underlying cause of the NFS server saturation is that files are left open when the BOINC file upload handler script times out. When a number of BOINC clients retry failed uploads frequently, the effect on our file servers is similar to a denial of service attack. It seems that our move to a load-balanced cluster some time back to increase capacity simply moved the bottleneck to the NFS storage layer. We will need to change our system architecture to get a permanent fix. ID: 34045 · Reply Quote
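To make the failure mode described above a little more concrete, here is a minimal Python sketch of the general pattern: a handler that is interrupted part-way through a write but still releases its file handle. It is not the actual BOINC upload handler; the function name is hypothetical, and a `None` chunk is just a stand-in for a transfer that times out.

```python
import os
import tempfile


def receive_upload(path, chunks):
    """Write chunks to path; return False if the transfer times out.

    A chunk of None is a hypothetical marker for a timed-out transfer.
    The finally clause closes the handle on every exit path, so an
    interrupted upload does not leave an open file behind.
    """
    handle = open(path, "wb")
    try:
        for chunk in chunks:
            if chunk is None:
                return False  # timed out: stop early, the handle is still closed
            handle.write(chunk)
        return True
    finally:
        handle.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "partial_result")
        ok = receive_upload(target, [b"abc", None, b"def"])
        print(ok, os.path.getsize(target))
```

The design point is only that cleanup lives in one place that runs whether the transfer completes or not; what to do with the partial file afterwards is a separate decision for the server side.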

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 215,197,406 RAC: 1,612	Message 34090 - Posted: 26 Jan 2018, 8:29:16 UTC Are you aware that the file server is currently again reporting:
LHC@home 26-01-2018 09:25 [error] Error reported by file upload server: [Lm9LDm3i3yrnDDn7oo6G73TpABFKDmABFKDmpdKKDmABFKDm0Izqbn_0_r1695784418_ATLAS_result] locked by file_upload_handler PID=-1
Supporting BOINC, a great concept ! ID: 34090 · Reply Quote
