Message boards : News : File upload issues
Joined: 15 Jun 08 · Posts: 2520 · Credit: 252,427,586 · RAC: 135,630
...Those "partly uploaded file", are they on my machine or on the server? It's the result file on your computer and (only partly) on the server. Your computer will automatically retry the upload until it's finished completely. ...Do I need to take any actions, or is the problem going to solve itself when the servers are less busy? No action required on your side, as Erich56 already stated. ...I currently have half a dozen or so tasks that are stuck in uploading state, and they represent If you abort it, the work will be lost. So ... ...As I can see on the server stat page, there are several thousands of items in the tasks and WU's Nothing to worry about. It's not more than an info for the server admins. ...Will this issue be solved by itself once they are crunched and validated? Violating the deadline is the only critical point. |
Joined: 14 Jul 17 · Posts: 7 · Credit: 260,936 · RAC: 0
Thanks for your explanations! Have a nice day!! /Gunnar
Joined: 16 Sep 17 · Posts: 100 · Credit: 1,618,469 · RAC: 0
Violating the deadline is the only critical point.
I'd like to know where this information comes from. If literally nothing changes, how is this critical?
Joined: 15 Jun 08 · Posts: 2520 · Credit: 252,427,586 · RAC: 135,630
Violating the deadline is the only critical point.
My comment only makes sense in relation to Gunnar's posts. A more precise explanation can be found in the BOINC documentation, e.g.:
https://boinc.berkeley.edu/trac/wiki/JobReplication
https://boinc.berkeley.edu/trac/wiki/ProjectOptions
Be aware that the JobReplication page explains it using "min_quorum = 2" and "target_nresults = 3", while the LHC projects use different values.
Results that are cancelled or reported after "client_deadline + grace_period" will never be rewarded. This can be seen in the project database as long as the records are available.
normal WU -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=83734849
WU with aborted task -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=83608902
WU with deadline violation -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82393750
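To make that timing rule concrete, here is a minimal Python sketch of the check described above. The grace-period value is an invented placeholder and the function is illustrative only; it is not the actual BOINC server code or LHC@home's real configuration.

```python
from datetime import datetime, timedelta

# Placeholder value; LHC@home's real <report_grace_period> is not stated in this thread.
REPORT_GRACE_PERIOD = timedelta(hours=24)

def can_still_be_rewarded(report_deadline: datetime,
                          received_time: datetime,
                          was_cancelled: bool) -> bool:
    """Sketch of the rule stated above: a result that was cancelled, or that
    arrives after client_deadline + grace_period, gets no credit."""
    if was_cancelled:
        return False
    return received_time <= report_deadline + REPORT_GRACE_PERIOD

# Example: a result reported 30 hours after its deadline misses a 24-hour grace period.
deadline = datetime(2018, 1, 5, 12, 0)
print(can_still_be_rewarded(deadline, deadline + timedelta(hours=30), False))  # False
```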
Joined: 16 Sep 17 · Posts: 100 · Credit: 1,618,469 · RAC: 0
Results that are cancelled or reported after "client_deadline + grace_period" will never be rewarded.
This is precisely what is in question. Where do you get this?

<report_grace_period>x</report_grace_period>
<grace_period_hours>x</grace_period_hours>
A "grace period" (in seconds or hours respectively) for task reporting. A task is considered timed out (and a new replica generated) if it is not reported by client_deadline + x.

... which does not suggest the initially missing result is precluded from validation if the task is replicated for a third wingman. I guess it depends on what the grace period is, but I'm pretty sure I have seen WUs finished by the timed-out result - not the recreated one. That would suggest LHC is set up like most projects, which accept and validate the first results to be returned.

If what you say is true, why wouldn't LHC abort/cancel the WU? I can only think of advantages to cancelling a WU that the project will not consider for validation. For one, the volunteer would stop clogging the servers, could get new work, would free up disk space, and so on.
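For comparison, the quoted ProjectOptions sentence by itself only says when a replacement task is generated. A minimal Python sketch of just that rule (grace-period value assumed) looks like this; whether the late original is still validated once it finally arrives is exactly the open question here and is not settled by this snippet.

```python
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(hours=24)  # assumed value of <report_grace_period>

def generates_new_replica(report_deadline: datetime,
                          received_time: Optional[datetime],
                          now: datetime) -> bool:
    """Per the quoted text: a task is considered timed out (and a new replica
    is generated) if it has not been reported by client_deadline + grace period."""
    return received_time is None and now > report_deadline + GRACE_PERIOD

# Example: 30 hours past the deadline with nothing reported -> a replica is created.
deadline = datetime(2018, 1, 5, 12, 0)
print(generates_new_replica(deadline, None, deadline + timedelta(hours=30)))  # True
```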
Joined: 2 May 07 · Posts: 2230 · Credit: 173,850,078 · RAC: 17,550
My computer was down for ten days, so the following SixTrack task missed its deadline of 18/1/5. In the LHC stats this task was marked as not finished, with the date 18/1/5. For me the best option was to delete the running task once the computer was back. But after I deleted this running task in BOINC, you can see that the deadline was refreshed to the time when I deleted the task.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82397136
It seems that LHC accepted the later time for finishing?
Joined: 2 Sep 04 · Posts: 455 · Credit: 200,411,535 · RAC: 51,563
My computer was down for ten days, so the following SixTrack task missed its deadline of 18/1/5.
Nope. The field first contains the deadline as long as the task has not come back, and then changes to the date when the client reports what has happened with the WU.
Supporting BOINC, a great concept!
Joined: 14 Jul 17 · Posts: 7 · Credit: 260,936 · RAC: 0
Hi!
A lot of upload-stuck tasks will soon hit their deadline and many hours of computer work will be wasted! :-(
Would it be possible for some sysadmin to manually erase those faulty file fragments on the server? For example with a command like:
> find /correct/path/ -size +220c -size -250c -mtime -20 -exec rm -f {} \;
(Afaik they are about 220 to 250 bytes, and they should be younger than 20 days. If some common substring "sub" of the file names is known, you can of course add -name "*sub*" to the find parameters.)
Not only would it save the work done by us clients, but I think it would also lessen the workload on the servers, as far fewer client computers would then keep retrying to upload the stuck files.
Have a nice day!!!
Kindest regards,
Gunnar Hjern
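An equivalent sketch in Python, using the same size and age criteria as the find command above; the path is still Gunnar's placeholder, and it defaults to a dry run so nothing is deleted until delete=True is passed.

```python
import time
from pathlib import Path

def clean_fragments(root, min_size=220, max_size=250, max_age_days=20, delete=False):
    """List (and optionally delete) small, recently modified files, mirroring
    find -size +220c -size -250c -mtime -20."""
    cutoff = time.time() - max_age_days * 86400
    for path in Path(root).rglob("*"):
        try:
            st = path.stat()
        except OSError:
            continue  # file vanished or is unreadable; skip it
        if path.is_file() and min_size < st.st_size < max_size and st.st_mtime > cutoff:
            print(path)
            if delete:
                path.unlink()

# Dry run first; pass delete=True only after checking the printed list.
clean_fragments("/correct/path/")
```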
Joined: 1 Sep 04 · Posts: 57 · Credit: 2,835,005 · RAC: 0
2018-01-20 10:51:51 PM | LHC@home | Started upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 10:55:54 PM | LHC@home | [error] Error reported by file upload server: can't write file h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: Disk quota exceeded
2018-01-20 10:55:54 PM | LHC@home | Temporarily failed upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: transient upload error
2018-01-20 10:55:54 PM | LHC@home | Backing off 00:54:55 on upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 11:47:49 PM | LHC@home | Started upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 11:47:51 PM | LHC@home | [error] Error reported by file upload server: Server is out of disk space
2018-01-20 11:47:51 PM | LHC@home | Temporarily failed upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: transient upload error
2018-01-20 11:47:51 PM | LHC@home | Backing off 01:31:27 on upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result

I don't think it should really take weeks to solve this kind of problem. Wake up!!
Joined: 28 Sep 04 · Posts: 722 · Credit: 48,414,670 · RAC: 27,571
I am also getting a 'Server disk full' error on SixTrack and ATLAS tasks.
Joined: 15 Jul 05 · Posts: 247 · Credit: 5,974,599 · RAC: 0
Our disk server again has problems cleaning up files behind the scenes. My own tasks uploaded correctly during the night; once any remaining half-uploaded files have been deleted, your tasks should finally upload too. We are sorry about these enduring problems. Please just be patient with transfers until we migrate to a new storage back-end.
Joined: 18 Dec 15 · Posts: 1788 · Credit: 117,626,501 · RAC: 80,924
...please just be patient with transfers until we migrate to a new storage back-end.
When will this take place? Wouldn't it be wise to reduce the number of distributed tasks until then, in order to alleviate the burden on the servers?
Joined: 14 Jul 17 · Posts: 7 · Credit: 260,936 · RAC: 0
It seems that the servers are down for the moment. I got the following in my log:
Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0
Mon 22 Jan 2018 11:11:34 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:11:34 AM CET | LHC@home | Temporarily failed upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0: transient HTTP error
Mon 22 Jan 2018 11:11:34 AM CET | LHC@home | Backing off 03:27:00 on upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
Mon 22 Jan 2018 11:11:36 AM CET | | Internet access OK - project servers may be temporarily down.
Mon 22 Jan 2018 11:11:50 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:11:50 AM CET | LHC@home | Temporarily failed upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0: transient HTTP error
Mon 22 Jan 2018 11:11:50 AM CET | LHC@home | Backing off 03:47:25 on upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0
Mon 22 Jan 2018 11:11:52 AM CET | | Internet access OK - project servers may be temporarily down.
Mon 22 Jan 2018 11:30:47 AM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0
Mon 22 Jan 2018 11:30:47 AM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:47 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0: transient HTTP error
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 05:40:47 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0: transient HTTP error
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 03:51:20 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:49 AM CET | | Internet access OK - project servers may be temporarily down.
//Gunnar
Joined: 2 May 07 · Posts: 2230 · Credit: 173,850,078 · RAC: 17,550
Nils told us on 18/1/12 that the upgrade will be done in one or two months:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4572&postid=33809#33809
The team does its best to help us volunteers.
Joined: 15 Jun 08 · Posts: 2520 · Credit: 252,427,586 · RAC: 135,630
@Nils and the team
Are you sure it's not just a lack of available network sockets your servers are suffering from?
Joined: 18 Dec 15 · Posts: 1788 · Credit: 117,626,501 · RAC: 80,924
Nils told us on 18/1/12 that the upgrade will be done in one or two months:
Well, while Nils was saying "Regarding I/O, our NFS server is now in better shape, so transfer problems should be mostly ironed out", this, unfortunately, has turned out not to be the case :-( The transfer problems are mostly the same as before.
Joined: 29 Aug 05 · Posts: 1056 · Credit: 7,686,131 · RAC: 6,852
It seems that the servers are down for the moment.
There's a rolling campaign of hypervisor, etc., upgrades/reboots at CERN at the moment (Meltdown/Spectre/what-have-you). The CMS@home WMAgent was affected briefly this morning, so maybe this was as well.
Joined: 15 Jul 05 · Posts: 247 · Credit: 5,974,599 · RAC: 0
As part of our cleanup campaign, the BOINC antique_file_deleter made our NFS server hit the limit on the maximum number of open files. Now the NFS server should accept connections again. We are trying to debug this intermittent file upload issue. During our debugging, we will stop uploads for short periods. Files will eventually upload; please remain patient, and sorry for this. We will also have more server reboots over the next few days, as Ivan mentions.
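For anyone curious how such a limit shows up on a Linux box, here is a small illustrative Python check of the kernel's system-wide file-handle counters. The /proc interface is standard Linux; whether this particular counter corresponds to the exact limit the LHC@home NFS server hit is an assumption.

```python
def file_handle_usage(path="/proc/sys/fs/file-nr"):
    """Return (allocated, unused, maximum) file handles as reported by the kernel."""
    with open(path) as f:
        allocated, unused, maximum = (int(x) for x in f.read().split())
    return allocated, unused, maximum

if __name__ == "__main__":
    allocated, unused, maximum = file_handle_usage()
    print(f"{allocated} of {maximum} file handles in use ({unused} allocated but unused)")
```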
Joined: 15 Jul 05 · Posts: 247 · Credit: 5,974,599 · RAC: 0
The underlying cause of the NFS server saturation is that files are left open when the BOINC file upload handler script times out. When a number of BOINC clients retry failed uploads frequently, the effect on our file servers is similar to a denial-of-service attack. It seems that our move to a load-balanced cluster some time back to increase capacity simply moved the bottleneck to the NFS storage layer. We will need to change our system architecture to get a permanent fix.
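As an illustration of how one might spot that kind of leak, here is a minimal Python sketch that counts open file descriptors per process by scanning /proc (Linux, best run as root). It is a generic diagnostic, not LHC@home's actual tooling, and the optional name filter is just an example.

```python
import os

def open_fd_counts(name_filter=""):
    """Return (pid, command, number of open fds) for matching processes,
    sorted so the heaviest users come first."""
    rows = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            if name_filter and name_filter not in comm:
                continue
            nfds = len(os.listdir(f"/proc/{pid}/fd"))
        except (FileNotFoundError, PermissionError):
            continue  # process exited or fds are not readable; skip it
        rows.append((int(pid), comm, nfds))
    return sorted(rows, key=lambda r: r[2], reverse=True)

if __name__ == "__main__":
    for pid, comm, nfds in open_fd_counts()[:10]:
        print(f"{pid:>7} {comm:<20} {nfds} open fds")
```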
Joined: 2 Sep 04 · Posts: 455 · Credit: 200,411,535 · RAC: 51,563
|