Message boards : News : File upload issues

Erich56
Joined: 18 Dec 15 · Posts: 629 · Credit: 4,454,205 · RAC: 6,562
Message 33897 - Posted: 17 Jan 2018, 19:16:00 UTC - in response to Message 33895.  

Do I need to take any action, or is the problem going to solve itself when the servers are less busy?
There is nothing you can do other than wait and hope that the tasks will be uploaded before the deadline expires.

AuxRx
Joined: 16 Sep 17 · Posts: 67 · Credit: 540,945 · RAC: 610
Message 33898 - Posted: 17 Jan 2018, 20:02:56 UTC - in response to Message 33897.  

This answer needs clarification. Most tasks have a chance of being returned and validated with credit even after the deadline has passed. The first (or first two, depending on the quorum) results to be *returned* will receive credit, regardless of deadline.

Therefore the answer should be that nothing can be done other than hoping they will upload before the minimum quorum has been reached.
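
To illustrate for anyone unfamiliar with BOINC replication, that scheme can be sketched in Python like this (a toy model only, not LHC@home server code; the field names and the quorum default are assumptions):

def credited_results(returned_results, min_quorum=2):
    # "Matching" here just means the result agrees with the others;
    # real validation compares the result files themselves.
    matching = [r for r in returned_results if r["matches"]]
    if len(matching) < min_quorum:
        return []        # quorum not reached yet: keep waiting
    return matching      # every matching *returned* result gets credit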

computezrmle
Joined: 15 Jun 08 · Posts: 436 · Credit: 4,492,343 · RAC: 10,881
Message 33899 - Posted: 17 Jan 2018, 20:06:23 UTC - in response to Message 33895.  

...Those "partly uploaded file", are they on my machine or on the server?

It's the result file on your computer and (only partly) on the server.
Your computer will automatically retry the upload until it's finished completely.

...Do I need to take any action, or is the problem going to solve itself when the servers are less busy?

No action required on your side, as Erich56 already stated.

...I currently have half a dozen or so tasks that are stuck in uploading state, and they represent
together several days of hard computing so I'd hate to have to abort them! :-(

If you abort them, the work will be lost. So ...

...As I can see on the server stat page, there are several thousands of items in the tasks and WU's
"waiting for deletion" queues, and a whopping 768973 tasks to send!! :-O

Nothing to worry about. It's no more than information for the server admins.

...Will this issue be solved by itself once they are crunched and validated?
(hopefully before the deadlines expire)

Missing the deadline is the only critical point.
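
Regarding the automatic retry mentioned above: the client waits a growing, randomized back-off interval between attempts, roughly like this Python sketch (illustrative constants only; the real BOINC client uses its own back-off parameters):

import random
import time

def upload_with_backoff(try_upload, max_backoff_s=4 * 3600):
    """Retry a transient-failed upload until it succeeds, sleeping a
    randomized, exponentially growing interval between attempts."""
    attempt = 0
    while not try_upload():      # try_upload returns True on success
        attempt += 1
        delay = min(max_backoff_s, 60 * 2 ** attempt)
        time.sleep(delay * random.uniform(0.5, 1.0))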

Gunnar Hjern
Joined: 14 Jul 17 · Posts: 7 · Credit: 257,743 · RAC: 1,226
Message 33902 - Posted: 17 Jan 2018, 21:16:49 UTC - in response to Message 33899.  

Thanks for your explanations!
Have a nice day!!
/Gunnar

AuxRx
Joined: 16 Sep 17 · Posts: 67 · Credit: 540,945 · RAC: 610
Message 33903 - Posted: 17 Jan 2018, 22:57:02 UTC - in response to Message 33899.  

Missing the deadline is the only critical point.


I'd like to know where this information comes from. If literally nothing changes, what makes this deadline critical?

computezrmle
Joined: 15 Jun 08 · Posts: 436 · Credit: 4,492,343 · RAC: 10,881
Message 33906 - Posted: 18 Jan 2018, 10:21:49 UTC - in response to Message 33903.  

Missing the deadline is the only critical point.


I'd like to know where this information comes from. If literally nothing changes, what makes this deadline critical?

My comment only makes sense in the context of Gunnar's posts.

A more precise explanation can be found in the BOINC documentation, e.g.:
https://boinc.berkeley.edu/trac/wiki/JobReplication
https://boinc.berkeley.edu/trac/wiki/ProjectOptions

Be aware that the JobReplication page explains it using "min_quorum = 2" and "target_nresults = 3" while LHC projects use different values.

Results that are cancelled or reported after "client_deadline + grace_period" will never be rewarded.
This can be seen in the project database as long as the records are available.
normal WU -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=83734849
WU with aborted task -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=83608902
WU with deadline violation -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82393750
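
Stated as a predicate, the rule above reads like this (a Python sketch of the claim as written, with illustrative names; not confirmed BOINC server code):

from datetime import datetime, timedelta

def rewardable(report_time: datetime, client_deadline: datetime,
               grace_period: timedelta) -> bool:
    # The rule as stated: a result reported after
    # client_deadline + grace_period is never rewarded.
    return report_time <= client_deadline + grace_period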

AuxRx
Joined: 16 Sep 17 · Posts: 67 · Credit: 540,945 · RAC: 610
Message 33910 - Posted: 18 Jan 2018, 13:57:58 UTC - in response to Message 33906.  

Results that are cancelled or reported after "client_deadline + grace_period" will never be rewarded.


Precisely this is in question. Where did you get this?

<report_grace_period>x</report_grace_period>
<grace_period_hours>x</grace_period_hours>
A "grace period" (in seconds or hours respectively) for task reporting. A task is considered time-out (and a new replica generated) if it is not reported by client_deadline + x.

... does not suggest the initially missing result is precluded from validation if the task is replicated for a third wingman.

I guess it depends on what the grace period is, but I'm pretty sure I have seen WUs finished by the timed-out result, not the recreated replica. That would suggest LHC is set up like most projects, which accept and validate the first results to be returned.

If what you say is true, why wouldn't LHC abort/cancel the WU? I can only think of advantages to cancelling a WU that the project will not consider for validation. For one, the volunteer would stop clogging the servers, could get new work, would free disk space, and so on.
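
For comparison, the quoted documentation rule by itself only triggers replication, which can be sketched like this (illustrative names again; note that the rule says nothing about whether a late report may still validate):

from datetime import datetime, timedelta
from typing import Optional

def timed_out(received: Optional[datetime], client_deadline: datetime,
              grace_period: timedelta, now: datetime) -> bool:
    # Quoted rule: a task is timed out (and a new replica generated)
    # if it has not been reported by client_deadline + grace_period.
    return received is None and now > client_deadline + grace_period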

maeax
Joined: 2 May 07 · Posts: 301 · Credit: 12,625,969 · RAC: 3,162
Message 33911 - Posted: 18 Jan 2018, 16:40:29 UTC

My computer was down for ten days, so the following SixTrack task went past its deadline on 18/1/5.
In the LHC stats this task was marked as not finished, with the date 18/1/5.
Once the computer was back, aborting the running task seemed the best option for me.
But after I aborted this running task in BOINC, you can see that the deadline was refreshed to the time when I aborted the task:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82397136
It seems that LHC accepted the later time for finishing?

Yeti (Volunteer moderator)
Joined: 2 Sep 04 · Posts: 315 · Credit: 43,707,464 · RAC: 34,923
Message 33912 - Posted: 18 Jan 2018, 16:44:51 UTC - in response to Message 33911.  

My computer was down for ten days, so the following SixTrack task went past its deadline on 18/1/5.
In the LHC stats this task was marked as not finished, with the date 18/1/5.
Once the computer was back, aborting the running task seemed the best option for me.
But after I aborted this running task in BOINC, you can see that the deadline was refreshed to the time when I aborted the task:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82397136
It seems that LHC accepted the later time for finishing?

Nope

The field contains the deadline as long as the task has not come back, and then changes to the date when the client reports what has happened with the WU.
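
In other words, the page computes the displayed value roughly like this (a sketch of the behaviour described above, with illustrative field names):

def displayed_date(task):
    # Show the deadline until the client has responded, then show the
    # time the client reported what happened with the WU.
    return task.received_time if task.received_time else task.deadline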


Supporting BOINC, a great concept!

entigy
Joined: 24 Oct 04 · Posts: 5 · Credit: 122,339 · RAC: 464
Message 33925 - Posted: 19 Jan 2018, 7:55:39 UTC

Sigh.

19/01/2018 07:52:04 | LHC@home | Started upload of 3JHODmR7WwrnDDn7oo6G73TpABFKDmABFKDmMoKKDmABFKDmlYOlVm_0_r1334438241_ATLAS_result
19/01/2018 07:52:18 | LHC@home | [error] Error reported by file upload server: [3JHODmR7WwrnDDn7oo6G73TpABFKDmABFKDmMoKKDmABFKDmlYOlVm_0_r1334438241_ATLAS_result] locked by file_upload_handler PID=-1
19/01/2018 07:52:18 | LHC@home | Temporarily failed upload of 3JHODmR7WwrnDDn7oo6G73TpABFKDmABFKDmMoKKDmABFKDmlYOlVm_0_r1334438241_ATLAS_result: transient upload error
19/01/2018 07:52:18 | LHC@home | Backing off 04:36:31 on upload of 3JHODmR7WwrnDDn7oo6G73TpABFKDmABFKDmMoKKDmABFKDmlYOlVm_0_r1334438241_ATLAS_result

Gunnar Hjern
Joined: 14 Jul 17 · Posts: 7 · Credit: 257,743 · RAC: 1,226
Message 33975 - Posted: 20 Jan 2018, 16:00:26 UTC - in response to Message 33874.  

Hi!

A lot of upload-stuck tasks will soon hit their deadlines, and many hours of computer work will be wasted! :-(

Would it be possible for some sysadmin to manually erase those faulty file fragments on the server?
For example with some command like:
> find /correct/path/ -type f -size +220c -size -250c -mtime -20 -exec rm -f {} \;

(Afaik they are about 220 to 250 bytes, and they should be younger than 20 days.
If some common substring "sub" of the file names is known, you can of course add -name "*sub*" to the find parameters, and running it with -print first instead of -exec rm -f shows what would be deleted.)

Not only would it save the work done by us clients, but I think it would lessen the load on the servers too, as far fewer client computers would then keep retrying to upload the stuck files.

Have a nice day!!!

Kindest regards,
Gunnar Hjern

grumpy
Joined: 1 Sep 04 · Posts: 52 · Credit: 1,558,351 · RAC: 1,122
Message 33993 - Posted: 21 Jan 2018, 4:57:26 UTC

2018-01-20 10:51:51 PM | LHC@home | Started upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 10:55:54 PM | LHC@home | [error] Error reported by file upload server: can't write file h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: Disk quota exceeded
2018-01-20 10:55:54 PM | LHC@home | Temporarily failed upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: transient upload error
2018-01-20 10:55:54 PM | LHC@home | Backing off 00:54:55 on upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 11:47:49 PM | LHC@home | Started upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 11:47:51 PM | LHC@home | [error] Error reported by file upload server: Server is out of disk space
2018-01-20 11:47:51 PM | LHC@home | Temporarily failed upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: transient upload error
2018-01-20 11:47:51 PM | LHC@home | Backing off 01:31:27 on upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result


I don't think it should really take weeks to solve this kind of problem.
Wake up!!

Harri Liljeroos
Joined: 28 Sep 04 · Posts: 251 · Credit: 7,342,289 · RAC: 10,601
Message 34004 - Posted: 21 Jan 2018, 10:32:01 UTC

I am also getting the 'server disk full' error on SixTrack and ATLAS tasks.

Nils Høimyr (Volunteer moderator · Project administrator · Project developer · Project tester)
Joined: 15 Jul 05 · Posts: 158 · Credit: 1,613,374 · RAC: 1,638
Message 34022 - Posted: 22 Jan 2018, 9:34:47 UTC

Our disk server again has problems cleaning up files behind the scenes. My own tasks uploaded correctly during the night; once any remaining half-uploaded files have been deleted, your tasks should finally upload too.

We are sorry about these enduring problems; please be patient with transfers until we migrate to a new storage back-end.

Erich56
Joined: 18 Dec 15 · Posts: 629 · Credit: 4,454,205 · RAC: 6,562
Message 34024 - Posted: 22 Jan 2018, 10:15:42 UTC - in response to Message 34022.  

...please be patient with transfers until we migrate to a new storage back-end.
When will this take place?

Wouldn't it be wise to reduce the number of distributed tasks until then, in order to alleviate the burden on the servers?

Gunnar Hjern
Joined: 14 Jul 17 · Posts: 7 · Credit: 257,743 · RAC: 1,226
Message 34025 - Posted: 22 Jan 2018, 10:49:31 UTC - in response to Message 34024.  

It seems that the servers are down for the moment.
I got the following in my log:

Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0
Mon 22 Jan 2018 11:11:34 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:11:34 AM CET | LHC@home | Temporarily failed upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0: transient HTTP error
Mon 22 Jan 2018 11:11:34 AM CET | LHC@home | Backing off 03:27:00 on upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
Mon 22 Jan 2018 11:11:36 AM CET | | Internet access OK - project servers may be temporarily down.
Mon 22 Jan 2018 11:11:50 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:11:50 AM CET | LHC@home | Temporarily failed upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0: transient HTTP error
Mon 22 Jan 2018 11:11:50 AM CET | LHC@home | Backing off 03:47:25 on upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0
Mon 22 Jan 2018 11:11:52 AM CET | | Internet access OK - project servers may be temporarily down.
Mon 22 Jan 2018 11:30:47 AM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0
Mon 22 Jan 2018 11:30:47 AM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:47 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0: transient HTTP error
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 05:40:47 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0: transient HTTP error
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 03:51:20 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:49 AM CET | | Internet access OK - project servers may be temporarily down.

//Gunnar

maeax
Joined: 2 May 07 · Posts: 301 · Credit: 12,625,969 · RAC: 3,162
Message 34028 - Posted: 22 Jan 2018, 12:12:24 UTC

Nils told us on 18/1/12 that the upgrade will be done in one or two months:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4572&postid=33809#33809

The team is doing its best to help us volunteers.

computezrmle
Joined: 15 Jun 08 · Posts: 436 · Credit: 4,492,343 · RAC: 10,881
Message 34030 - Posted: 22 Jan 2018, 12:28:02 UTC

@Nils and the team

Are you sure it's not just a lack of available network sockets your servers are suffering from?
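
One quick way to check that on a Linux server (a diagnostic sketch assuming shell access to the host, so nothing we volunteers can run ourselves) would be to count TCP sockets per state; a very large TIME_WAIT count would hint at socket exhaustion:

from collections import Counter

STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}

def tcp_state_counts(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)                   # skip the header line
        for line in f:
            st = line.split()[3]  # 4th column: hex connection state
            counts[STATES.get(st, st)] += 1
    return counts

print(tcp_state_counts())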

Erich56
Joined: 18 Dec 15 · Posts: 629 · Credit: 4,454,205 · RAC: 6,562
Message 34031 - Posted: 22 Jan 2018, 12:37:58 UTC - in response to Message 34028.  
Last modified: 22 Jan 2018, 12:38:23 UTC

Nils told us on 18/1/12 that the upgrade will be done in one or two months:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4572&postid=33809#33809

The team is doing its best to help us volunteers.
Well, while Nils said

"Regarding I/O, our NFS server is now in better shape, so transfer problems should be mostly ironed out"

this, unfortunately, has turned out not to be the case :-(
The transfer problems are mostly the same as before.

ivan (Volunteer moderator · Project tester · Volunteer developer · Volunteer tester · Project scientist)
Joined: 29 Aug 05 · Posts: 427 · Credit: 2,851,186 · RAC: 5,603
Message 34034 - Posted: 22 Jan 2018, 14:39:24 UTC - in response to Message 34025.  

It seems that the servers are down for the moment.
I got the following in my log:

Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
...
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 03:51:20 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:49 AM CET | | Internet access OK - project servers may be temporarily down.

//Gunnar

There's a rolling campaign of hypervisor, etc., upgrades/reboots at CERN at the moment (Meltdown/Spectre/what-have-you). The CMS@home WMAgent was affected briefly this morning, so maybe this was as well.