Message boards : News : File upload issues
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2488
Credit: 247,533,973
RAC: 121,894
Message 33899 - Posted: 17 Jan 2018, 20:06:23 UTC - in response to Message 33895.  

...Those "partly uploaded file", are they on my machine or on the server?

It's the result file on your computer and (only partly) on the server.
Your computer will automatically retry the upload until it's finished completely.

...Do I need to take any actions, or is the problem going to solve itself when the servers are less busy?

No action required on your side, as Erich56 already stated.

...I currently have half a dozen or so tasks that are stuck in uploading state, and they represent
together several days of hard computing so I'd hate to have to abort them! :-(

If you abort it, the work will be lost. So ...

...As I can see on the server stat page, there are several thousands of items in the tasks and WU's
"waiting for deletion" queues, and a whopping 768973 tasks to send!! :-O

Nothing to worry about. It's just information for the server admins.

...Will this issue resolve itself once they are crunched and validated?
(hopefully before the deadlines expire)

Violating the deadline is the only critical point.
ID: 33899
Gunnar Hjern
Joined: 14 Jul 17
Posts: 7
Credit: 260,936
RAC: 0
Message 33902 - Posted: 17 Jan 2018, 21:16:49 UTC - in response to Message 33899.  

Thanks for your explanations!
Have a nice day!!
/Gunnar
ID: 33902
AuxRx
Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33903 - Posted: 17 Jan 2018, 22:57:02 UTC - in response to Message 33899.  

Violating the deadline is the only critical point.


I'd like to know where this information is coming from. If literally nothing changes, how is this mark critical?
ID: 33903
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2488
Credit: 247,533,973
RAC: 121,894
Message 33906 - Posted: 18 Jan 2018, 10:21:49 UTC - in response to Message 33903.  

Violating the deadline is the only critical point.


I'd like to know where this information is coming from. If literally nothing changes, how is this mark critical?

My comment only makes sense in relation to Gunnar's posts.

A more precise explanation can be found in the BOINC documentation, e.g.:
https://boinc.berkeley.edu/trac/wiki/JobReplication
https://boinc.berkeley.edu/trac/wiki/ProjectOptions

Be aware that the JobReplication page explains it using "min_quorum = 2" and "target_nresults = 3" while LHC projects use different values.

Results that are cancelled or reported after "client_deadline + grace_period" will never be rewarded.
This can be seen in the project database as long as the records are available.
normal WU -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=83734849
WU with aborted task -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=83608902
WU with deadline violation -> https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82393750
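
The cutoff mentioned above ("client_deadline + grace_period") can be sketched in a few lines. This is only an illustration of the arithmetic, not BOINC server code: the function names and the example values are hypothetical, and the actual grace period and validation policy depend on the project's configuration.

```python
from datetime import datetime, timedelta, timezone


def report_cutoff(client_deadline: datetime, grace_period: timedelta) -> datetime:
    """Latest time at which a result still counts as reported in time."""
    return client_deadline + grace_period


def is_reported_in_time(reported_at: datetime,
                        client_deadline: datetime,
                        grace_period: timedelta) -> bool:
    """True if the result arrived no later than client_deadline + grace_period."""
    return reported_at <= report_cutoff(client_deadline, grace_period)


# Hypothetical values, chosen only for illustration.
deadline = datetime(2018, 1, 5, 12, 0, tzinfo=timezone.utc)
grace = timedelta(hours=24)

on_time = is_reported_in_time(
    datetime(2018, 1, 6, 11, 0, tzinfo=timezone.utc), deadline, grace)
too_late = is_reported_in_time(
    datetime(2018, 1, 18, 16, 0, tzinfo=timezone.utc), deadline, grace)
print(on_time, too_late)
```

Timezone-aware datetimes are used so that the comparison does not depend on the local clock settings of whoever runs the sketch.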
ID: 33906
AuxRx
Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33910 - Posted: 18 Jan 2018, 13:57:58 UTC - in response to Message 33906.  

Results that are cancelled or reported after "client_deadline + grace_period" will never be rewarded.


This precisely is in question. Where do you get this?

<report_grace_period>x</report_grace_period>
<grace_period_hours>x</grace_period_hours>
A "grace period" (in seconds or hours respectively) for task reporting. A task is considered time-out (and a new replica generated) if it is not reported by client_deadline + x.

... does not suggest the initially missing result is precluded from validation if the task is replicated for a third wingman.

I guess it depends on what the grace period is, but I'm pretty sure I have seen WUs finished by the timed-out result, not the recreated WU. That would suggest LHC is set up like most projects, which is to accept and validate the first results to be returned.

If what you say is true, why wouldn't LHC abort/cancel the WU? I can only think of advantages to cancelling a WU that the project will not consider for validation. For one, the volunteer would stop clogging the servers, could get new work, would free disk space, and so on.
ID: 33910
maeax
Joined: 2 May 07
Posts: 2184
Credit: 172,751,088
RAC: 42,089
Message 33911 - Posted: 18 Jan 2018, 16:40:29 UTC

My computer was down for ten days, so the following SixTrack task passed its deadline on 18/1/5.
In the LHC stats, this task was marked as not finished, with the date 18/1/5.
For me, the best option was to delete the running task once the computer was back.
But after I deleted this running task in BOINC, you can see that the deadline was updated to the time when I deleted the task.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82397136
It seems that LHC has accepted the later time for finishing?
ID: 33911
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 198,097,514
RAC: 88,596
Message 33912 - Posted: 18 Jan 2018, 16:44:51 UTC - in response to Message 33911.  

My computer was down for ten days, so the following SixTrack task passed its deadline on 18/1/5.
In the LHC stats, this task was marked as not finished, with the date 18/1/5.
For me, the best option was to delete the running task once the computer was back.
But after I deleted this running task in BOINC, you can see that the deadline was updated to the time when I deleted the task.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=82397136
It seems that LHC has accepted the later time for finishing?

Nope

The field shows the deadline as long as the task has not come back; it then changes to the date when the client reports what happened with the WU.


Supporting BOINC, a great concept !
ID: 33912
Gunnar Hjern
Joined: 14 Jul 17
Posts: 7
Credit: 260,936
RAC: 0
Message 33975 - Posted: 20 Jan 2018, 16:00:26 UTC - in response to Message 33874.  

Hi!

A lot of upload-stuck tasks will soon hit their deadlines, and many hours of computer work will be wasted! :-(

Would it be possible for some sys-admin to manually erase those faulty file fragments on the server?
For example with some command like:
> find /correct/path/ -size +220c -size -250c -mtime -20 -exec rm -f {} \;

(Afaik they are about 220 to 250 bytes, and they should be younger than 20 days.
If some common substring "sub" of the file names is known, you can of course add -name "*sub*" to the params for find.)
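
Before running a deleting command like the one above, it may be safer to list the candidates first. The following is a minimal dry-run sketch of the same selection, not a tested admin tool: the function name and its parameters are hypothetical, the strict bounds follow how find treats "+220c", "-250c" and "-mtime -20", and restricting the search to regular files is an extra safety assumption that the original command does not make.

```python
import os
import time
from pathlib import Path


def find_fragment_candidates(root, min_size=220, max_size=250,
                             max_age_days=20, name_part=None):
    """Return files under root inside the size/age window; nothing is deleted."""
    now = time.time()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if name_part is not None and name_part not in filename:
                continue
            path = Path(dirpath) / filename
            if path.is_symlink() or not path.is_file():
                continue  # assumption: only regular files are of interest
            try:
                info = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            age_days = (now - info.st_mtime) / 86400
            # Strict comparisons: larger than min_size, smaller than max_size,
            # and modified less than max_age_days ago.
            if min_size < info.st_size < max_size and age_days < max_age_days:
                matches.append(path)
    return sorted(matches)
```

Calling `find_fragment_candidates("/correct/path/")` only returns the list, so an admin could review it before deciding whether anything should be removed.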

Not only would it save the work done by us clients, but I think it would also lessen the workload of the servers, as far fewer client computers would then keep retrying to upload the stuck files.

Have a nice day!!!

Kindest regards,
Gunnar Hjern
ID: 33975
grumpy
Joined: 1 Sep 04
Posts: 57
Credit: 2,835,005
RAC: 0
Message 33993 - Posted: 21 Jan 2018, 4:57:26 UTC

2018-01-20 10:51:51 PM | LHC@home | Started upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 10:55:54 PM | LHC@home | [error] Error reported by file upload server: can't write file h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: Disk quota exceeded
2018-01-20 10:55:54 PM | LHC@home | Temporarily failed upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: transient upload error
2018-01-20 10:55:54 PM | LHC@home | Backing off 00:54:55 on upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 11:47:49 PM | LHC@home | Started upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
2018-01-20 11:47:51 PM | LHC@home | [error] Error reported by file upload server: Server is out of disk space
2018-01-20 11:47:51 PM | LHC@home | Temporarily failed upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result: transient upload error
2018-01-20 11:47:51 PM | LHC@home | Backing off 01:31:27 on upload of h4SMDmPA9vrnDDn7oo6G73TpABFKDmABFKDmsvFKDmABFKDm10zPwn_0_r849762899_ATLAS_result
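
For anyone who wants to see how much waiting such a log adds up to, here is a small sketch that sums the "Backing off HH:MM:SS" delays per file. It assumes only the message format visible in the lines above; the function name is hypothetical and the file name in the sample is a shortened placeholder.

```python
import re
from datetime import timedelta

# Matches e.g. "Backing off 00:54:55 on upload of <file name>"
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def backoff_per_file(log_lines):
    """Sum the 'Backing off HH:MM:SS' delays per uploaded file name."""
    totals = {}
    for line in log_lines:
        match = BACKOFF_RE.search(line)
        if match is None:
            continue  # not a back-off message
        hours, minutes, seconds, filename = match.groups()
        delay = timedelta(hours=int(hours), minutes=int(minutes),
                          seconds=int(seconds))
        totals[filename] = totals.get(filename, timedelta()) + delay
    return totals


sample = [
    "2018-01-20 10:55:54 PM | LHC@home | Backing off 00:54:55 on upload of example_result",
    "2018-01-20 11:47:51 PM | LHC@home | Backing off 01:31:27 on upload of example_result",
]
totals = backoff_per_file(sample)
print(totals)
```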


I don't think it really takes weeks to solve this kind of problem.
Wake up!!
ID: 33993
Harri Liljeroos
Joined: 28 Sep 04
Posts: 707
Credit: 47,240,852
RAC: 29,709
Message 34004 - Posted: 21 Jan 2018, 10:32:01 UTC

I am also getting 'Server disk full' errors on SixTrack and ATLAS tasks.
ID: 34004
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester
Joined: 15 Jul 05
Posts: 246
Credit: 5,974,599
RAC: 0
Message 34022 - Posted: 22 Jan 2018, 9:34:47 UTC

Our disk server again has problems cleaning up files behind the scenes. My own tasks uploaded correctly during the night. Once any remaining half-uploaded files have been deleted, your tasks should finally upload too.

We are sorry about these enduring problems. Please just be patient with transfers until we migrate to a new storage back-end.
ID: 34022
Erich56
Joined: 18 Dec 15
Posts: 1737
Credit: 114,773,591
RAC: 91,362
Message 34024 - Posted: 22 Jan 2018, 10:15:42 UTC - in response to Message 34022.  

...please just be patient with transfers until we migrate to a new storage back-end.
When will this take place?

Wouldn't it be wise to reduce the number of distributed tasks until then, in order to alleviate the burden on the servers?
ID: 34024
Gunnar Hjern
Joined: 14 Jul 17
Posts: 7
Credit: 260,936
RAC: 0
Message 34025 - Posted: 22 Jan 2018, 10:49:31 UTC - in response to Message 34024.  

It seems that the servers are down for the moment.
I got the following in my log:

Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0
Mon 22 Jan 2018 11:11:34 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:11:34 AM CET | LHC@home | Temporarily failed upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0: transient HTTP error
Mon 22 Jan 2018 11:11:34 AM CET | LHC@home | Backing off 03:27:00 on upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
Mon 22 Jan 2018 11:11:36 AM CET | | Internet access OK - project servers may be temporarily down.
Mon 22 Jan 2018 11:11:50 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:11:50 AM CET | LHC@home | Temporarily failed upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0: transient HTTP error
Mon 22 Jan 2018 11:11:50 AM CET | LHC@home | Backing off 03:47:25 on upload of w-c1_job.B1inj_c1.2158__42__s__64.28_59.31__13.1_14.1__6__82.5_1_sixvf_boinc44247_0_r387987546_0
Mon 22 Jan 2018 11:11:52 AM CET | | Internet access OK - project servers may be temporarily down.
Mon 22 Jan 2018 11:30:47 AM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0
Mon 22 Jan 2018 11:30:47 AM CET | LHC@home | Started upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:47 AM CET | | Project communication failed: attempting access to reference site
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0: transient HTTP error
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 05:40:47 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__2.2_2.3__5__55.5_1_sixvf_boinc78035_1_r2013299009_0
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Temporarily failed upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0: transient HTTP error
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 03:51:20 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:49 AM CET | | Internet access OK - project servers may be temporarily down.

//Gunnar
ID: 34025
maeax
Joined: 2 May 07
Posts: 2184
Credit: 172,751,088
RAC: 42,089
Message 34028 - Posted: 22 Jan 2018, 12:12:24 UTC

Nils told us on 18/1/12 that the upgrade will be done in one or two months:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4572&postid=33809#33809

The team is doing its best to help us volunteers.
ID: 34028
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2488
Credit: 247,533,973
RAC: 121,894
Message 34030 - Posted: 22 Jan 2018, 12:28:02 UTC

@Nils and the team

Are you sure it's not just a lack of available network sockets your servers are suffering from?
ID: 34030
Erich56
Joined: 18 Dec 15
Posts: 1737
Credit: 114,773,591
RAC: 91,362
Message 34031 - Posted: 22 Jan 2018, 12:37:58 UTC - in response to Message 34028.  
Last modified: 22 Jan 2018, 12:38:23 UTC

Nils told us on 18/1/12 that the upgrade will be done in one or two months:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4572&postid=33809#33809

The team is doing its best to help us volunteers.
Well, while Nils was saying

"Regarding I/O, our NFS server is now in better shape, so transfer problems should be mostly ironed out"

this, unfortunately, has turned out not to be the case :-(
The transfer problems are mostly the same as before.
ID: 34031
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1045
Credit: 7,438,036
RAC: 7,948
Message 34034 - Posted: 22 Jan 2018, 14:39:24 UTC - in response to Message 34025.  

It seems that the servers are down for the moment.
I got the following in my log:

Mon 22 Jan 2018 11:09:33 AM CET | LHC@home | Started upload of w-c9_job.B1inj_c9.2158__26__s__64.28_59.31__16.1_17.1__6__60_1_sixvf_boinc27416_0_r905081043_0
...
Mon 22 Jan 2018 11:32:47 AM CET | LHC@home | Backing off 03:51:20 on upload of LHC_2015_LHC_2015_234_BOINC_errors__23__s__62.31_60.32__5.7_5.8__5__42_1_sixvf_boinc80091_0_r419032409_0
Mon 22 Jan 2018 11:32:49 AM CET | | Internet access OK - project servers may be temporarily down.

//Gunnar

There's a rolling campaign of hypervisor, etc., upgrades/reboots at CERN at the moment (Meltdown/Spectre/what-have-you). The CMS@home WMAgent was affected briefly this morning, so maybe this was as well.
ID: 34034
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester
Joined: 15 Jul 05
Posts: 246
Credit: 5,974,599
RAC: 0
Message 34038 - Posted: 22 Jan 2018, 15:39:12 UTC
Last modified: 22 Jan 2018, 15:45:59 UTC

As part of our cleanup campaign, the BOINC antique_file_deleter made our NFS server hit its limit on the maximum number of open files. The NFS server should now accept connections again.

We are trying to debug this intermittent file upload issue. During our debugging, we will stop uploads for short periods.

Files will eventually upload. Please remain patient; we are sorry for this.

We will also have more server reboots over the next few days, as Ivan mentions.
ID: 34038
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester
Joined: 15 Jul 05
Posts: 246
Credit: 5,974,599
RAC: 0
Message 34045 - Posted: 23 Jan 2018, 9:25:12 UTC

The underlying cause of the NFS server saturation is that files are left open when the BOINC file upload handler script times out. When a number of BOINC clients retry failed uploads frequently, the effect on our file servers is similar to a denial of service attack. It seems that our move to a load-balanced cluster some time back to increase capacity simply moved the bottleneck to the NFS storage layer. We will need to change our system architecture to get a permanent fix.
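
The failure mode described above (a handler that times out while it still holds a file open) can be illustrated with a short sketch. This is not the BOINC file upload handler: `receive_upload`, `UploadTimeout` and the injectable `clock` are hypothetical names used only to show the general pattern, namely that a `with` block closes the file handle even when a timeout aborts the transfer.

```python
import time


class UploadTimeout(Exception):
    """Raised when an upload exceeds its time budget (hypothetical name)."""


def receive_upload(path, chunks, timeout_seconds, clock=time.monotonic):
    """Write chunks to path and return the number of bytes written.

    The file is closed on normal completion and also when the timeout fires.
    """
    started = clock()
    written = 0
    with open(path, "wb") as handle:  # closed on normal exit and on exceptions
        for chunk in chunks:
            if clock() - started > timeout_seconds:
                raise UploadTimeout(f"upload of {path} timed out")
            handle.write(chunk)
            written += len(chunk)
    return written
```

The `clock` parameter exists only so the timeout path can be exercised without real waiting; a real handler would also have to decide what to do with the partial file that is left behind.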
ID: 34045
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 198,097,514
RAC: 88,596
Message 34090 - Posted: 26 Jan 2018, 8:29:16 UTC

Are you aware that the file server is currently reporting this again:

LHC@home 26-01-2018 09:25 [error] Error reported by file upload server: [Lm9LDm3i3yrnDDn7oo6G73TpABFKDmABFKDmpdKKDmABFKDm0Izqbn_0_r1695784418_ATLAS_result] locked by file_upload_handler PID=-1


Supporting BOINC, a great concept !
ID: 34090


©2024 CERN