Message boards : ATLAS application : Uploads of finished tasks not possible since last night

marmot
Joined: 5 Nov 15
Posts: 138
Credit: 6,222,460
RAC: 0
Message 33410 - Posted: 16 Dec 2017, 14:32:05 UTC - in response to Message 33409.  

On other projects, when a server glitch or maintenance cycle happened, aborting the upload and then clicking Update for the project would clear the upload slot and lead to a successful upload. Here, however, the WU gets a failed upload error and is wasted.

Can't this be prevented as the data set is completed and safely on the client drives?
Why should a successfully completed data set be lost (4 entire CPU days per WU on my machines) because the upload slot was cleared and the upload retried at a later time?
ID: 33410
Erich56
Joined: 18 Dec 15
Posts: 1343
Credit: 25,145,785
RAC: 23,373
Message 33414 - Posted: 16 Dec 2017, 15:58:14 UTC - in response to Message 33404.  

David Cameron wrote this morning at 8:40 UTC:
"The upload server got completely full so I'm cleaning it now - there is an automatic cleaning tool but it wasn't cleaning enough to handle the huge volumes of data we got this week."
Did anyone manage to upload a finished ATLAS task during the course of the day?

I sat at the PC for lengthy periods and pushed the "retry now" button several hundred times, all day long, with no effect. A waste of time. Not a single one of the numerous finished tasks got uploaded.
I suspect that it all ended up in a mess, and that in the end we may as well delete all our finished tasks, as they will have reached their deadlines before they can be uploaded.
ID: 33414
tullio
Joined: 19 Feb 08
Posts: 644
Credit: 3,920,608
RAC: 1,022
Message 33415 - Posted: 16 Dec 2017, 16:47:02 UTC

I have one ATLAS task trying to upload on a Linux box and 2 on a Windows PC. The Linux task has a 20 December deadline, the Windows 22 December. Since ATLAS and SixTrack tasks are the only ones not failing on my PCs I am not going to delete them.
Tullio
ID: 33415
Jesse Viviano
Joined: 12 Feb 14
Posts: 71
Credit: 1,789,891
RAC: 0
Message 33416 - Posted: 16 Dec 2017, 18:45:07 UTC

I have uploaded two of the three results that were stuck, but the third is still stuck with a "locked by file_upload_handler PID=-1" error:
12/16/2017 1:43:11 PM | LHC@home | Started upload of tKSLDmWh3irnDDn7oo6G73TpABFKDmABFKDm9BLKDmABFKDmF5OqMn_1_r815870200_ATLAS_result
12/16/2017 1:43:16 PM | LHC@home | [error] Error reported by file upload server: [tKSLDmWh3irnDDn7oo6G73TpABFKDmABFKDm9BLKDmABFKDmF5OqMn_1_r815870200_ATLAS_result] locked by file_upload_handler PID=-1
12/16/2017 1:43:16 PM | LHC@home | Temporarily failed upload of tKSLDmWh3irnDDn7oo6G73TpABFKDmABFKDm9BLKDmABFKDmF5OqMn_1_r815870200_ATLAS_result: transient upload error
12/16/2017 1:43:16 PM | LHC@home | Backing off 03:39:17 on upload of tKSLDmWh3irnDDn7oo6G73TpABFKDmABFKDm9BLKDmABFKDmF5OqMn_1_r815870200_ATLAS_result
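
For anyone who wants to scan their own event log for this pattern, here is a minimal Python sketch that picks out the upload-server errors and the back-off times from lines in the format shown above. The function name `parse_upload_events` and the regular expressions are assumptions based only on these few log lines; they are not taken from any BOINC log format specification.

```python
import re
from datetime import timedelta

# Patterns inferred from the log lines quoted above (an assumption, not a BOINC spec).
ERROR_RE = re.compile(
    r"\[error\] Error reported by file upload server: \[(?P<file>[^\]]+)\] (?P<reason>.+)$"
)
BACKOFF_RE = re.compile(
    r"Backing off (?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2}) on upload of (?P<file>\S+)$"
)


def parse_upload_events(lines):
    """Return (errors, backoffs), two dicts keyed by result file name.

    errors maps file name -> reason text reported by the upload server.
    backoffs maps file name -> timedelta the client will wait before retrying.
    """
    errors, backoffs = {}, {}
    for line in lines:
        line = line.strip()
        match = ERROR_RE.search(line)
        if match:
            errors[match.group("file")] = match.group("reason")
            continue
        match = BACKOFF_RE.search(line)
        if match:
            backoffs[match.group("file")] = timedelta(
                hours=int(match.group("h")),
                minutes=int(match.group("m")),
                seconds=int(match.group("s")),
            )
    return errors, backoffs
```

Feeding it the lines of a saved event log gives a quick list of which results are hitting the lock error and how long each one is backed off.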
ID: 33416
Carlos
Joined: 10 Nov 17
Posts: 6
Credit: 213,871
RAC: 0
Message 33417 - Posted: 16 Dec 2017, 19:04:37 UTC - in response to Message 33289.  

Since last night, uploads of finished tasks get stuck in "backoff" Status.
Any problem with the ATLAS upload server?


Same problem here for a few days. The upload speed and the progress bar stay at 0.000% (KB); only the upload timer runs, but it stops after a few minutes and a message appears: 'restarting in ....' (mostly over 4 h!).
One simulation has been doing this for 2 days, and the 'Active tasks' window shows 'upload in progress' even with 0 progress!
ID: 33417
Carlos
Joined: 10 Nov 17
Posts: 6
Credit: 213,871
RAC: 0
Message 33418 - Posted: 16 Dec 2017, 19:08:55 UTC - in response to Message 33414.  

Finished tasks, yes, but the upload always fails.
ID: 33418
Erich56
Joined: 18 Dec 15
Posts: 1343
Credit: 25,145,785
RAC: 23,373
Message 33419 - Posted: 16 Dec 2017, 19:12:59 UTC - in response to Message 33417.  

A short time ago, several of my finished tasks were finally uploaded, which I am very pleased about, of course!

However, 2 "old" ones (i.e. ones that were started before the server crashed) did NOT upload. The deadlines for those are tomorrow and the day after tomorrow.

Any chance that they will be uploaded in time?
ID: 33419
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 330
Credit: 11,384,447
RAC: 3,673
Message 33420 - Posted: 16 Dec 2017, 19:36:07 UTC - in response to Message 33419.  

Here is my interpretation of what is happening:

- the upload server is overloaded, so many uploads fail leaving a half-complete file
- the retries fail because the half-complete file is still there (the "locked by file_upload_handler PID=-1" error)
- our cleaning of incomplete files runs only once per day, so there is no possibility of retries succeeding until one day has passed
- the server getting full last night was yet another problem but this is now fixed

I have changed the cleaning to run once every 6 hours and delete files older than 6 hours to make it more aggressive. But if you have a failed upload you'll still have to wait some time before it will work, so clicking retry every few minutes won't help.
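
As a rough illustration of the cleaning step described above (this is not the actual script running on the server), a sketch in Python might look like the following. The function name `clean_stale_uploads`, the directory argument and the 6-hour default are assumptions made for the example.

```python
import os
import time


def clean_stale_uploads(upload_dir, max_age_hours=6.0, now=None):
    """Delete regular files under upload_dir not modified for max_age_hours.

    Hypothetical helper: a real cleaner would also have to make sure it never
    removes complete results that are still waiting to be processed.
    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    removed = []
    for root, _dirs, files in os.walk(upload_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed.append(path)
            except FileNotFoundError:
                # The file vanished between listing and removal; nothing to do.
                pass
    return removed
```

Run from a scheduler every 6 hours, a stale half-complete file would survive for at most roughly 12 hours in the worst case (just under 6 hours old at one run, removed at the next), which is why a failed upload still needs some waiting time before a retry can succeed.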

In addition, I've limited submission to keep only 10,000 WU in the server, so there won't be any new WU until the upload backlog clears and the number of running WU drops below 10,000.
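
The limit can be pictured as a simple cap: new work is only submitted to top the server up to the limit. A minimal sketch, using the hypothetical function name `wus_to_submit` (the real submission logic is not shown in this thread):

```python
def wus_to_submit(in_server, pending, cap=10_000):
    """Number of new work units to submit without exceeding the cap.

    in_server: WUs currently held by the server (assumed to include running ones).
    pending:   WUs waiting to be submitted.
    cap:       maximum number of WUs allowed in the server at once.
    """
    if in_server < 0 or pending < 0:
        raise ValueError("counts must be non-negative")
    free_slots = max(0, cap - in_server)
    return min(pending, free_slots)
```

With the server above the cap the function returns 0, which corresponds to "no new WU until the backlog clears".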

If it looks like many WU will go over the deadline we can look at how to extend the deadline, although I don't know an easy way to do that right now.
ID: 33420
Erich56
Joined: 18 Dec 15
Posts: 1343
Credit: 25,145,785
RAC: 23,373
Message 33421 - Posted: 16 Dec 2017, 20:23:06 UTC - in response to Message 33420.  

Thanks, David, for the thorough explanations, and also for your efforts to straighten things out.

So we'll see what happens during the next few days :-)
ID: 33421
Jesse Viviano
Joined: 12 Feb 14
Posts: 71
Credit: 1,789,891
RAC: 0
Message 33422 - Posted: 16 Dec 2017, 23:56:13 UTC - in response to Message 33420.  

I think that I have seen other BOINC projects have upload handlers that either automatically start over on failed file uploads, or direct the BOINC client to start the upload at the point where the interruption occurred. Could this project be programmed to do either of these?
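
To make the second option concrete, here is a toy Python sketch of a resumable upload: the client first asks how many bytes the server already holds and then sends only the remainder. Everything here (the `ToyUploadServer` class, its methods and the `upload` function) is hypothetical and purely in-memory; it is not a description of BOINC's actual upload protocol.

```python
class ToyUploadServer:
    """In-memory stand-in for an upload handler (hypothetical, for illustration)."""

    def __init__(self):
        self.partial = {}  # file name -> bytes received so far

    def size_of(self, name):
        """Number of bytes already stored for this file (0 if unknown)."""
        return len(self.partial.get(name, b""))

    def append(self, name, offset, chunk):
        """Append a chunk, but only if it continues exactly where the file ends."""
        if offset != self.size_of(name):
            raise ValueError("offset mismatch")
        self.partial[name] = self.partial.get(name, b"") + chunk


def upload(server, name, data, chunk_size=4, fail_after=None):
    """Send data in chunks, resuming from whatever the server already has.

    fail_after simulates a dropped connection after that many chunks.
    Returns True once the server holds the whole file, False otherwise.
    """
    offset = server.size_of(name)
    sent = 0
    while offset < len(data):
        if fail_after is not None and sent >= fail_after:
            return False  # connection lost; the partial file stays on the server
        chunk = data[offset:offset + chunk_size]
        server.append(name, offset, chunk)
        offset += len(chunk)
        sent += 1
    return True
```

In this scheme a half-complete file on the server is an asset rather than an obstacle: the retry simply continues from the stored offset instead of being rejected.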
ID: 33422
csbyseti
Joined: 6 Jul 17
Posts: 22
Credit: 26,987,451
RAC: 48,150
Message 33423 - Posted: 17 Dec 2017, 8:36:21 UTC

Thanks, David, for your work over the weekend.

Some uploads finished during the night; some are still stuck at 100% upload but not finished.
In my opinion the bottleneck is the step between the upload reaching 100% and the 'upload OK and task closed' message being sent to the BOINC client.
I don't know how the server side of BOINC works (perhaps it copies the temporary upload file), but it looks like the BOINC client runs into a timeout and marks the upload as failed, which results in a complete new upload.
If this happens on many clients, you'll get a huge amount of upload load.

So if there is a timeout value in the BOINC client, doubling this value would help projects with big upload files.
ID: 33423
Jesse Viviano
Joined: 12 Feb 14
Posts: 71
Credit: 1,789,891
RAC: 0
Message 33424 - Posted: 17 Dec 2017, 9:03:29 UTC

I have finally been able to upload and report my ATLAS@home task.
ID: 33424
AuxRx
Joined: 16 Sep 17
Posts: 100
Credit: 1,579,669
RAC: 900
Message 33425 - Posted: 17 Dec 2017, 11:04:09 UTC - in response to Message 33423.  

I could be mistaken, but I think some WUs simply jumped to 100% when the upload failed at some x%. For example, the last percentage I saw was 63%, then suddenly 100%, followed by a failure and a retry. The upload could not have completed at the current speed (50kbps) in that short amount of time.
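
A back-of-the-envelope check of that claim, as a small Python sketch. The result file size of 110 MB is purely an assumption for illustration, and since "50kbps" could mean kilobits or kilobytes per second, both readings are computed. The function name `remaining_seconds` is hypothetical.

```python
def remaining_seconds(file_size_bytes, fraction_done, bytes_per_second):
    """Time still needed to upload the rest of a file at a constant rate."""
    remaining_bytes = file_size_bytes * (1.0 - fraction_done)
    return remaining_bytes / bytes_per_second


FILE_SIZE = 110e6  # assumed result size in bytes (not stated in this post)

# Reading 1: 50 kilobits per second = 6250 bytes per second.
slow = remaining_seconds(FILE_SIZE, 0.63, 50_000 / 8)
# Reading 2: 50 kilobytes per second = 50000 bytes per second.
fast = remaining_seconds(FILE_SIZE, 0.63, 50_000)

print(f"kilobits reading:  about {slow / 3600:.1f} hours left")
print(f"kilobytes reading: about {fast / 60:.0f} minutes left")
```

Under these assumptions the remaining 37% needs roughly 1.8 hours or roughly 14 minutes, so a jump from 63% to 100% within moments would not reflect real transfer progress in either case.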

What I am trying to say is, the numbers you see could be misleading you to think the database is the issue even though it is not.
ID: 33425
PHILIPPE
Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 33426 - Posted: 17 Dec 2017, 16:16:27 UTC - in response to Message 33425.  

@csbyseti: "Don't know how server side Boinc works"

It's true that this part of the BOINC project is not clearly described (at least the general guidelines should appear somewhere, to explain to volunteers what the terms on the project's server status page mean).

So I found a short summary on the net:



Task server (or scheduling server) in detail:

The scheduler handles requests from BOINC clients
The feeder caches jobs which are not yet transmitted

The transitioner examines jobs for which a state change has occurred and handles this change
The database purger removes jobs and instance database entries that are no longer needed
The validator compares the instances of a work unit
The assimilator handles tasks which are done
The file deleter deletes input and output files that are no longer needed

The work generator creates new jobs and their input files

Unfortunately, the only component missing from this picture is the file upload handler, which is not linked to database storage, but I found a picture on the net (slide no. 5), and its functioning is explained on slides no. 22 - 24.

Server directory structure

The directory structure for a typical BOINC project looks like:

PROJECT/
    apps/
    bin/
    cgi-bin/
    log_HOSTNAME/
    pid_HOSTNAME/
    download/
    html/
    inc/
    ops/
    project/
    stats/
    user/
    user_profile/
    keys/
    upload/



where PROJECT is the name of the project and HOSTNAME is the server host. Each project directory contains:

    apps: application and core client executables
    bin: server daemons and programs.
    cgi-bin: CGI programs
    log_HOSTNAME: log output
    pid_HOSTNAME: lock files, pid files
    download: storage for data server downloads.
    html: PHP files for public and private web interfaces
    keys: encryption keys
    upload: storage for data server uploads.



The upload and download directories may contain large numbers (millions) of files. For efficiency they are normally organized as a hierarchy of subdirectories.
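
A common way to build such a hierarchy is to hash the file name and use the hash to pick a subdirectory, so that files spread evenly and the location can be recomputed from the name alone. The Python sketch below illustrates the idea; the function name `fanout_path`, the use of SHA-256 and the fan-out of 1024 are assumptions for illustration and may not match what BOINC actually does.

```python
import hashlib
import os


def fanout_path(root, filename, fanout=1024):
    """Map filename to root/<hex bucket>/filename using a stable hash.

    The same file name always lands in the same bucket, so no lookup table
    is needed to find a file again.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % fanout
    return os.path.join(root, format(bucket, "x"), filename)
```

With a fan-out of 1024, a million files would end up at roughly a thousand files per subdirectory instead of a million in one.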

Further information is available on Wikipedia.

ID: 33426
Jim1348
Joined: 15 Nov 14
Posts: 491
Credit: 14,279,943
RAC: 10,022
Message 33427 - Posted: 17 Dec 2017, 19:41:11 UTC - in response to Message 33408.  

Good News: My first three native ATLAS on my two Ubuntu machines ran properly and completed without error.
They averaged about 3 1/2 hours on two cores per work unit (i7-4770, i7-4790).
Bad News: They are stuck in upload also.

All three of my native ATLAS have now uploaded automatically in the last day. But I hope someone will give us the "all clear" when things are back to normal. I don't want to get any more until then, though that may not be until January.
ID: 33427
maeax
Joined: 2 May 07
Posts: 1095
Credit: 37,075,061
RAC: 18,605
Message 33455 - Posted: 21 Dec 2017, 9:31:53 UTC

ATLAS isn't uploading finished tasks.
There is a problem with the upload server at the moment; see the SixTrack threads.
ID: 33455
Harri Liljeroos
Joined: 28 Sep 04
Posts: 500
Credit: 26,508,876
RAC: 13,430
Message 33458 - Posted: 21 Dec 2017, 19:05:09 UTC - in response to Message 33455.  

ATLAS isn't uploading finished tasks.
There is a problem with the upload server at the moment; see the SixTrack threads.

So what is the 110 MB file Atlas is trying to upload? The file I have has been stuck for 16 hours.
ID: 33458
maeax
Joined: 2 May 07
Posts: 1095
Credit: 37,075,061
RAC: 18,605
Message 33459 - Posted: 21 Dec 2017, 19:08:19 UTC

One of my 5 ATLAS was uploaded a few moments ago.
We have to wait...... and we are hopeful.
ID: 33459
Harri Liljeroos
Joined: 28 Sep 04
Posts: 500
Credit: 26,508,876
RAC: 13,430
Message 33464 - Posted: 21 Dec 2017, 21:06:02 UTC

The one that is stuck has now retried 10 times. The last retry went to 100% but still failed with a transient HTTP error. I have successfully uploaded 3 tasks today on two different hosts.
ID: 33464
newman
Joined: 16 May 08
Posts: 4
Credit: 1,031,316
RAC: 774
Message 33479 - Posted: 23 Dec 2017, 14:11:11 UTC - in response to Message 33464.  

I now also have 2 WUs not uploading.

Sa 23 Dez 2017 15:07:34 CET | LHC@home | [error] Error reported by file upload server: [wpaMDmkmnjrnSu7Ccp2YYBZmABFKDmABFKDmO0IKDmABFKDmioHCBn_1_r210029558_ATLAS_result] locked by file_upload_handler PID=-1
Sa 23 Dez 2017 15:07:34 CET | LHC@home | Temporarily failed upload of wpaMDmkmnjrnSu7Ccp2YYBZmABFKDmABFKDmO0IKDmABFKDmioHCBn_1_r210029558_ATLAS_result: transient upload error
Sa 23 Dez 2017 15:07:34 CET | LHC@home | Backing off 01:19:18 on upload of wpaMDmkmnjrnSu7Ccp2YYBZmABFKDmABFKDmO0IKDmABFKDmioHCBn_1_r210029558_ATLAS_result
ID: 33479


©2021 CERN