Message boards : ATLAS application : Uploads of finished tasks not possible since last night
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33361 - Posted: 14 Dec 2017, 18:44:13 UTC - in response to Message 33359.  

While I am griping around here, I would like to add/point out that I have plenty of my own work to accomplish without having to check

BOINC is a hands-off approach to contributing CPU time. Once the issue is fixed server side, your client will catch up and resume normal operation.

Spewing hyperbole at other volunteers will not help. Check back in a week.

what the people at CERN should be aware of is that the longer it takes them to get the problem solved, the more finished tasks will reach their deadline, making them invalid.


Remember that the individual amount of work our machines contribute is pretty insignificant, statistically, in the greater context. It's sobering to lose some work now and then. I'm struggling myself, because I actually wanted to purge the project routinely but couldn't for some time.
ID: 33361
Erich56

Joined: 18 Dec 15
Posts: 1451
Credit: 35,443,085
RAC: 42,710
Message 33362 - Posted: 14 Dec 2017, 18:54:13 UTC - in response to Message 33361.  

On the Project Status page I just noticed that there are no new ATLAS tasks available.
It definitely was a wise idea to halt distribution of new tasks until the current upload problem gets solved.
ID: 33362
sjmielh

Joined: 27 Aug 16
Posts: 8
Credit: 615,935
RAC: 372
Message 33363 - Posted: 14 Dec 2017, 19:22:11 UTC

I also have an upload problem, but I think it is different from the one mentioned here. I have an ATLAS task that is stuck at uploading 100%. Nothing helps to get the upload to finish. It doesn't say anything about not being able to connect to the server, as others mention here. I tried different things, for example pressing Update and restarting my computer, but nothing makes it finish the upload. The event log just says: "LHC@home | Not requesting tasks: don't need (CPU: not highest priority project; Intel GPU: )".

Sjmielh
ID: 33363
Jim1348

Joined: 15 Nov 14
Posts: 568
Credit: 17,908,590
RAC: 20,231
Message 33364 - Posted: 14 Dec 2017, 20:05:41 UTC - in response to Message 33363.  

I also have an upload problem, but I think it is different from the one mentioned here.

No, it is the same problem. I don't get the "can't connect to server" message either. In fact, the server is apparently not the problem, but the database. No one knows when it will be fixed.
ID: 33364
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33365 - Posted: 14 Dec 2017, 20:06:11 UTC - in response to Message 33363.  

Sounds like the issue we all share. You can check the BOINC logs "Messages" in Advanced to verify.
ID: 33365
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 354
Credit: 12,055,123
RAC: 3,093
Message 33366 - Posted: 14 Dec 2017, 20:14:51 UTC - in response to Message 33362.  

On the Project Status page I just noticed that there are no new ATLAS tasks available.
It definitely was a wise idea to halt distribution of new tasks until the current upload problem gets solved.


This was not intentional. We have so many "running" WU that we hit an internal limit in our submission system, which was there as a safety valve and which we never thought we'd reach... I've increased the limit to allow more WU.

As Nils mentioned in the other thread, the server performance has been tweaked to handle the increased traffic better. This probably does not mean that everything will now work perfectly, but it increases your chances of an upload succeeding, so please have a little more patience if things still don't work.
ID: 33366
Erich56

Joined: 18 Dec 15
Posts: 1451
Credit: 35,443,085
RAC: 42,710
Message 33367 - Posted: 14 Dec 2017, 21:00:38 UTC - in response to Message 33366.  

As Nils mentioned in the other thread, the server performance has been tweaked to handle the increased traffic better. This probably does not mean that everything will now work perfectly, but it increases your chances of an upload succeeding, so please have a little more patience if things still don't work.

David, I am coming back to what Crystal Pellet wrote a few hours ago:
It seems like new tasks are able to upload but those that finished on Tuesday night when we had the broken server are still stuck. The admins are looking into it.

It looks like the try of a result upload occupies a slot on the server, that's not freed when an upload fails.
Maybe a retry therefore fails over and over until that slot is freed manually.
So what does the last sentence mean exactly? How or who will manually free the occupied upload slots? And even more important: WHEN?
As I said before, many of these tasks will reach their deadline very soon. So, time is of the essence, and if there is no quick solution to this problem, all these tasks will be lost.
ID: 33367
Erich56

Joined: 18 Dec 15
Posts: 1451
Credit: 35,443,085
RAC: 42,710
Message 33369 - Posted: 15 Dec 2017, 6:19:50 UTC - in response to Message 33367.  

Unfortunately, the situation got worse again during last night:
whereas yesterday, at least the newer tasks (i.e. the ones that were started AFTER the server crash on Tuesday) were uploaded properly, this morning I noticed that more "new" tasks which got finished during last night did NOT upload.

So, at this point, the problem still exists for the "old" tasks (i.e. the ones that were started BEFORE the server crash) AND for the "new" tasks as well.
In other words: it all ended up in a real mess :-(((

Perhaps it would be best not to make any more tasks available for download before this gross problem gets solved (the Server Status Page shows more than 22.000 tasks being processed - I wonder how they can all be uploaded in a timely manner). Otherwise, many thousands of computation hours of the volunteers will be for nothing. Even more so as the deadline for the "older" tasks is approaching rapidly, and most of them will become invalid unless a solution can be found today.
ID: 33369
maeax

Joined: 2 May 07
Posts: 1303
Credit: 39,712,943
RAC: 16,445
Message 33370 - Posted: 15 Dec 2017, 6:43:54 UTC - in response to Message 33347.  

This is a message from the upload server:

14.12.2017 13:07:41 | LHC@home | [error] Error reported by file upload server: [qeyLDmVScirnDDn7oo6G73TpABFKDmABFKDmGPHKDmABFKDmmplz6m_0_r1896860968_ATLAS_result] locked by file_upload_handler PID=-1


In the SL69 native app, new upload files show the status "download"(?). The others show "retry in ... hours".
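
For anyone who wants to count how many of their uploads are hit by this error, here is a minimal sketch in Python. It assumes the event log has been saved as plain text lines and that the line layout matches the single example quoted above; the function name `find_locked_uploads` is made up for illustration and is not part of BOINC.

```python
import re

# Pattern for the client-side log line quoted above. The field layout is
# inferred from that single example and is an assumption.
LOCKED_UPLOAD = re.compile(
    r"\[error\] Error reported by file upload server: "
    r"\[(?P<result_file>[^\]]+)\] locked by file_upload_handler PID=(?P<pid>-?\d+)"
)


def find_locked_uploads(log_lines):
    """Return (result_file, pid) pairs for every 'locked by' upload error."""
    hits = []
    for line in log_lines:
        match = LOCKED_UPLOAD.search(line)
        if match:
            hits.append((match.group("result_file"), int(match.group("pid"))))
    return hits


if __name__ == "__main__":
    sample = [
        "14.12.2017 13:07:41 | LHC@home | [error] Error reported by file upload server: "
        "[qeyLDmVScirnDDn7oo6G73TpABFKDmABFKDmGPHKDmABFKDmmplz6m_0_r1896860968_ATLAS_result] "
        "locked by file_upload_handler PID=-1",
        "14.12.2017 13:07:42 | LHC@home | Not requesting tasks: don't need",
    ]
    for name, pid in find_locked_uploads(sample):
        print(name, pid)
```

Lines that do not contain the "locked by file_upload_handler" error are simply skipped, so the whole event log can be passed in.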
ID: 33370
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 33371 - Posted: 15 Dec 2017, 6:57:11 UTC

The error message from BOINC looks similar to the one in this thread, so maybe the problem and the solution are also similar:

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4162
ID: 33371
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 229
Credit: 5,359,497
RAC: 894
Message 33372 - Posted: 15 Dec 2017, 7:21:12 UTC

I have the same message for a couple of tasks that are awaiting upload. There were some tasks that finally uploaded during the night.

The file servers are now slightly less loaded, and we have upgraded the NFS volume to improve I/O. Still, we will get these errors for a while until half-uploaded results have been cleaned. (There is a script that does this for ATLAS jobs, as pointed out in the thread Gyllic refers to.)

We are looking at ways to accelerate this cleaning process, but need to be careful not to remove uploaded files that have not yet been validated and assimilated.

Thanks for your crunching and contributions, and please continue to be patient with this upload batch.
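
To illustrate the caution about the cleaning process, here is a rough sketch in Python of how candidates for removal could be picked without touching finished uploads. This is not the actual ATLAS cleanup script. The function name, the idea of passing in a set of completed result names (for example taken from the DB) and the age threshold are all assumptions made for the example, and it only lists files, it deletes nothing.

```python
import os
import tempfile
import time


def stale_partial_uploads(upload_dir, completed_names, min_age_seconds, now=None):
    """List files that look like abandoned partial uploads.

    completed_names: names of fully uploaded results that still await
    validation/assimilation; these are never returned. (Hypothetical input,
    e.g. taken from the project database.)
    """
    now = time.time() if now is None else now
    candidates = []
    for entry in os.scandir(upload_dir):
        if not entry.is_file():
            continue
        if entry.name in completed_names:
            continue  # finished upload, must not be removed
        age = now - entry.stat().st_mtime
        if age >= min_age_seconds:
            candidates.append(entry.path)
    return sorted(candidates)


if __name__ == "__main__":
    # Dry run on a throwaway directory: one old partial file, one old but
    # completed file; only the partial one should be reported.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("partial_result", "completed_result"):
            path = os.path.join(demo_dir, name)
            open(path, "w").close()
            two_hours_ago = time.time() - 7200
            os.utime(path, (two_hours_ago, two_hours_ago))
        print(stale_partial_uploads(demo_dir, {"completed_result"}, 3600))
```

Keeping the selection step separate from any deletion makes it possible to review the list before anything is removed.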
ID: 33372
Erich56

Joined: 18 Dec 15
Posts: 1451
Credit: 35,443,085
RAC: 42,710
Message 33373 - Posted: 15 Dec 2017, 7:39:35 UTC - in response to Message 33372.  

Thanks for your crunching and contributions, and please continue to be patient with this upload batch.
Nils, two things:

I am coming back once more to what Crystal Pellet wrote a few hours ago:
It looks like the try of a result upload occupies a slot on the server, that's not freed when an upload fails.
Maybe a retry therefore fails over and over until that slot is freed manually.
Has this been investigated further?

The other thing: with BOINC retry intervals of 5+ hours, it will be very difficult to get the waiting tasks uploaded in time before their deadline :-(
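
To put a number on that, here is a small Python sketch that counts how many upload retries fit before a deadline. The 5-hour starting delay comes from the observation above; the growth factor and the cap are purely illustrative assumptions and not BOINC's real back-off parameters.

```python
def upload_attempts_before_deadline(hours_to_deadline, first_delay_hours=5.0,
                                    growth=1.0, max_delay_hours=24.0):
    """Count retry attempts that fit before the deadline.

    Delays start at first_delay_hours, are multiplied by growth after each
    failed attempt, and are capped at max_delay_hours. All defaults are
    illustrative assumptions, not BOINC's real back-off parameters.
    """
    if first_delay_hours <= 0 or growth < 1.0:
        raise ValueError("need a positive first delay and growth >= 1")
    elapsed = 0.0
    delay = first_delay_hours
    attempts = 0
    while elapsed + delay <= hours_to_deadline:
        elapsed += delay
        attempts += 1
        delay = min(delay * growth, max_delay_hours)
    return attempts


if __name__ == "__main__":
    print(upload_attempts_before_deadline(24))              # constant 5 h delay
    print(upload_attempts_before_deadline(24, growth=2.0))  # delay doubles each time
```

With 24 hours left, a constant 5-hour delay allows four attempts; if the delay doubles after every failure, only two attempts fit.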
ID: 33373
San-Fernando-Valley

Joined: 26 Mar 16
Posts: 30
Credit: 1,245,747
RAC: 0
Message 33375 - Posted: 15 Dec 2017, 8:02:37 UTC - in response to Message 33361.  

still having upload problems ...

can't check back in a week ...

deadlines will have passed ...

running WUs under BOINC "control" is no "most advanced physics" - it is just plain IT

ATLAS is messing up my rigs ...

As it says: we use your spare time when your PC is idling ...

I just want to contribute, not do personal research ...

Excuse me for griping - just having a couple of bad days
ID: 33375
sjmielh

Joined: 27 Aug 16
Posts: 8
Credit: 615,935
RAC: 372
Message 33376 - Posted: 15 Dec 2017, 8:11:33 UTC - in response to Message 33364.  

I also have an upload problem, but I think it is different from the one mentioned here.

No, it is the same problem. I don't get the "can't connect to server" message either. In fact, the server is apparently not the problem, but the database. No one knows when it will be fixed.


Thanks for the confirmation, Jim
ID: 33376
Erich56

Joined: 18 Dec 15
Posts: 1451
Credit: 35,443,085
RAC: 42,710
Message 33377 - Posted: 15 Dec 2017, 8:19:24 UTC - in response to Message 33376.  

No one knows when it will be fixed.
that's what I am afraid of, too :-(
ID: 33377
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 229
Credit: 5,359,497
RAC: 894
Message 33378 - Posted: 15 Dec 2017, 8:40:07 UTC - in response to Message 33377.  

Sorry. I can assure you that we're still working on it. AFAIK the problem is not the DB, but partially uploaded files that block the file_upload process. Trying to find one of them manually takes ages with the current load. :-(
ID: 33378
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 33379 - Posted: 15 Dec 2017, 8:55:13 UTC - in response to Message 33378.  

Why do you delete the partial uploads manually?
I remember that David Cameron wrote a script to fix this issue in the past.
His script just has to be adapted to the new upload server...
ID: 33379
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1821
Credit: 123,382,205
RAC: 79,991
Message 33381 - Posted: 15 Dec 2017, 9:33:35 UTC - in response to Message 33378.  

... partially uploaded files that block the file_upload process. Trying to find one of them manually takes ages ...

Do you need some input from our side?
Task IDs or anything else?
ID: 33381
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 690
Credit: 434,089,145
RAC: 73,431
Message 33382 - Posted: 15 Dec 2017, 9:47:27 UTC

I aborted the 15-20 in my queues; they work well now. I assume the work will be re-created if needed, or someone else's work unit will re-validate.
ID: 33382
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 229
Credit: 5,359,497
RAC: 894
Message 33384 - Posted: 15 Dec 2017, 10:14:00 UTC
Last modified: 15 Dec 2017, 10:14:54 UTC

The cleanup script is stuck due to load, so we will temporarily stop the upload servers for a while to clear this backlog and the half-uploaded entries.

Thus you will see a different message when your BOINC clients try to upload. Please simply let them back off; later we'll enable the file servers again.

Thanks for your patience.

No task IDs etc. should be needed, thanks; we can get them from the DB, and we have plenty of samples from our own BOINC clients.
ID: 33384
©2021 CERN