Message boards : Number crunching : Error reported by file upload server: Server is out of disk space !?

dr_mabuse
Joined: 30 Dec 05
Posts: 57
Credit: 835,284
RAC: 0
Message 27087 - Posted: 29 Jan 2015, 8:12:26 UTC

Hello dear experts,
Since this morning I have been getting the following error messages and can't upload my results:
29.01.2015 09:09:19 | LHC@home 1.0 | Started upload of w1_jobhllhc10_sflathv_000_w1__26__s__62.31_60.32__24_26__5__37.5_1_sixvf_boinc10400_1_0
29.01.2015 09:09:19 | LHC@home 1.0 | Started upload of w1_jobhllhc10_sflathv_000_w1__27__s__62.31_60.32__4_6__5__67.5_1_sixvf_boinc10415_0_0
29.01.2015 09:09:21 | LHC@home 1.0 | [error] Error reported by file upload server: Server is out of disk space
29.01.2015 09:09:21 | LHC@home 1.0 | Temporarily failed upload of w1_jobhllhc10_sflathv_000_w1__26__s__62.31_60.32__24_26__5__37.5_1_sixvf_boinc10400_1_0: transient upload error
29.01.2015 09:09:21 | LHC@home 1.0 | Backing off 00:11:50 on upload of w1_jobhllhc10_sflathv_000_w1__26__s__62.31_60.32__24_26__5__37.5_1_sixvf_boinc10400_1_0
29.01.2015 09:09:22 | LHC@home 1.0 | Started upload of w1_jobhllhc10_sflathv_000_w1__27__s__62.31_60.32__6_8__5__45_1_sixvf_boinc10423_0_0
29.01.2015 09:09:23 | LHC@home 1.0 | [error] Error reported by file upload server: can't write file /data/boinc/project/sixtrack/upload/8c/w1_jobhllhc10_sflathv_000_w1__27__s__62.31_60.32__4_6__5__67.5_1_sixvf_boinc10415_0_0: No space left on server
29.01.2015 09:09:23 | LHC@home 1.0 | [error] Error reported by file upload server: Server is out of disk space
29.01.2015 09:09:23 | LHC@home 1.0 | Temporarily failed upload of w1_jobhllhc10_sflathv_000_w1__27__s__62.31_60.32__4_6__5__67.5_1_sixvf_boinc10415_0_0: transient upload error
29.01.2015 09:09:23 | LHC@home 1.0 | Backing off 00:09:30 on upload of w1_jobhllhc10_sflathv_000_w1__27__s__62.31_60.32__4_6__5__67.5_1_sixvf_boinc10415_0_0
29.01.2015 09:09:23 | LHC@home 1.0 | Temporarily failed upload of w1_jobhllhc10_sflathv_000_w1__27__s__62.31_60.32__6_8__5__45_1_sixvf_boinc10423_0_0: transient upload error
29.01.2015 09:09:23 | LHC@home 1.0 | Backing off 00:13:44 on upload of w1_jobhllhc10_sflathv_000_w1__27__s__62.31_60.32__6_8__5__45_1_sixvf_boinc10423_0_0

On the server status page no problem is flagged; everything is green.
What happened, and how long will it take to fix it?
Thanks for your help
Dr.Mabuse
ID: 27087
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27088 - Posted: 29 Jan 2015, 8:54:12 UTC - in response to Message 27087.  

Getting two variants of the same thing:

29/01/2015 08:48:47 | LHC@home 1.0 | [error] Error reported by file upload server: Server is out of disk space
29/01/2015 08:50:56 | LHC@home 1.0 | [error] Error reported by file upload server: can't write file /data/boinc/project/sixtrack/upload/ea/w200_HLLHC_RFcav_scanb3_50000.BOINC__3__s__62.31_60.32__16_18__5__28.5_1_sixvf_boinc668_1_0: No space left on server

Just a side effect of the large volume of work we've been doing recently. Some uploads are being accepted, presumably as older tasks are processed and files are deleted.
ID: 27088
Rae Lockyer
Joined: 17 Oct 07
Posts: 5
Credit: 177,594
RAC: 0
Message 27089 - Posted: 29 Jan 2015, 9:27:12 UTC

Uploads are stalling - possibly due to more smaller files being returned, so the server can't keep up with processing?
The Einstein project had a problem with uploads - I think that was due to the number of files.
ID: 27089
dr_mabuse
Joined: 30 Dec 05
Posts: 57
Credit: 835,284
RAC: 0
Message 27090 - Posted: 29 Jan 2015, 10:10:42 UTC

It is uploading now, at about 1 file per hour.
ID: 27090
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27091 - Posted: 29 Jan 2015, 11:00:05 UTC - in response to Message 27089.  

Uploads are stalling - possibly due to more smaller files being returned, so the server can't keep up with processing?

Unfortunately, the files seem to be all the same size, whether the task runs for seconds or several hours. We just fill up the space much more quickly if a lot of short-running tasks are passing through the system.

It probably helps (marginally) if we report any tasks which have successfully made it through the uploading stage, so that the next stage of processing can take place and the space can be freed up.
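
As a rough illustration of the two points above: if every result takes about the same space, the time until the upload area fills depends only on the net number of results arriving per hour. This is a minimal Python sketch; the function name and all numbers are hypothetical and are not figures from the LHC@home server.

```python
def hours_until_full(free_bytes, result_bytes, uploads_per_hour, deleted_per_hour=0):
    """Estimate hours until the upload area is full.

    Assumes every result file set has the same size (result_bytes),
    whether the task ran for seconds or hours. Returns None if files are
    deleted at least as fast as they arrive, so the area never fills.
    """
    net_bytes_per_hour = (uploads_per_hour - deleted_per_hour) * result_bytes
    if net_bytes_per_hour <= 0:
        return None
    return free_bytes / net_bytes_per_hour


# Hypothetical numbers: same disk, same file size, different task lengths.
long_tasks = hours_until_full(1_000_000, 100, uploads_per_hour=100)
short_tasks = hours_until_full(1_000_000, 100, uploads_per_hour=1000)
print(long_tasks, short_tasks)
```

With ten times as many short tasks returning per hour, the same free space lasts a tenth as long, which matches the observation that short-running work fills the disk faster. Raising deleted_per_hour (results that get reported, validated and cleaned up) is what stretches the time out again.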

The Einstein project had a problem with uploads - I think that was due to the number of files.

The problem at Einstein was that the server disk file system became incredibly slow - it was taking up to 8 seconds to locate a free 'inode' so that uploaded files could be stored and indexed. Given the thousands of files that a server needs to process, that slowed everything down to a crawl.
ID: 27091
T.J.
Joined: 17 Feb 07
Posts: 86
Credit: 968,855
RAC: 0
Message 27092 - Posted: 29 Jan 2015, 11:32:34 UTC

Early this morning I was able to upload by manual intervention. Now, at 11:31 UTC, that is no longer working. So perhaps someone needs to look at the process.


Greetings from,
TJ
ID: 27092
Rae Lockyer
Joined: 17 Oct 07
Posts: 5
Credit: 177,594
RAC: 0
Message 27093 - Posted: 29 Jan 2015, 12:18:40 UTC - in response to Message 27091.  

Yeah, same problem - capacity.

I suggest that no new tasks be sent out until the backlog clears, which would also give time to extend the filesystem one way or another.
ID: 27093
Viking69
Joined: 24 Jul 05
Posts: 56
Credit: 5,602,722
RAC: 4
Message 27094 - Posted: 29 Jan 2015, 12:30:59 UTC

Me too!
1/29/2015 4:29:31 AM | LHC@home 1.0 | [error] Error reported by file upload server: Server is out of disk space

Let's crunch for our future.
ID: 27094
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27095 - Posted: 29 Jan 2015, 12:35:30 UTC

I fear that we may already have reached catch-22...

Most of the work I've succeeded in uploading this morning has gone into 'pending', and is waiting for a wingmate to upload their work so it can validate and move on.

And I've got 55 completed tasks backed up, waiting to upload so they can validate somebody else's work.

But how can we ever get the two queues to meet each other?
ID: 27095
Rae Lockyer
Joined: 17 Oct 07
Posts: 5
Credit: 177,594
RAC: 0
Message 27096 - Posted: 29 Jan 2015, 14:19:54 UTC

I've emailed these people to see who can help, or at least contact the correct person to help:

• Eric McIntosh (eric.mcintosh@cern.ch)
• Harry Renshall (harry.renshall@cern.ch) - CERN BE-ABP-LCU - SixTrack expert
• Frank Schmidt (frank.schmidt@cern.ch) - CERN BE-ABP-ICE - SixTrack author and co-author tracking environment
• Igor Zacharov (igor.zacharov@gmail.com) - EPFL - BOINC system expert
• Massimo Giovannozzi (massimo.giovannozzi@cern.ch) - CERN BE-ABP-LCU - Responsible of LHC Commissioning and Upgrade Section
ID: 27096
Rae Lockyer
Joined: 17 Oct 07
Posts: 5
Credit: 177,594
RAC: 0
Message 27097 - Posted: 29 Jan 2015, 14:22:21 UTC

I've emailed the people below to see if they can get the correct person to investigate or fix this:

• Eric McIntosh (eric.mcintosh@cern.ch)
• Harry Renshall (harry.renshall@cern.ch) - CERN BE-ABP-LCU - SixTrack expert
• Frank Schmidt (frank.schmidt@cern.ch) - CERN BE-ABP-ICE - SixTrack author and co-author tracking environment
• Igor Zacharov (igor.zacharov@gmail.com) - EPFL - BOINC system expert
• Massimo Giovannozzi (massimo.giovannozzi@cern.ch) - CERN BE-ABP-LCU - Responsible of LHC Commissioning and Upgrade Section
ID: 27097
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 27098 - Posted: 29 Jan 2015, 15:25:47 UTC - in response to Message 27087.  

The transitioner, file deleter, database purger, the assimilators, and the test work unit validator have gone down as of this writing.
ID: 27098
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27100 - Posted: 29 Jan 2015, 15:35:02 UTC

I've had a PM reply from Eric McIntosh:

"I have reported to CERN BOINC support."
ID: 27100
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 27101 - Posted: 29 Jan 2015, 16:10:47 UTC
Last modified: 29 Jan 2015, 16:11:15 UTC

Could the problem be that the server is out of inodes? (In case you are wondering, an inode is a data structure that holds the metadata for one file: its attributes, plus pointers to the file's data blocks or, for larger files, to indirect blocks that in turn point to data blocks or to further indirect blocks.) Since we are still able to post messages to the message board (which is on the same computer as the upload server according to the server status page), I don't think that the disks are out of free disk space blocks.
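
For anyone who wants to tell the two cases apart on a Unix machine of their own, here is a minimal Python sketch using the standard os.statvfs call. The helper names space_report and diagnose are made up for this post, and nothing here says what the CERN server actually looks like. Both conditions usually surface as the same "No space left" error, which is why the error message alone cannot settle the question.

```python
import os


def space_report(path="/"):
    """Return free bytes and free inodes for the filesystem holding path (Unix only)."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,  # blocks available to non-root users
        "free_inodes": st.f_favail,               # inodes available to non-root users
        "total_inodes": st.f_files,               # some filesystems report 0 here
    }


def diagnose(free_bytes, free_inodes):
    """Classify a 'no space left' condition from the two counters."""
    if free_bytes == 0:
        return "out of data blocks"
    if free_inodes == 0:
        return "out of inodes"
    return "space available"


report = space_report(".")
print(report, diagnose(report["free_bytes"], report["free_inodes"]))
```

On the command line, df -h and df -i give the same two numbers. Note that filesystems which allocate inodes dynamically may report zero total inodes, in which case the inode counter says nothing useful.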

As for Einstein@home, its biggest problem was that its table of inodes got completely used, so its server started having to search the inode table to find a free inode to handle a file upload, with each search taking 8 seconds. Einstein@home's long term solution will be to reformat the crippled server with a newer version of XFS that allows free inodes to be tracked with a b-tree. While this creates overhead in consuming an inode to create a file or releasing an inode when deleting a file because the b-tree needs to be maintained, searching a b-tree for free inodes when there are no unused inodes left is much cheaper than searching the inode table row by row for a free inode.
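
The trade-off described above can be sketched in a few lines of Python. This is a toy model only: a min-heap stands in for the b-tree of free inodes, the class names are invented, and none of it reflects how XFS is actually implemented.

```python
import heapq


class LinearInodeTable:
    """Finds a free inode by scanning the table row by row."""

    def __init__(self, size):
        self.used = [False] * size

    def allocate(self):
        # Cost grows with table size when almost everything is in use.
        for number, in_use in enumerate(self.used):
            if not in_use:
                self.used[number] = True
                return number
        raise OSError("no free inodes")

    def release(self, number):
        self.used[number] = False  # cheap: no index to maintain


class IndexedInodeTable:
    """Keeps an ordered index of free inodes (stand-in for the b-tree)."""

    def __init__(self, size):
        self.free = list(range(size))
        heapq.heapify(self.free)

    def allocate(self):
        if not self.free:
            raise OSError("no free inodes")
        return heapq.heappop(self.free)  # no scan, but the index must be updated

    def release(self, number):
        heapq.heappush(self.free, number)  # extra bookkeeping on every delete


for table in (LinearInodeTable(5), IndexedInodeTable(5)):
    for _ in range(5):
        table.allocate()
    table.release(3)
    print(type(table).__name__, table.allocate())
```

Both versions hand back the same inode number. The difference is where the work goes: the linear table pays on every allocation once it is nearly full, while the indexed table pays a small maintenance cost on every allocate and release.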
ID: 27101
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 27103 - Posted: 29 Jan 2015, 16:32:27 UTC

My uploads just went through.
ID: 27103
White Mountain Wes
Joined: 1 Jan 09
Posts: 32
Credit: 1,106,567
RAC: 9
Message 27104 - Posted: 29 Jan 2015, 16:40:56 UTC - in response to Message 27103.  

Mine too.
ID: 27104
Uffe F
Joined: 9 Jan 08
Posts: 66
Credit: 727,923
RAC: 0
Message 27105 - Posted: 29 Jan 2015, 16:43:50 UTC - in response to Message 27104.  

Same here :)
ID: 27105
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27106 - Posted: 29 Jan 2015, 16:48:32 UTC

And the ones I got back in return all seem to be short-running, so I'm already generating new uploads to fill up the new disk (or cluster filesystem quota, as I suspect it may be). I hope the CERN admins are in a position to keep an eye on it overnight.
ID: 27106
Uffe F
Joined: 9 Jan 08
Posts: 66
Credit: 727,923
RAC: 0
Message 27107 - Posted: 29 Jan 2015, 17:42:19 UTC - in response to Message 27106.  

I got a lot of long ones that are resends, so that will at least clear some up.
ID: 27107
Jesse Viviano
Joined: 12 Feb 14
Posts: 72
Credit: 4,639,155
RAC: 0
Message 27108 - Posted: 29 Jan 2015, 17:56:12 UTC
Last modified: 29 Jan 2015, 17:57:08 UTC

Reporting work units won't help clear out the old files because the transitioner is down as of this writing. Therefore, the validator will not know that it needs to validate any files. The assimilator cannot copy good results into the database due to not knowing it needs to do its job. The file deleter cannot delete any files due to not knowing that there are files that need removal.
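
The dependency chain described above can be shown with a toy Python model. It follows the description in this post only: the state names, the Result class and run_pass are invented for illustration and are not BOINC server code.

```python
from dataclasses import dataclass

# Hypothetical state names; each step is handled by a different daemon.
NEXT_STATE = {
    "uploaded": "validated",         # validator
    "validated": "assimilated",      # assimilator
    "assimilated": "files_deleted",  # file deleter
}


@dataclass
class Result:
    name: str
    state: str = "uploaded"
    flagged: bool = False  # set by the transitioner, cleared by the next daemon


def run_pass(results, transitioner_up):
    """Run one pass of all daemons; return how many results had files deleted."""
    if transitioner_up:
        for result in results:
            if result.state in NEXT_STATE:
                result.flagged = True
    freed = 0
    for result in results:
        if result.flagged:  # daemons only act on what the transitioner flagged
            result.state = NEXT_STATE[result.state]
            result.flagged = False
            if result.state == "files_deleted":
                freed += 1
    return freed


stuck = [Result("a"), Result("b")]
print([run_pass(stuck, transitioner_up=False) for _ in range(3)])

moving = [Result("a"), Result("b")]
print([run_pass(moving, transitioner_up=True) for _ in range(3)])
```

With the transitioner down, no result is ever flagged, so nothing is validated, assimilated or deleted no matter how many passes run. With it up, each result moves one step per pass and the files are freed on the third.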
ID: 27108