Thread 'Tasks exceeding disk limit'

Author	Message
Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 795 Credit: 64,422,582 RAC: 33,111	Message 26760 - Posted: 2 Oct 2014, 17:52:51 UTC A load of new tasks just loaded. In about 25 minutes at least some of them are creating an error "exceeding disk limit", message on Boinc Manager says: 31311 LHC@home 1.0 2.10.2014 20:03:16 Aborting task w-b3_26000_job.HLLHC_b3_26000.0732__1__s__62.31_60.32__11_13__5__10.5883_1_sixvf_boinc6_1: exceeded disk limit: 199.54MB > 190.73MB So far I have 4 of 4 finished ones with this kind of error, here is one of them : http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=44495100 ID: 26760 · Reply Quote

Ananas Send message Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0	Message 26761 - Posted: 2 Oct 2014, 18:55:46 UTC - in response to Message 26760. Last modified: 2 Oct 2014, 19:11:12 UTC Yepp, that's why I came here too - and while watching what happens, I understood, why the CPU usage is so low in the startup phase : it creates and grows a lot of "fort.##" files simultanously, probably causing I/O-waits. Aborting task w-b3_-26000_job.HLLHC_b3_-26000.0732__1__s__62.31_60.32__15_17__5__28.2354_1_sixvf_boinc116_1: exceeded disk limit: 272.38MB > 190.73MB One current result has 72 fort.## files, 32 of them keep growing fast (~5MB each at the moment), total slot size currently ~160MB so it will sure crash soon. update : 6.3MB each, 200MB total slot (which already violates the limit) ID: 26761 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 795 Credit: 64,422,582 RAC: 33,111	Message 26762 - Posted: 2 Oct 2014, 19:10:42 UTC Last modified: 2 Oct 2014, 19:12:51 UTC ^Yep, I see that too. Some of the tasks finish before they crash, in a few minutes. Probably because of unstable beam conditions. Edit: the number of tasks out in the field is increasing fast. ID: 26762 · Reply Quote

Ray Murray Volunteer moderator Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 26765 - Posted: 2 Oct 2014, 20:28:49 UTC Last modified: 2 Oct 2014, 20:35:56 UTC This one, at a little over half the initial estimated time, is my longest one of these to make it to completion. Another only lasted a couple of mins so probably never really got started. forts 85-90 grew to around 12MB each. Missed some that have just errored but watched one whose forts 61-90 all grew to exactly the same size of just over 7,00kB which was enough to push it over the edge. All the erroring ones are of the w-b3_xx000_job.HLLHC.... variety. NNT for now until I flush these through and let the known good ones run and allow the cache to run dry. Hopefully these duff wus will all be cleared out by morning. I've also been neglecting vLHC while there was work here so I'll give that some air. ID: 26765 · Reply Quote

Matthias Lehmkuhl Send message Joined: 15 Jul 05 Posts: 27 Credit: 2,675,621 RAC: 0	Message 26768 - Posted: 2 Oct 2014, 22:01:49 UTC count me in, 397 LHC@home 1.0 02.10.2014 21:51:31 Aborting task w-b3_-14000_job.HLLHC_b3_-14000.0732__2__s__62.31_60.32__13_15__5__7.05884_1_sixvf_boinc304_3: exceeded disk limit: 209.22MB > 190.73MB Looks like we have some bad WU http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=21130329 http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=21124227 http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=21129470 all WU have more then one result with 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED And it's OS independent. On Windows also the "BOINC Windows Runtime Debugger" is started. Matthias ID: 26768 · Reply Quote

Professor Ray Send message Joined: 26 Nov 05 Posts: 39 Credit: 435,308 RAC: 0	Message 26771 - Posted: 3 Oct 2014, 3:51:00 UTC Last modified: 3 Oct 2014, 4:24:30 UTC Just started happen, other WU cruncher fine. What is this? 10/2/2014 8:23:33 PM \| LHC@home 1.0 \| Aborting task w-b3_-10000_job.HLLHC_b3_-10000.0732__1__s__62.31_60.32__11_13__5__24.7059_1_sixvf_boinc1442_1: exceeded disk limit: 198.54MB > 190.73MB http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=44587584 10/2/2014 11:14:44 PM \| LHC@home 1.0 \| Aborting task w-b3_-16000_job.HLLHC_b3_-16000.0732__4__s__62.31_60.32__13_15__5__38.8236_1_sixvf_boinc822_4: exceeded disk limit: 200.23MB > 190.73MB http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=44587584 BOINC dIsk preferences are to use at most 50% of total disk space (size allocated 12 GB), use less than 100GB (actual in-use: 0.75 GB), leave at most 0.25 GB free (0.8 GB actual free space on HDD) - but event log shows: 10/2/2014 1:37:05 AM \| \| max disk usage: 1.26GB (????) Could the output file be exceeding 1/2 GB in size? I'm gonna deep six these WU's until condition-red returns to safer yellow alert. EDIT: my stupid; event log shows error tripped threshold 190.73MB. Furthermore, max disk usage 1.26GB = allocated - used - 0.25 GB specified reserve ID: 26771 · Reply Quote

jay Send message Joined: 10 Aug 07 Posts: 60 Credit: 837,760 RAC: 0	Message 26772 - Posted: 3 Oct 2014, 6:50:04 UTC - in response to Message 26771. Last modified: 3 Oct 2014, 6:55:40 UTC Greetings, Also getting a task result of "Maximum disk usage exceeded". Is there a patch we can make to allow more disk space? Or, is this embedded in the task code?? I'm getting these around 15 to 20 minutes into the tasks. Thanks, Jay -- edit-- PS The disk limit does not seem to be related to the BOINC disk resource limit that the user sets up. ( I have >70 GB allocated for BOINC, only 1.7 used. Atlas had used up a large pre-allocated disk space...) ID: 26772 · Reply Quote

jay Send message Joined: 10 Aug 07 Posts: 60 Credit: 837,760 RAC: 0	Message 26773 - Posted: 3 Oct 2014, 7:10:12 UTC - in response to Message 26772. Another weird thing.... I looked at the co-pilots of the WU that went - over-limit for me. of the 30, I only saw one wu that was sent to a client that did not fail. http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=21177447 I wonder why it didn't have the problem.. Jay ID: 26773 · Reply Quote

Speedy Send message Joined: 28 Jul 05 Posts: 42 Credit: 767,068 RAC: 0	Message 26774 - Posted: 3 Oct 2014, 7:17:46 UTC Last modified: 3 Oct 2014, 7:19:21 UTC Jay, another weird thing is the result showing as pending has now run time, the first time I have ever seen this. I also have 6 results that have exited with extending the disk limit. Ranging between 1000 and 1500 seconds run-time Have A Crunching Good day ID: 26774 · Reply Quote

Professor Ray Send message Joined: 26 Nov 05 Posts: 39 Credit: 435,308 RAC: 0	Message 26775 - Posted: 3 Oct 2014, 8:46:08 UTC Last modified: 3 Oct 2014, 8:54:59 UTC What I un'erstan'in max disk overflow is project specified threshold - its obviously not a client specified parameter. Why is LHC overflowing its disks: dunno. What's truly weird, is that after I posted, I got the white screen of death message - all links main page - intimated project is down - within a couple minutes everything was all peachy keen - as if 'what happen?' - and none of the servers on status page intimate any, nada, prollem. I'm not saying its grumpkins and snarks. But its grumpkins and* snarks*. Given that overflowed WU seem to complete by wingmen, maybe there's some sort of 'seed' parameter implemented in the cruncher, i.e., varies the crunching slightly. Either that, or its a precision issue, i.e., 2+2=4 on many machines and eventually they overflow, but on other machines 2+2 = 3.5 and they can complete the WU w/ out overflowing. ID: 26775 · Reply Quote

Richard Haselgrove Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0	Message 26776 - Posted: 3 Oct 2014, 8:55:09 UTC The w-b3_ tasks are sent by the server with this line in the workunit definition: <rsc_disk_bound> 200000000 - decimal 200 MB. While running, working files fort.61 thru fort.90 seem to start growing exponentially: this screengrab was taken at about 50% progress. With a disk usage of 229 MB already for those 30 files, BOINC should have aborted the task already - and did so shortly afterwards. ID: 26776 · Reply Quote

Ken Beishir Send message Joined: 11 Dec 05 Posts: 4 Credit: 1,077,763 RAC: 0	Message 26785 - Posted: 3 Oct 2014, 16:28:17 UTC - in response to Message 26776. I am seeing lot of errors too (example): Name w-b3_-22000_job.HLLHC_b3_-22000.0732__11__s__62.31_60.32__11_13__5__44.1178_1_sixvf_boinc3935_2 Workunit 21232197 Created 3 Oct 2014, 11:42:05 UTC Sent 3 Oct 2014, 13:37:22 UTC Received 3 Oct 2014, 16:22:06 UTC Server state Over Outcome Computation error Client state Compute error Exit status 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED Computer ID 10315741 Report deadline 11 Oct 2014, 5:09:36 UTC Run time 4,885.56 CPU time 4,349.03 Validate state Invalid Credit 0.00 Application version SixTrack v451.07 (pni) ID: 26785 · Reply Quote

Matthias Lehmkuhl Send message Joined: 15 Jul 05 Posts: 27 Credit: 2,675,621 RAC: 0	Message 26786 - Posted: 3 Oct 2014, 16:52:48 UTC some more results 1328 LHC@home 1.0 03.10.2014 14:17:43 Aborting task w-b3_-12000_job.HLLHC_b3_-12000.0732__3__s__62.31_60.32__13_15__5__14.1177_1_sixvf_boinc558_3: exceeded disk limit: 197.24MB > 190.73MB 1392 LHC@home 1.0 03.10.2014 14:47:45 Aborting task w-b3_30000_job.HLLHC_b3_30000.0732__3__s__62.31_60.32__15_17__5__47.6472_1_sixvf_boinc1916_4: exceeded disk limit: 192.40MB > 190.73MB 1427 LHC@home 1.0 03.10.2014 14:52:45 Aborting task w-b3_24000_job.HLLHC_b3_24000.0732__4__s__62.31_60.32__11_13__5__75.8825_1_sixvf_boinc2141_2: exceeded disk limit: 213.06MB > 190.73MB 1471 LHC@home 1.0 03.10.2014 15:22:49 Aborting task w-b3_0_job.HLLHC_b3_0.0732__13__s__62.31_60.32__11_13__5__10.5883_1_sixvf_boinc4388_0: exceeded disk limit: 193.56MB > 190.73MB 1507 LHC@home 1.0 03.10.2014 15:52:52 Aborting task w-b3_4000_job.HLLHC_b3_4000.0732__8__s__62.31_60.32__13_15__5__67.059_1_sixvf_boinc3137_3: exceeded disk limit: 202.94MB > 190.73MB 1508 LHC@home 1.0 03.10.2014 15:52:52 Aborting task w-b3_-24000_job.HLLHC_b3_-24000.0732__10__s__62.31_60.32__11_13__5__15.8824_1_sixvf_boinc3676_3: exceeded disk limit: 191.12MB > 190.73MB 1546 LHC@home 1.0 03.10.2014 16:27:55 Aborting task w-b3_-22000_job.HLLHC_b3_-22000.0732__10__s__62.31_60.32__13_15__5__63.5296_1_sixvf_boinc3746_2: exceeded disk limit: 216.95MB > 190.73MB example WU http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=21252862 some result actual finished successful. Matthias ID: 26786 · Reply Quote

keputnam Send message Joined: 27 Sep 04 Posts: 111 Credit: 8,602,669 RAC: 0	Message 26787 - Posted: 3 Oct 2014, 17:28:47 UTC Same here Lots of Disk Space Exceeded errors, all on w-3b* WUs But I've also got some w-3b WUs that complete successfully <shrug> ID: 26787 · Reply Quote

Professor Ray Send message Joined: 26 Nov 05 Posts: 39 Credit: 435,308 RAC: 0	Message 26790 - Posted: 3 Oct 2014, 18:13:01 UTC Last modified: 3 Oct 2014, 18:22:46 UTC project completion: 42% @ 2.6666 hrs project size = 6.20MB (6.21 slack) slct size = 191MB (192 slack) slot init_data.xml <rsc_memory_bound>100000000.000000</rsc_memory_bound> <rsc_disk_bound>200000000.000000</rsc_disk_bound> I change disk value to 400000000 and restart BOINC Manager. I wait short time and Now at 43.5% complete and slot size 195MB (196 slack), project size 6.20MB (6.21 slack). At 2.8 hrs it blow up: 10/3/2014 2:05:45 PM \| LHC@home 1.0 \| Aborting task w-b3_30000_job.HLLHC_b3_30000.0732__4__s__62.31_60.32__11_13__5__58.2354_1_sixvf_boinc2072_4: exceeded disk limit: 198.24MB > 190.73MB http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=44914176 Dunno. I remember having to edit init for Lattice. Forgot what was done there. Mebbe hafta restart computer after edit init.xml? ID: 26790 · Reply Quote

Richard Haselgrove Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0	Message 26791 - Posted: 3 Oct 2014, 18:24:14 UTC - in response to Message 26790. I change disk value to 400000000 and restart. You would probably need to make that change in client_state.xml rather than init_data.xml Very much at your own risk. ID: 26791 · Reply Quote

pvh Send message Joined: 17 Jun 13 Posts: 8 Credit: 6,548,286 RAC: 0	Message 26793 - Posted: 3 Oct 2014, 18:51:46 UTC I have racked up 240 failed tasks now due to "Maximum disk usage exceeded"... I am temporarily disabling this project until this is fixed. Too many resources wasted here... ID: 26793 · Reply Quote

Professor Ray Send message Joined: 26 Nov 05 Posts: 39 Credit: 435,308 RAC: 0	Message 26795 - Posted: 3 Oct 2014, 19:59:49 UTC - in response to Message 26791. I change disk value to 400000000 and restart. You would probably need to make that change in client_state.xml rather than init_data.xml Very much at your own risk. I just discovered that init_data.xml gets overwritten on BOINC Manager exit. I discovered that client_state.xml contains entry for LHC WU. I made aforementioned change to both files now. That's schizo coding; if the slot doesn't get created until a WU is started - the slot contains the init_data.xml - but the client_state.xml is the driver (overwriting init_data.xml edit?). We be crunching another one now. We'll see if it finishes. If so, there's the validation issue subsequent; if no wingman can complete the WU w/o abend due to disk overflow. ID: 26795 · Reply Quote

KWSN-GMC-Peeper of the Castle ... Send message Joined: 6 Oct 05 Posts: 18 Credit: 952,091 RAC: 0	Message 26799 - Posted: 3 Oct 2014, 21:24:31 UTC Got one of these myself yesterday..with over 13 GB of the 20 allotted to BOINC still available... ID: 26799 · Reply Quote

Professor Ray Send message Joined: 26 Nov 05 Posts: 39 Credit: 435,308 RAC: 0	Message 26806 - Posted: 3 Oct 2014, 23:00:23 UTC Last modified: 3 Oct 2014, 23:45:25 UTC After 3.25 hrs still crunchin' w/ 30 forts comprising 202MB for grantotal of 220 and still growin' - but there be 26 forts of 0kb so I've no idea how this'll play out. Its chunkin' and ka-thunkin' and whirrin' and blinkin' and flashin' and beepin' and boppin'....the little hamster wheel be whinin' something fierce - we be needin' to put some beef tallow on the bearing hub. EDIT: at 2340 UTC its crunched 255MB with headroom to 381MB - per aforementioned tweak to client_state.xml - at which point I can still give him 'nother 177MB That notwithstanding, I'm a gonna hafta suspend all other projects w/ no new work - no body else'll have room to run. ID: 26806 · Reply Quote