Message boards : Theory Application : 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?
Joined: 15 Jun 08 · Posts: 2564 · Credit: 257,131,802 · RAC: 113,088
Allowed disk space is never unlimited. Besides the settings you can influence via your BOINC client, each workunit has a maximum set by the server. You can check it by examining your client_state.xml: locate the <workunit> section that corresponds to the task (ID) you want to look at. The disk limit in bytes is given by <rsc_disk_bound>.

Beyond that, each VM has a limit on the partition size of its virtual hard disk, which is where the logfile is written. I'm not sure, but isn't that size around 20 GB or so? This would explain why tasks are cancelled although there is plenty of free space on the host's disk. It does not explain why Sherpa in particular sometimes writes logfiles of that huge size. The latter could only be fixed by the project team, but if I understand older posts correctly, it has been mentioned here a couple of times that once a specific app version is published it will never be changed, to guarantee reproducible results - even if those results are known to be errors.

BTW, Erich's task did not finish successfully, as the log says:
2019-03-31 22:24:25 (10000): Guest Log: [INFO] Job finished in slot1 with 1.
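If you want to list those per-workunit limits on your own host, a small sketch like the one below will do; it simply scans client_state.xml line by line. The file path is an assumption (the Linux default BOINC data directory) and the script is only an illustration, not an official BOINC tool.

```python
import re

# Assumption: adjust this path to your BOINC data directory.
STATE_FILE = "/var/lib/boinc-client/client_state.xml"

name, in_wu = None, False
with open(STATE_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        if "<workunit>" in line:
            in_wu, name = True, None
        elif "</workunit>" in line:
            in_wu = False
        elif in_wu:
            m = re.search(r"<name>(.*?)</name>", line)
            if m and name is None:
                name = m.group(1)          # workunit name
            m = re.search(r"<rsc_disk_bound>([\d.eE+]+)</rsc_disk_bound>", line)
            if m:
                bound = float(m.group(1))  # server-side disk limit in bytes
                print(f"{name}: {bound:.0f} bytes (~{bound / 1024**2:.1f} MB)")
```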
Joined: 2 May 07 · Posts: 2249 · Credit: 174,093,647 · RAC: 5,356
> I'm not sure but isn't that size around 20 GB or so?

Yes, this is the limit that is hit when a Sherpa task writes an endless loop of log messages. We need help from the science people who define those Sherpa tasks, like Peter Skands.

Native Theory on Linux is not the problem for these crashing tasks; it works well. VM tasks (Windows or Linux) crash at the 20 GByte limit or the 18-hour limit. It would be useful to stop sending Sherpa tasks to us volunteers for the moment. I have stopped running Theory for now, until there is a solution for these Sherpa tasks.
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> The disk limit in Bytes is set by <rsc_disk_bound>.

The rsc_disk_bound for Theory VBox tasks is 7629.3945 MB (8,000,000,000 bytes). The task should stop earlier, but I've seen that the slot directory's total is not summed every minute and/or the size is not updated frequently by the OS. My impression is that a Sherpa job is writing KBs per second to the logfiles in those circumstances. It's striking that Erich seems to be much more often a victim of the DISK_LIMIT_EXCEEDED error condition.
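The client's check essentially boils down to summing everything in the slot directory and comparing the total with that bound. A rough, unofficial sketch of the same bookkeeping (the slot path is an example; pick the slot that is running your task):

```python
import os

SLOT_DIR = "/var/lib/boinc-client/slots/0"   # assumption: the slot running the Theory task
RSC_DISK_BOUND = 8_000_000_000               # bytes, the <rsc_disk_bound> quoted above

total = 0
for root, _dirs, files in os.walk(SLOT_DIR):
    for fn in files:
        try:
            total += os.path.getsize(os.path.join(root, fn))
        except OSError:
            pass  # a file may vanish or change while the VM is running

print(f"slot usage: {total / 1024**2:.1f} MB of {RSC_DISK_BOUND / 1024**2:.1f} MB allowed")
if total > RSC_DISK_BOUND:
    print("over the limit -> BOINC would abort the task with EXIT_DISK_LIMIT_EXCEEDED")
```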
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> Native-Theory Linux is not the problem for this crash of tasks. It works well.

During testing on the dev project I had one DISK_LIMIT_EXCEEDED error with a native Theory task, so it happens with the natives too. The difference is that the limit is set lower (~1907 MB), so a task will crash earlier.
Joined: 2 May 07 · Posts: 2249 · Credit: 174,093,647 · RAC: 5,356
Sorry Crystal, only Theory Sherpa tasks have a problem!
Joined: 31 Jan 11 · Posts: 12 · Credit: 3,557,813 · RAC: 0
Dear all,

From the project side, this is a difficult problem. We'd like to run Sherpa jobs - it is being used in LHC studies, so it is useful for the LHC@home project to run it. But as many of you are only too well aware, it is also the source of most of the problematic jobs in the project. We are not ourselves authors of the Sherpa Monte Carlo, so we can only pass the issues along to the authors and hope that they will be addressed in future releases.

In the past, Sherpa jobs on LHC@home had significant issues with infinite loops causing jobs to run to the time limit without finishing. Since versions are never patched backwards (bug fixes only go into new releases; old ones are not patched, for reproducibility), we had hoped to address this by deprecating some of the older Sherpa versions and only running the newest releases. I think that changeover happened relatively recently - though Anton Karneyeu can probably say more precisely when and what was done; he has been working on updating the project over the past few months. However, as is clear from this thread and related ones, the new versions have issues of their own.

I will try to collate the feedback you have been providing and contact the authors about what we are seeing. I am not myself able to suggest a "hot fix" that could be applied easily to at least detect and kill the problematic jobs. The only thing I could do easily would be to remove Sherpa jobs from the request system - but then we would lose the possibility to compare to Sherpa at all, which would also not be ideal.

First of all, I'd like to understand the magnitude of the problem: is this a relatively rare occurrence that happens only for the occasional Sherpa job, or does it happen most of the time? I saw a log posted on this thread that had only one successful Sherpa job for 13 or 14 failed ones. If that is typical, we might as well give up on running Sherpa, at least for the time being. Any feedback on how prevalent this issue is - whether we are still getting a good number of Sherpa jobs finishing or hardly any - would be appreciated.

With best regards, and apologies to those who are (understandably) frustrated about their CPU time going to jobs that are not producing science!
Peter
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
> ... and apologies to those who are (understandably) frustrated about their CPU time going to jobs that are not producing science! Peter

The next one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=220826628
It failed after 16 GB(!) of disk usage and 18 hours of processing time :-( And yes, this creates frustration :-( Please stop this nonsense!
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
Now this seems to happen every day: https://lhcathome.cern.ch/lhcathome/result.php?resultid=220893933
Once more, almost 14 hours of CPU time for nothing :-( More than annoying! Could someone please stop these faulty tasks!!!
Joined: 29 Sep 04 · Posts: 281 · Credit: 11,866,264 · RAC: 0
Two today. I thought I'd caught one in time before it reached the limit (16 GB when I noticed, 16 hrs wasted), but it errored out when I reset the VM. It also failed to remove itself from VBox and had to be deleted manually. The other one was north of 100 MB after 3 hrs and increasing, but the VM reset OK. Both were ee Sherpas.

While I was at it, I binned a pp Sherpa that wanted two days to complete, as I saw little point in letting it run to the time limit without completing in that time, losing anything useful in the not-uploaded, partially complete logs. I don't know where Crystal Pellet dug up those figures, but they make sad reading on the success rate, or lack thereof, of ee Sherpas.
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
The next one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=221079025
Why does no one stop submitting these faulty tasks???
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?

The reason is the massive growth of the running log file, and with it the size of the VM in the slot directory - it is writing ~2 MB every second. At the moment the vm_image.vdi is 2.1 GB.

I have a candidate here: 19 attempts - 12 failures and 7 lost.

```
===> [runRivet] Thu Apr 18 17:07:51 CEST 2019 [boinc pp jets 7000 170,-,2960 - sherpa 2.1.0 default 48000 44]

Channel_Elements::GenerateYUniform(1.04921,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds ! ymin, ymax vs. y : 0.0240183 -0.0240183 vs. -0.00922275
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.14112e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.9344e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.25106e+07
Channel_Elements::GenerateYUniform(1.06416,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds ! ymin, ymax vs. y : 0.0310934 -0.0310934 vs. 0.0126543
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.21439e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.15232e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04853e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04799e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.9716e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.95098e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.23729e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.21549e+07
Channel_Elements::GenerateYUniform(1.04776,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds ! ymin, ymax vs. y : 0.0233275 -0.0233275 vs. -0.0127388
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.13403e+07
Channel_Elements::GenerateYUniform(1.07081,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-0.0127388,}): Y out of bounds ! ymin, ymax vs. y : 0.034207 -0.034207 vs. 0.020383
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.24696e+07
Channel_Elements::GenerateYUniform(1.01653,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,0.020383,}): Y out of bounds ! ymin, ymax vs. y : 0.00819796 -0.00819796 vs. 0.00663273
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.981e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.94124e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.13743e+07
Channel_Elements::GenerateYUniform(1.0575,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds ! ymin, ymax vs. y : 0.027955 -0.027955 vs. 0.00218613
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.18176e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04083e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.12492e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.96146e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.16881e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.98863e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.0189e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.07392e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.96088e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.99324e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.98934e+07
```
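A possible way to spot such a runaway task from outside the VM is to watch how fast vm_image.vdi grows in each slot. The sketch below is only an illustration of that idea, not a project tool: the paths, the ~1 MB/s threshold and the polling interval are assumptions, and aborting the task is left to the user (e.g. via the BOINC Manager).

```python
import glob, os, time

SLOTS = "/var/lib/boinc-client/slots"   # assumption: Linux default BOINC data dir
THRESHOLD = 1 * 1024 * 1024             # flag growth above ~1 MB per second

last = {}
while True:                              # stop with Ctrl-C
    for vdi in glob.glob(os.path.join(SLOTS, "*", "vm_image.vdi")):
        try:
            size = os.path.getsize(vdi)
        except OSError:
            continue                     # slot may have been cleaned up
        prev_size, prev_t = last.get(vdi, (size, time.time()))
        now = time.time()
        rate = (size - prev_size) / max(now - prev_t, 1e-6)
        if rate > THRESHOLD:
            print(f"{vdi}: growing at {rate / 1024**2:.1f} MB/s "
                  f"({size / 1024**2:.0f} MB) - possibly a runaway Sherpa log")
        last[vdi] = (size, now)
    time.sleep(60)
```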
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
From what I can see, normally the VM in the slot directory is 492,544 kB.

> 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?
> The reason is massive growing of the running log file and so the size of the VM in the slot directory. Writing ~2MB every second. At the moment the vm_image.vdi is 2.1GB ...

Does all this mean that if I happen to detect an image bigger than that, or notice - by watching it for a minute - that its size keeps growing, it would make sense to abort the task? Probably YES, right?
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> From what I can see, normally the VM in the slot directory is 492,544 kB.

492,544 kB is the size of the vdi-file in the project directory; when used in a slot, it will increase in size after initialization and while processing sub-jobs. The sizes of the two tasks I'm running at the moment are 988,160 kB and 1,059,840 kB. You get better info from the console (ALT-F2) and from running.log via the localhost port number.
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
> From what I can see, normally the VM in the slot directory is 492,544 kB.

The interesting thing is that on my "main" PC, on which I am running 6 Theory tasks simultaneously, the vdi-files in the slot directories show 492,544 kB all the time, from beginning to end. On all other PCs the vdi-files in the slot directories grow continuously, as you explain above.
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> the interesting thing is that on my "main" PC on which I am running 6 Theory tasks simultaneously, the vdi-files in the slot directories show 492,544 kB all the time, from beginning to end.

What ID does your 'main' PC have? What happens to the shown size when you refresh the directory you're watching with the F5 function key? And what is the size of the vdi-file when you look at the file properties?
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
ID: 10555784
When updating the slot directory with the F6 key, nothing happens - the size shown stays the same (492,544 kB). When going to the file properties, the correct file size seems to be shown. And after that, when going back to the slot directory, the correct file size shows up there too. Strange behaviour of Windows Explorer :-(
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
The next one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=221535924
Failed after nearly 15 hours, with 18,323 MB of disk usage. Why does no one stop the submission of these Sherpa tasks??? I am sick of wasting my CPU for nothing.
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
Another candidate for exceeding the disk limit, this time on LHC-dev, but coming from the same pool:

pp jets 8000 250,-,4160 - sherpa 1.4.3 default

25 attempts: 0 successes, 14 failures and 11 lost. The vm_image.vdi kept growing, meanwhile reaching 3,379,200 kB, until it was aborted by BOINC:

lhcathome-dev 23 Apr 18:05:52 Aborting task Theory_2279-790029-48_1: exceeded disk limit: 3300.38MB > 1907.35MB

As you see, the limit for tasks from the dev system is much lower, so they are aborted earlier.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2770803
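For anyone puzzled by the MB figures in that abort message: BOINC reports them in MiB (1024² bytes), so 1907.35 MB would correspond to a rsc_disk_bound of 2,000,000,000 bytes on the dev project, just as 7629.39 MB corresponds to 8,000,000,000 bytes on production. The byte values here are inferred from the reported MB figures, not read from the server configuration. A quick check:

```python
def to_mb(nbytes: float) -> float:
    """Convert a byte count to the MB (MiB) figure BOINC prints."""
    return nbytes / 1024**2

print(f"{to_mb(2_000_000_000):.2f} MB")     # 1907.35 -> the dev-project limit in the abort message
print(f"{to_mb(8_000_000_000):.2f} MB")     # 7629.39 -> the production Theory VBox bound
print(f"{to_mb(3_379_200 * 1024):.2f} MB")  # 3300.00 -> roughly the 3300.38 MB the growing vdi had reached
```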
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
The next one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=221643489
Failed after 13 1/2 hours :-( When will this nonsense finally be stopped???