Message boards : Theory Application : 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?
Joined: 15 Jun 08 · Posts: 2564 · Credit: 257,131,802 · RAC: 113,088
Allowed disk space is never unlimited. Besides the settings you can influence via your BOINC client, each workunit has a maximum set by the server. You can check it by examining your client_state.xml: locate the <workunit> section that corresponds to the task (ID) you want to look at. The disk limit in bytes is given by <rsc_disk_bound>.

Beyond that, each VM has a limit on the partition size of its virtual hard disk, which is where the logfile is written. I'm not sure, but isn't that size around 20 GB or so? This would explain why tasks are cancelled although there is plenty of free space on the host's disk. It does not explain why Sherpa in particular sometimes writes logfiles of that huge size. The latter could only be fixed by the project team, but if I understand older posts correctly, it has been mentioned here a couple of times that once a specific app version is published it will never be changed, to guarantee reproducible results - even if those results are known to be errors.

BTW, Erich's task did not finish successfully, as the log says:
2019-03-31 22:24:25 (10000): Guest Log: [INFO] Job finished in slot1 with 1.
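If you want to list those per-workunit limits on your own host, a small sketch like the one below will do; it simply scans client_state.xml line by line. The file path is an assumption (the Linux default BOINC data directory) and the script is only an illustration, not an official BOINC tool.

```python
import re

# Assumption: adjust this path to your BOINC data directory.
STATE_FILE = "/var/lib/boinc-client/client_state.xml"

name, in_wu = None, False
with open(STATE_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        if "<workunit>" in line:
            in_wu, name = True, None
        elif "</workunit>" in line:
            in_wu = False
        elif in_wu:
            m = re.search(r"<name>(.*?)</name>", line)
            if m and name is None:
                name = m.group(1)          # workunit name
            m = re.search(r"<rsc_disk_bound>([\d.eE+]+)</rsc_disk_bound>", line)
            if m:
                bound = float(m.group(1))  # server-side disk limit in bytes
                print(f"{name}: {bound:.0f} bytes (~{bound / 1024**2:.1f} MB)")
```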
Joined: 2 May 07 · Posts: 2249 · Credit: 174,093,647 · RAC: 5,356
> I'm not sure but isn't that size around 20 GB or so?

Yes, this is the limit that is hit when a Sherpa task writes an endless loop of log messages. We need help from the science people who define those Sherpa tasks, like Peter Skands.

Native Theory on Linux is not the problem for these crashing tasks; it works well. VM tasks (Windows or Linux) crash at the 20 GByte limit or the 18-hour limit. It would be useful to stop sending Sherpa tasks to us volunteers for the moment. I have stopped running Theory for now, until there is a solution for these Sherpa tasks.
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> The disk limit in Bytes is set by <rsc_disk_bound>.

The rsc_disk_bound for Theory VBox tasks is 7629.3945 MB (8,000,000,000 bytes). The task should stop earlier, but I've seen that the slot directory's total is not summed every minute and/or the size is not updated frequently by the OS. My impression is that a Sherpa job is writing KBs per second to the logfiles in those circumstances. It's striking that Erich seems to be much more often a victim of the DISK_LIMIT_EXCEEDED error condition.
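The client's check essentially boils down to summing everything in the slot directory and comparing the total with that bound. A rough, unofficial sketch of the same bookkeeping (the slot path is an example; pick the slot that is running your task):

```python
import os

SLOT_DIR = "/var/lib/boinc-client/slots/0"   # assumption: the slot running the Theory task
RSC_DISK_BOUND = 8_000_000_000               # bytes, the <rsc_disk_bound> quoted above

total = 0
for root, _dirs, files in os.walk(SLOT_DIR):
    for fn in files:
        try:
            total += os.path.getsize(os.path.join(root, fn))
        except OSError:
            pass  # a file may vanish or change while the VM is running

print(f"slot usage: {total / 1024**2:.1f} MB of {RSC_DISK_BOUND / 1024**2:.1f} MB allowed")
if total > RSC_DISK_BOUND:
    print("over the limit -> BOINC would abort the task with EXIT_DISK_LIMIT_EXCEEDED")
```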
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> Native-Theory Linux is not the problem for this crash of tasks. It works well.

During testing on the dev project I had one DISK_LIMIT_EXCEEDED error with a native Theory task, so it happens with the natives too. The difference is that the limit is set lower (~1907 MB), so a task will crash earlier.
Joined: 2 May 07 · Posts: 2249 · Credit: 174,093,647 · RAC: 5,356
Sorry Crystal, only Theory Sherpa tasks have a problem!
Joined: 31 Jan 11 · Posts: 12 · Credit: 3,557,813 · RAC: 0
Dear all,

From the project side, this is a difficult problem. We'd like to run Sherpa jobs - it is being used in LHC studies, so it is useful for the LHC@home project to run it. But as many of you are only too well aware, it is also the source of most of the problematic jobs in the project. We are not ourselves authors of the Sherpa Monte Carlo, so we can only pass the issues along to the authors and hope that they will be addressed in future releases.

In the past, Sherpa jobs on LHC@home had significant issues with infinite loops causing jobs to run to the time limit without finishing. Since versions are never patched backwards (bug fixes only go into new releases; old ones are not patched, for reproducibility), we had hoped to address this by deprecating some of the older Sherpa versions and only running the newest releases. I think that changeover happened relatively recently - though Anton Karneyeu can probably say more precisely when and what was done; he has been working on updating the project over the past few months. However, as is clear from this thread and related ones, the new versions have issues of their own.

I will try to collate the feedback you have been providing and contact the authors about what we are seeing. I am not myself able to suggest a "hot fix" that could be applied easily to at least detect and kill the problematic jobs. The only thing I could do easily would be to remove Sherpa jobs from the request system - but then we would lose the possibility to compare to Sherpa at all, which would also not be ideal.

First of all, I'd like to understand the magnitude of the problem: is this a relatively rare occurrence that happens only for the occasional Sherpa job, or does it happen most of the time? I saw a log posted on this thread that had only one successful Sherpa job for 13 or 14 failed ones. If that is typical, we might as well give up on running Sherpa, at least for the time being. Any feedback on how prevalent this issue is - whether we are still getting a good number of Sherpa jobs finishing or hardly any - would be appreciated.

With best regards, and apologies to those who are (understandably) frustrated about their CPU time going to jobs that are not producing science!
Peter
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
> ... and apologies to those who are (understandably) frustrated about their CPU time going to jobs that are not producing science! Peter

The next one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=220826628
It failed after 16 GB(!) of disk usage and 18 hours of processing time :-( And yes, this creates frustration :-( Please stop this nonsense!
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
Now this seems to happen every day: https://lhcathome.cern.ch/lhcathome/result.php?resultid=220893933
Once more, almost 14 hours of CPU time for nothing :-( More than annoying! Could someone please stop these faulty tasks!!!
Joined: 29 Sep 04 · Posts: 281 · Credit: 11,866,264 · RAC: 0
Two today. I thought I'd caught one in time before it reached the limit (16 GB when I noticed, 16 hrs wasted), but it errored out when I reset the VM. It also failed to remove itself from VBox and had to be deleted manually. The other one was north of 100 MB after 3 hrs and increasing, but the VM reset OK. Both were ee Sherpas.

While I was at it, I binned a pp Sherpa that wanted two days to complete, as I saw little point in letting it run to the time limit without completing in that time, losing anything useful in the not-uploaded, partially complete logs. I don't know where Crystal Pellet dug up those figures, but they make sad reading on the success rate, or lack thereof, of ee Sherpas.
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
The next one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=221079025
Why does no one stop submitting these faulty tasks???
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?

The reason is the massive growth of the running log file, and with it the size of the VM in the slot directory - it is writing ~2 MB every second. At the moment the vm_image.vdi is 2.1 GB.

I have a candidate here: 19 attempts - 12 failures and 7 lost.

```
===> [runRivet] Thu Apr 18 17:07:51 CEST 2019 [boinc pp jets 7000 170,-,2960 - sherpa 2.1.0 default 48000 44]

Channel_Elements::GenerateYUniform(1.04921,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds ! ymin, ymax vs. y : 0.0240183 -0.0240183 vs. -0.00922275
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.14112e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.9344e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.25106e+07
Channel_Elements::GenerateYUniform(1.06416,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds ! ymin, ymax vs. y : 0.0310934 -0.0310934 vs. 0.0126543
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.21439e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.15232e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04853e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04799e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.9716e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.95098e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.23729e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.21549e+07
Channel_Elements::GenerateYUniform(1.04776,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds ! ymin, ymax vs. y : 0.0233275 -0.0233275 vs. -0.0127388
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.13403e+07
Channel_Elements::GenerateYUniform(1.07081,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-0.0127388,}): Y out of bounds ! ymin, ymax vs. y : 0.034207 -0.034207 vs. 0.020383
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.24696e+07
Channel_Elements::GenerateYUniform(1.01653,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,0.020383,}): Y out of bounds ! ymin, ymax vs. y : 0.00819796 -0.00819796 vs. 0.00663273
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.981e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.94124e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.13743e+07
Channel_Elements::GenerateYUniform(1.0575,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds ! ymin, ymax vs. y : 0.027955 -0.027955 vs. 0.00218613
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.18176e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04083e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.12492e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.96146e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.16881e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.98863e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.0189e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.07392e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.96088e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.99324e+07
ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.98934e+07
```
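A possible way to spot such a runaway task from outside the VM is to watch how fast vm_image.vdi grows in each slot. The sketch below is only an illustration of that idea, not a project tool: the paths, the ~1 MB/s threshold and the polling interval are assumptions, and aborting the task is left to the user (e.g. via the BOINC Manager).

```python
import glob, os, time

SLOTS = "/var/lib/boinc-client/slots"   # assumption: Linux default BOINC data dir
THRESHOLD = 1 * 1024 * 1024             # flag growth above ~1 MB per second

last = {}
while True:                              # stop with Ctrl-C
    for vdi in glob.glob(os.path.join(SLOTS, "*", "vm_image.vdi")):
        try:
            size = os.path.getsize(vdi)
        except OSError:
            continue                     # slot may have been cleaned up
        prev_size, prev_t = last.get(vdi, (size, time.time()))
        now = time.time()
        rate = (size - prev_size) / max(now - prev_t, 1e-6)
        if rate > THRESHOLD:
            print(f"{vdi}: growing at {rate / 1024**2:.1f} MB/s "
                  f"({size / 1024**2:.0f} MB) - possibly a runaway Sherpa log")
        last[vdi] = (size, now)
    time.sleep(60)
```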
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
From what I can see, normally the VM in the slot directory is 492,544 kB.

> 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?
> The reason is massive growing of the running log file and so the size of the VM in the slot directory. Writing ~2MB every second. At the moment the vm_image.vdi is 2.1GB ...

Does all this mean that if I happen to detect an image bigger than that, or notice - by watching it for a minute - that its size keeps growing, it would make sense to abort the task? Probably YES, right?
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> From what I can see, normally the VM in the slot directory is 492,544 kB.

492,544 kB is the size of the vdi-file in the project directory; when used in a slot, it will increase in size after initialization and while processing sub-jobs. The sizes of the two tasks I'm running at the moment are 988,160 kB and 1,059,840 kB. You get better info from the console (ALT-F2) and from running.log via the localhost port number.
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
> From what I can see, normally the VM in the slot directory is 492,544 kB.

The interesting thing is that on my "main" PC, on which I am running 6 Theory tasks simultaneously, the vdi-files in the slot directories show 492,544 kB all the time, from beginning to end. On all other PCs the vdi-files in the slot directories grow continuously, as you explain above.
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
> the interesting thing is that on my "main" PC on which I am running 6 Theory tasks simultaneously, the vdi-files in the slot directories show 492,544 kB all the time, from beginning to end.

What ID does your 'main' PC have? What happens to the shown size when you refresh the directory you're watching with the F5 function key? And what is the size of the vdi-file when you look at the file properties?
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
ID: 10555784
When updating the slot directory with the F6 key, nothing happens - the size shown stays the same (492,544 kB). When going to the file properties, the correct file size seems to be shown. And after that, when going back to the slot directory, the correct file size shows up there too. Strange behaviour of Windows Explorer :-(
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
The next one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=221535924
Failed after nearly 15 hours, with 18,323 MB of disk usage. Why does no one stop the submission of these Sherpa tasks??? I am sick of wasting my CPU for nothing.
Joined: 14 Jan 10 · Posts: 1433 · Credit: 9,598,155 · RAC: 3,461
Another candidate for exceeding the disk limit, this time on LHC-dev, but coming from the same pool:

pp jets 8000 250,-,4160 - sherpa 1.4.3 default

25 attempts: 0 successes, 14 failures and 11 lost. The vm_image.vdi kept growing, meanwhile reaching 3,379,200 kB, until it was aborted by BOINC:

lhcathome-dev 23 Apr 18:05:52 Aborting task Theory_2279-790029-48_1: exceeded disk limit: 3300.38MB > 1907.35MB

As you see, the limit for tasks from the dev system is much lower, so they are aborted earlier.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2770803
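For anyone puzzled by the MB figures in that abort message: BOINC reports them in MiB (1024² bytes), so 1907.35 MB would correspond to a rsc_disk_bound of 2,000,000,000 bytes on the dev project, just as 7629.39 MB corresponds to 8,000,000,000 bytes on production. The byte values here are inferred from the reported MB figures, not read from the server configuration. A quick check:

```python
def to_mb(nbytes: float) -> float:
    """Convert a byte count to the MB (MiB) figure BOINC prints."""
    return nbytes / 1024**2

print(f"{to_mb(2_000_000_000):.2f} MB")     # 1907.35 -> the dev-project limit in the abort message
print(f"{to_mb(8_000_000_000):.2f} MB")     # 7629.39 -> the production Theory VBox bound
print(f"{to_mb(3_379_200 * 1024):.2f} MB")  # 3300.00 -> roughly the 3300.38 MB the growing vdi had reached
```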
Joined: 18 Dec 15 · Posts: 1835 · Credit: 120,833,522 · RAC: 79,459
The next one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=221643489
Failed after 13 1/2 hours :-( When will this nonsense finally be stopped???