Message boards : Theory Application : 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,329,342
RAC: 131,970
Message 38509 - Posted: 1 Apr 2019, 19:38:14 UTC

Allowed disk space is never unlimited. Besides the settings you can influence via your BOINC client, each
workunit has a maximum set by the server.
You can check it by examining your client_state.xml.

Locate the <workunit> section that corresponds to the task (ID) you want to look at.
The disk limit in bytes is set by <rsc_disk_bound>.
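If you prefer to script that lookup, here is a minimal sketch (assuming the default Windows data directory path and that your client_state.xml parses as well-formed XML; a plain text search for <rsc_disk_bound> works just as well):

import xml.etree.ElementTree as ET

# Typical Windows default; adjust to your own BOINC data directory.
CLIENT_STATE = r"C:\ProgramData\BOINC\client_state.xml"

root = ET.parse(CLIENT_STATE).getroot()
for wu in root.iter("workunit"):
    name = wu.findtext("name", "")
    if name.startswith("Theory"):                     # filter for the task(s) you care about
        bound = float(wu.findtext("rsc_disk_bound", "0"))
        print(f"{name}: {bound:,.0f} bytes ({bound / 1024**2:.1f} MB)")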


Besides that, each VM has a limit on the partition size of its virtual hard disk.
This is where the logfile is written to.
I'm not sure, but isn't that size around 20 GB or so?



This would explain why tasks are cancelled although there is plenty of free space on the host's disk.
It does not explain why sherpa in particular sometimes writes logfiles of that huge size.

The latter could only be fixed by the project team, but if I understand correctly from older posts (mentioned here a couple of times), once a specific app version is published it will never be changed, in order to guarantee reproducible results - even if those results are known to be errors.




BTW,
Erich's task did not finish successfully, as the log says:
2019-03-31 22:24:25 (10000): Guest Log: [INFO] Job finished in slot1 with 1.
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,788
RAC: 143,603
Message 38510 - Posted: 1 Apr 2019, 20:41:10 UTC - in response to Message 38509.  

I'm not sure but isn't that size around 20 GB or so?

Yes, this is the limit reached when a Sherpa task writes an endless loop of log sentences.
We need help from the science people who define those Sherpa tasks, like Peter Skands.

Native Theory on Linux is not the problem for this kind of crash; it works well.
VM tasks (Windows or Linux) crash with the 20 GB limit or the 18-hour limit.

It would be useful to stop sending Sherpa tasks to us volunteers for the moment.
I have stopped Theory for myself until there is a solution for these Sherpa tasks.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,489,945
RAC: 1,944
Message 38511 - Posted: 1 Apr 2019, 20:49:19 UTC - in response to Message 38509.  

The disk limit in Bytes is set by <rsc_disk_bound>.


Besides that, each VM has a limit on the partition size of its virtual hard disk.
This is where the logfile is written to.
I'm not sure, but isn't that size around 20 GB or so?
The rsc_disk_bound for Theory VBox tasks is 7629.3945 MB (8,000,000,000 bytes). The task should stop earlier,
but I've seen that the slot directory's total is not summed every minute and/or the size is not updated frequently by the OS.
My impression is that a sherpa job writes KBs per second to the logfiles in those circumstances.
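To illustrate the check involved, here is a rough sketch of what the client's periodic disk check amounts to (the slot path is a hypothetical example, and the real client sums the slot roughly once a minute rather than on demand):

import os

RSC_DISK_BOUND = 8_000_000_000                 # bytes, the bound quoted above
SLOT_DIR = r"C:\ProgramData\BOINC\slots\3"     # hypothetical slot path; pick the slot of your task

def slot_usage(path):
    # Walk the slot directory and add up all file sizes, as the client does periodically.
    total = 0
    for dirpath, _, files in os.walk(path):
        for f in files:
            try:
                total += os.path.getsize(os.path.join(dirpath, f))
            except OSError:
                pass                           # a file may disappear while the VM is running
    return total

used = slot_usage(SLOT_DIR)
print(f"{used / 1024**2:.1f} MB used of {RSC_DISK_BOUND / 1024**2:.1f} MB allowed")
if used > RSC_DISK_BOUND:
    print("-> the client would abort this task with EXIT_DISK_LIMIT_EXCEEDED")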

It's striking that Erich seems to be much more often a victim of the DISK_LIMIT_EXCEEDED error condition.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,489,945
RAC: 1,944
Message 38512 - Posted: 1 Apr 2019, 20:56:24 UTC - in response to Message 38510.  

Native Theory on Linux is not the problem for this kind of crash; it works well.
During testing on the dev project I had one DISK_LIMIT_EXCEEDED error with a native Theory task,
so it happens with the natives too. The difference is that the limit is set lower (~1907 MB), so a task will crash earlier.
maeax

Joined: 2 May 07
Posts: 2099
Credit: 159,815,788
RAC: 143,603
Message 38513 - Posted: 2 Apr 2019, 5:55:01 UTC - in response to Message 38512.  

Sorry Crystal,
only the Theory Sherpa tasks have a problem!
Peter Skands

Joined: 31 Jan 11
Posts: 12
Credit: 3,557,813
RAC: 0
Message 38516 - Posted: 3 Apr 2019, 1:49:02 UTC - in response to Message 38513.  

Dear all,

From the project side, this is a difficult problem. We'd like to run Sherpa jobs - Sherpa is being used in LHC studies, so it is useful for the LHC@home project to run it. But as many of you are only too well aware, it is also the source of most of the problematic jobs in the project. We are not ourselves authors of the Sherpa Monte Carlo, so we can only pass the issues along to the authors and hope that they will be addressed in future releases.

In the past, Sherpa jobs on LHC@home had significant issues with infinite loops causing jobs to run to the time limit without finishing. Since versions are never patched backwards (bug fixes only go into new releases; old ones are not patched, for reproducibility), we had hoped to address this by deprecating some of the older Sherpa versions and only running the newest releases. I think that changeover happened relatively recently - though Anton Karneyeu can probably say more precisely when and what was done; he has been working on updating the project over the past few months. However, as is clear from this thread and related ones, the new versions have issues of their own.

I will try to collate the feedback you have been providing and contact the authors about what we are seeing. I am not myself able to suggest a "hot fix" that could be applied easily to at least detect and kill the problematic jobs. The only thing I could do easily would be to remove Sherpa jobs from the request system - but then we would lose the possibility to compare to Sherpa at all, which would also not be ideal.

First of all, I'd like to understand the magnitude of the problem: is this a relatively rare occurrence that happens only for the occasional Sherpa job, or does it happen most of the time? I saw a log posted on this thread that had only one successful Sherpa job for 13 or 14 failed ones. If that is typical, we might as well give up on running Sherpa, at least for the time being. Any feedback on how prevalent this issue is - whether we are still getting a good number of Sherpa jobs finishing or hardly any - would be appreciated.

With best regards, and apologies to those who are (understandably) frustrated about their CPU time going to jobs that are not producing science!

Peter
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,752,516
RAC: 122,129
Message 38557 - Posted: 10 Apr 2019, 4:55:49 UTC - in response to Message 38516.  

... and apologies to those who are (understandably) frustrated about their CPU time going to jobs that are not producing science! (Peter)

The next one:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=220826628
It failed after 16 GB(!) of disk usage and 18 hours of processing time :-(
and yes, this creates frustration :-(
Please stop this nonsense!
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,752,516
RAC: 122,129
Message 38558 - Posted: 11 Apr 2019, 11:20:52 UTC

Now this seems to happen every day:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=220893933

Once more, almost 14 hours of CPU time for nothing :-(
More than annoying!

Could someone please stop these faulty tasks!!!
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 38559 - Posted: 11 Apr 2019, 22:01:57 UTC

Two today. I thought I'd caught one in time before it reached the limit (16 GB when I noticed, 16 hrs wasted), but it errored out when I reset the VM. It also failed to remove itself from VBox and had to be deleted manually. The other one was north of 100 MB after 3 hrs and increasing, but the VM reset OK.
Both were ee Sherpas.
While I was at it, I binned a pp Sherpa that wanted 2 days to complete, as I saw little point in allowing it to run to the time limit but not complete in that time, losing anything useful in the not-uploaded, partially complete logs.

I don't know where Crystal Pellet dug up these figures from but they make sad reading on the success rate, or lack thereof, of ee Sherpas.
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,752,516
RAC: 122,129
Message 38565 - Posted: 15 Apr 2019, 4:56:19 UTC

the next one:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=221079025

why does no one stop submitting these faulty tasks ???
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,489,945
RAC: 1,944
Message 38578 - Posted: 18 Apr 2019, 17:44:07 UTC

196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?
The reason is the massive growth of the running log file, and with it the size of the VM in the slot directory: it is writing ~2 MB every second. At the moment the vm_image.vdi is 2.1 GB.
I have a candidate here: 19 attempts - 12 failures and 7 lost.
 ===> [runRivet] Thu Apr 18 17:07:51 CEST 2019 [boinc pp jets 7000 170,-,2960 - sherpa 2.1.0 default 48000 44]

Channel_Elements::GenerateYUniform(1.04921,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds !
ymin, ymax vs. y : 0.0240183 -0.0240183 vs. -0.00922275
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.14112e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.9344e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.25106e+07
Channel_Elements::GenerateYUniform(1.06416,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds !
ymin, ymax vs. y : 0.0310934 -0.0310934 vs. 0.0126543
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.21439e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.15232e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04853e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04799e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.9716e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.95098e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.23729e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.21549e+07
Channel_Elements::GenerateYUniform(1.04776,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds !
ymin, ymax vs. y : 0.0233275 -0.0233275 vs. -0.0127388
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.13403e+07
Channel_Elements::GenerateYUniform(1.07081,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-0.0127388,}): Y out of bounds !
ymin, ymax vs. y : 0.034207 -0.034207 vs. 0.020383
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.24696e+07
Channel_Elements::GenerateYUniform(1.01653,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,0.020383,}): Y out of bounds !
ymin, ymax vs. y : 0.00819796 -0.00819796 vs. 0.00663273
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.981e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.94124e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.13743e+07
Channel_Elements::GenerateYUniform(1.0575,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-nan,}): Y out of bounds !
ymin, ymax vs. y : 0.027955 -0.027955 vs. 0.00218613
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.18176e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.04083e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.12492e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.96146e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.16881e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.98863e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.0189e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 5.07392e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.96088e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.99324e+07
ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 5.25696e+07, 4.9e+07, 4.9e+07 vs. 4.98934e+07
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,752,516
RAC: 122,129
Message 38598 - Posted: 21 Apr 2019, 16:34:48 UTC - in response to Message 38578.  

196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come?
The reason is the massive growth of the running log file, and with it the size of the VM in the slot directory: it is writing ~2 MB every second. At the moment the vm_image.vdi is 2.1 GB ...
From what I can see, normally the VM in the slot directory is 492,544 kB.

Does all this mean that if I happen to detect an image bigger than that, or notice - by watching it for a minute - that the size keeps growing, it would make sense to abort the task? Probably YES, right?
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,489,945
RAC: 1,944
Message 38600 - Posted: 21 Apr 2019, 17:58:49 UTC - in response to Message 38598.  

From what I can see, normally the VM in the slot directory is 492,544 kB.

Does all this mean that if I happen to detect an image bigger than that, or notice - by watching it for a minute - that the size keeps growing, it would make sense to abort the task? Probably YES, right?

492,544 kB is the size of the vdi-file in the project directory; once it is used in a slot, it grows during initialization and while sub-jobs are processed.
The sizes of the 2 tasks I have running atm are 988,160 kB and 1,059,840 kB.
You get better info from the console (ALT-F2) and from the running.log served on the VM's localhost port.
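If you want to automate what Erich describes above - watching for a minute and checking whether the size keeps growing - a minimal sketch along these lines could work (the slots path, interval and threshold are only illustrative assumptions):

import glob, os, time

SLOTS = r"C:\ProgramData\BOINC\slots"   # assumed default BOINC data directory; adjust as needed
INTERVAL = 60                           # seconds between the two samples
THRESHOLD_KBS = 500                     # sustained growth above this looks like a runaway log writer

def vdi_sizes():
    return {p: os.path.getsize(p)
            for p in glob.glob(os.path.join(SLOTS, "*", "vm_image.vdi"))}

before = vdi_sizes()
time.sleep(INTERVAL)
after = vdi_sizes()

for path, size in after.items():
    rate = (size - before.get(path, size)) / INTERVAL / 1024   # kB per second
    if rate > THRESHOLD_KBS:
        print(f"{path} is growing at ~{rate:.0f} kB/s - a candidate for aborting")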
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,752,516
RAC: 122,129
Message 38602 - Posted: 22 Apr 2019, 6:39:21 UTC - in response to Message 38600.  

From what I can see, normally the VM in the slot directory is 492,544 kB.

Does all this mean that if I happen to detect an image bigger than that, or notice - by watching it for a minute - that the size keeps growing, it would make sense to abort the task? Probably YES, right?

492,544 kB is the size of the vdi-file in the project directory; once it is used in a slot, it grows during initialization and while sub-jobs are processed.
The sizes of the 2 tasks I have running atm are 988,160 kB and 1,059,840 kB.
You get better info from the console (ALT-F2) and from the running.log served on the VM's localhost port.

The interesting thing is that on my "main" PC, on which I am running 6 Theory tasks simultaneously, the vdi-files in the slot directories show 492,544 kB all the time, from beginning to end.
On all other PCs the vdi-files in the slot directories grow continuously, as you explain above.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,489,945
RAC: 1,944
Message 38603 - Posted: 22 Apr 2019, 7:46:19 UTC - in response to Message 38602.  

The interesting thing is that on my "main" PC, on which I am running 6 Theory tasks simultaneously, the vdi-files in the slot directories show 492,544 kB all the time, from beginning to end.
On all other PCs the vdi-files in the slot directories grow continuously, as you explain above.

What ID does your 'main' PC have?
What happens to the shown size when you refresh the directory you're watching with function key F5?
What is the size of the vdi-file when you look at the file properties?
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,752,516
RAC: 122,129
Message 38609 - Posted: 23 Apr 2019, 4:53:00 UTC - in response to Message 38603.  

ID: 10555784

When updating the slot directory with key F6, nothing happens - the size shown stays the same (492,544 kB).

When going to file properties, the correct file size seems to be shown. And after that, back in the slot directory, the correct file size shows up there as well.

Strange behaviour of Windows Explorer :-(
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,752,516
RAC: 122,129
Message 38610 - Posted: 23 Apr 2019, 4:55:30 UTC

the next one:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=221535924

It failed after nearly 15 hours, with 18,323 MB of disk usage.

Why does no one stop the submission of these Sherpa tasks ??? I am sick of wasting my CPU for nothing.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,489,945
RAC: 1,944
Message 38616 - Posted: 23 Apr 2019, 16:10:27 UTC

Another candidate for exceeding the disk limit. This time on LHC-dev, but coming from the same pool.

pp jets 8000 250,-,4160 - sherpa 1.4.3 default

25 attempts: 0 successes, 14 failures and 11 lost.

The vm_image.vdi is growing; meanwhile it is at 3,379,200 kB.

Aborted by BOINC:
lhcathome-dev 23 Apr 18:05:52 Aborting task Theory_2279-790029-48_1: exceeded disk limit: 3300.38MB > 1907.35MB

As you can see, the limit for tasks from the dev system is much lower, so the task is aborted earlier.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2770803
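A side note on the numbers: the client reports these limits in MiB (labelled "MB"), so the 1907.35 MB above would correspond to an rsc_disk_bound of 2,000,000,000 bytes on dev (an assumption on my part), just as the 7629.39 MB on production corresponds to 8,000,000,000 bytes:

for name, bound in [("production", 8_000_000_000), ("dev (assumed)", 2_000_000_000)]:
    print(f"{name}: {bound:,} bytes = {bound / 1024**2:.2f} MB")
# production: 8,000,000,000 bytes = 7629.39 MB
# dev (assumed): 2,000,000,000 bytes = 1907.35 MB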
Erich56

Joined: 18 Dec 15
Posts: 1688
Credit: 103,752,516
RAC: 122,129
Message 38619 - Posted: 24 Apr 2019, 3:23:51 UTC

The next one:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=221643489

failed after 13 1/2 hours :-(

When will this nonsense finally be stopped ???
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1127
Credit: 49,750,167
RAC: 9,606
Message 38630 - Posted: 25 Apr 2019, 6:38:30 UTC
