Message boards : Number crunching : fubar host of the day



vseven

Send message
Joined: 22 Jan 18
Posts: 32
Credit: 2,756,359
RAC: 0
Message 36403 - Posted: 14 Aug 2018, 0:27:56 UTC

Side note - just had 2 of 3 8 CPU tasks abort due to "Aborted: task disk limit exceeded". Again...all limits are unchecked and I have plenty of free space. Waste of 5+ hours on 16 cores...
ID: 36403 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36404 - Posted: 14 Aug 2018, 2:15:04 UTC - in response to Message 36403.  

Side note - just had 2 of 3 8 CPU tasks abort due to "Aborted: task disk limit exceeded". Again...all limits are unchecked and I have plenty of free space. Waste of 5+ hours on 16 cores...

Again... that error (Aborted: task disk limit exceeded) has nothing to do with the disk limits you set in your prefs. See Harri Liljeroos's explanation upthread.
ID: 36404 · Report as offensive     Reply Quote
vseven

Send message
Joined: 22 Jan 18
Posts: 32
Credit: 2,756,359
RAC: 0
Message 36408 - Posted: 14 Aug 2018, 10:56:28 UTC - in response to Message 36404.  

Side note - just had 2 of 3 8 CPU tasks abort due to "Aborted: task disk limit exceeded". Again...all limits are unchecked and I have plenty of free space. Waste of 5+ hours on 16 cores...

Again... that error (Aborted: task disk limit exceeded) has nothing to do with the disk limits you set in your prefs. See Harri Liljeroos's explanation upthread.


I understand that. But can it be fixed so I'm not wasting my computer on something that will fail?
ID: 36408 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 404
Credit: 86,848,055
RAC: 94,975
Message 36409 - Posted: 14 Aug 2018, 11:58:11 UTC - in response to Message 36371.  
Last modified: 14 Aug 2018, 11:58:29 UTC

The "Exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED " comes from a situation where the disk usage of a slot directory where the task is running exceeds the limits set in init_data.xml <rsc_disk_bound>xxx</rsc_disk_bound>. Either the value was set too low for the task by the server when task was created or for some reason there were some extra files in the slot directory left behind from previous tasks or extra files were created during the task calculation which bloated the disk consumption over the <rsc_disk_bound> value. This is not related to any user setting of allowed disk usage for Boinc.

Okay, I have re-read this statement and I wonder why this happens only to vseven. This should affect ALL Atlas-Users, but nothing has happened so far.

So, the final question is: what is different on vseven's machine?


Supporting BOINC, a great concept !
ID: 36409 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36410 - Posted: 14 Aug 2018, 13:22:13 UTC - in response to Message 36409.  

@ vseven and Yeti
This should affect ALL Atlas-Users, but nothing has happened so far.
Yes, all users but only if the server is sending tasks with too-small disk usage estimates. Harri suggests 2 causes for the problem, the server being the root of only the first cause.

So, the final question is: what is different on vseven's machine?

Maybe it's the second cause mentioned by Harri... extra files in the slot folder. I don't know how or why extra files might be there but that doesn't mean they cannot be there. Maybe left over from a previous task? Maybe not deleted due to a permissions problem?
In his second post in this thread Harri mentions Googling for that error turns up links to discussions of that error on this message board. Maybe further clues can be gleaned from those discussions?
ID: 36410 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Avatar

Send message
Joined: 2 Sep 04
Posts: 404
Credit: 86,848,055
RAC: 94,975
Message 36411 - Posted: 14 Aug 2018, 13:27:43 UTC - in response to Message 36410.  

Maybe it's the second cause mentioned by Harri... extra files in the slot folder. I don't know how or why extra files might be there but that doesn't mean they cannot be there. Maybe left over from a previous task? Maybe not deleted due to a permissions problem?

Hm, maybe an uninstall and re-install wouldn't have deleted / cleaned the slots folder ...


Supporting BOINC, a great concept !
ID: 36411 · Report as offensive     Reply Quote
vseven

Send message
Joined: 22 Jan 18
Posts: 32
Credit: 2,756,359
RAC: 0
Message 36412 - Posted: 14 Aug 2018, 13:49:19 UTC - in response to Message 36371.  
Last modified: 14 Aug 2018, 14:03:01 UTC


Hm, maybe an uninstall and re-install wouldn't have deleted / cleaned the slots folder ...


I didn't clean up the BOINC data directory when I uninstalled but I would think with 0 tasks on the machine those slots folders wouldn't be there anyway. I can do a "no new tasks" and once everything is finished manually delete them but I don't think that's the issue. Here is what I can tell you. Two of the 8 CPU tasks threw this error and the third task finished successfully. I switched back to 4 CPU tasks which are all finishing successfully and producing 140 - 150Mb HITS files so I know those are working.

- If I look at the stats on one of the 4 CPU tasks it says "Peak disk usage 2,029.85 MB" which is nothing (on this machine)
- If I look at the valid 8 CPU task it says "Peak disk usage 6,695.35 MB" which is a big difference but still barely anything.
- If I look at one of the invalid 8 CPU tasks that failed due to the disk space error it says "Peak disk usage 9,817.26 MB".


The "Exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED " comes from a situation where the disk usage of a slot directory where the task is running exceeds the limits set in init_data.xml <rsc_disk_bound>xxx</rsc_disk_bound>. Either the value was set too low for the task by the server when task was created or for some reason there were some extra files in the slot directory left behind from previous tasks or extra files were created during the task calculation which bloated the disk consumption over the <rsc_disk_bound> value. This is not related to any user setting of allowed disk usage for Boinc.


So assuming there were no extra files (why would there be?) is there a way to find out what the server set the disk usage limit to AFTER the task is off my host? I looked at the slot directory the two invalids ran from and both are empty....assuming it cleaned them up afterward. I looked at a couple other 8 CPU tasks that are waiting to run and this is what is set in all of them:

<rsc_disk_bound>8000000000.000000</rsc_disk_bound>

Is this 8Gb? If so could this have caused them to abort (assuming all 8 CPU tasks are set the same and my task was above that)?
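
For anyone who wants to check this on their own host, here is a minimal Python sketch. It pulls the <rsc_disk_bound> value out of the text of an init_data.xml file and compares it with the current size of a directory. The function names and the regex-based parsing are assumptions made for illustration and are not part of BOINC. Only the tag name and the sample value come from this thread.

```python
import os
import re

# Matches e.g. <rsc_disk_bound>8000000000.000000</rsc_disk_bound>
_BOUND_RE = re.compile(r"<rsc_disk_bound>\s*([0-9.eE+-]+)\s*</rsc_disk_bound>")


def read_disk_bound(init_data_text):
    """Return the rsc_disk_bound value in bytes, or None if the tag is absent."""
    match = _BOUND_RE.search(init_data_text)
    return float(match.group(1)) if match else None


def directory_size(path):
    """Sum the sizes (in bytes) of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def headroom(init_data_text, slot_dir):
    """Bytes left before slot_dir would exceed the bound (negative if over)."""
    bound = read_disk_bound(init_data_text)
    if bound is None:
        raise ValueError("no <rsc_disk_bound> tag found")
    return bound - directory_size(slot_dir)
```

To use it, read the init_data.xml of a waiting or running task into a string and pass it to headroom() together with that task's slot directory. The slot path depends on your BOINC data directory, so it is left for you to fill in. This only works while the task is still on the host, since the slot is emptied afterward.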
ID: 36412 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36413 - Posted: 14 Aug 2018, 13:57:03 UTC - in response to Message 36411.  

Right. He might attempt a manual cleaning to see if that makes the error go away. Not suggesting manual cleanup for every task but just to test the hypothesis that there are orphaned files in the slot folder.

Or compare file creation/access datetimes to task datetimes. If a file seems to have been created before the currently running task started, then it is probably an orphan from a previous task, because tasks normally do not create files in the past.

Or compare actual filenames left in the slot folder after a task with a list of filenames that are normally left there.

Or check stderr output to see if the current task's slot folder cleanup routine threw an error like "can't clean the slot folder"
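
The datetime comparison could be sketched roughly like this in Python. The function name and the choice of modification time as the timestamp are assumptions for illustration only. A hit is a hint rather than proof, since input files copied into the slot may keep an older timestamp.

```python
import os


def files_older_than(slot_dir, task_start_epoch):
    """Return (path, size, mtime) for files last modified before the task started."""
    suspects = []
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            full = os.path.join(root, name)
            info = os.stat(full)
            if info.st_mtime < task_start_epoch:
                suspects.append((full, info.st_size, info.st_mtime))
    # Largest files first, since they matter most for the disk limit.
    return sorted(suspects, key=lambda item: item[1], reverse=True)
```

Pass the slot folder and the task's start time as a Unix timestamp. Anything large near the top of the returned list is worth a closer look.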
ID: 36413 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36414 - Posted: 14 Aug 2018, 14:36:02 UTC - in response to Message 36412.  

Two of the 8 CPU tasks threw this error and the third task finished successfully. I switched back to 4 CPU tasks which are all finishing successfully and producing 140 - 150Mb HITS files so I know those are working.
Interesting that 8 CPU tasks fail whereas 4 CPU tasks don't. Maybe 8 CPU tasks create bigger output files than 4 CPU tasks? Perhaps the server needs to issue a bigger <rsc_disk_bound> value?

<rsc_disk_bound>8000000000.000000</rsc_disk_bound>

Is this 8Gb? If so could this have caused them to abort (assuming all 8 CPU tasks are set the same and my task was above that)?
Yes, it is. BOINC normally gives all sizes in bytes unless otherwise specified by trailing units. So divide that number by 1024^3 and you get 7.45GB. Note the disk usage for the valid 8 CPU tasks was just under that limit while usage for the invalid was over that limit.
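
The arithmetic can be checked with a few lines of Python. The peak usage figures are the ones vseven posted. Whether the task page's "MB" means 1024^2 bytes or 10^6 bytes is an assumption on my part, so the helper accepts either, and the conclusion is the same both ways.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# <rsc_disk_bound> value quoted in this thread, in bytes
bound_bytes = 8_000_000_000

# Peak disk usage figures from the task pages, in MB as displayed
peak_mb = {
    "4 CPU valid": 2029.85,
    "8 CPU valid": 6695.35,
    "8 CPU invalid": 9817.26,
}


def to_gib(num_bytes):
    """Convert a byte count to GiB (1024^3 bytes)."""
    return num_bytes / GIB


def exceeds_bound(peak_mb_value, bound=bound_bytes, bytes_per_mb=MIB):
    """True if the peak usage, converted to bytes, is above the bound."""
    return peak_mb_value * bytes_per_mb > bound


print(round(to_gib(bound_bytes), 2))
for label, value in peak_mb.items():
    print(label, "over limit" if exceeds_bound(value) else "under limit")
```

Only the invalid 8 CPU task comes out over the limit, which matches the error.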
ID: 36414 · Report as offensive     Reply Quote
vseven

Send message
Joined: 22 Jan 18
Posts: 32
Credit: 2,756,359
RAC: 0
Message 36415 - Posted: 14 Aug 2018, 14:56:06 UTC - in response to Message 36414.  
Last modified: 14 Aug 2018, 14:58:23 UTC

Note the disk usage for the valid 8 CPU tasks was just under that limit while usage for the invalid was over that limit.


That's exactly what I'm saying. And it sounds like that can only be increased on the server side. Which makes me wonder if I'm the only one having this issue or not. Or maybe not that many people are crunching multiple 8 CPU WUs so it's not very noticeable.

I do know that one of my 8 CPU WUs that just downloaded was 380MB while most are right under 350MB. I'm assuming (maybe incorrectly) that if the file is bigger then the task will use more space. Which makes me think that 380MB download is going to end up failing with the same error. If it does I'm switching to 4 CPU tasks and not doing any more 8 CPU tasks...especially since it won't abort until it's already wasted 6+ hours of 8 cores.
ID: 36415 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 739
Credit: 6,027,121
RAC: 1,035
Message 36416 - Posted: 14 Aug 2018, 15:00:16 UTC - in response to Message 36414.  

Yes, it is. BOINC normally gives all sizes in bytes unless otherwise specified by trailing units. So divide that number by 1024^3 and you get 7.45GB. Note the disk usage for the valid 8 CPU tasks was just under that limit while usage for the invalid was over that limit.

During the run the virtual machine image file in the slot folder will grow from an initial ~1.63GB.
When suspending a VM-task with "Leave application in memory" unticked, a snapshot is saved into the slot directory.
After resuming the task the VM is restored and the snapshot is deleted.
This snapshot can easily be several GB (more than 3GB).
So maybe the snapshot was even bigger, or an older snapshot was not cleaned up properly after resuming the task.
ID: 36416 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36417 - Posted: 14 Aug 2018, 15:50:18 UTC - in response to Message 36415.  

Or maybe not that many people are crunching multiple 8 CPU WUs so it's not very noticeable.
There may not be that many now but I think as more and more volunteers discover they can shoehorn lots of CPUs into limited RAM this issue will pop up with ever greater frequency.

I do know that one of my 8 CPU WUs that just downloaded was 380MB while most are right under 350MB. I'm assuming (maybe incorrectly) that if the file is bigger then the task will use more space. Which makes me think that 380MB download is going to end up failing with the same error. If it does I'm switching to 4 CPU tasks and not doing any more 8 CPU tasks...especially since it won't abort until it's already wasted 6+ hours of 8 cores.
The advice given in these forums is that if you have enough RAM then you should not run 8 CPU tasks.
ID: 36417 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36418 - Posted: 14 Aug 2018, 16:13:01 UTC - in response to Message 36416.  

During the run the virtual machine image file in the slot folder will grow from an initial ~1.63GB.
When suspending a VM-task with "Leave application in memory" unticked, a snapshot is saved into the slot directory.
After resuming the task the VM is restored and the snapshot is deleted.
This snapshot can easily be several GB (more than 3GB).
So maybe the snapshot was even bigger, or an older snapshot was not cleaned up properly after resuming the task.


Aha! That answers vseven's "how could there be?" question. And it raises more questions (or quandaries)...
1) Should we then tick "leave application in memory"? Seems that would fail to release RAM that is presumably needed for a task that could be resuming. Yes, the snapshot would get pushed off into virtual RAM but that takes time and resources and might push a nearly overloaded system over the edge. What to do, what to do? A kludgy script to babysit the whole mess and keep all the ducks in a row? But kludgy scripts eat up resources too.
2) Hmmm. I had a second question but I forgot it while typing the first question.
ID: 36418 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 739
Credit: 6,027,121
RAC: 1,035
Message 36419 - Posted: 14 Aug 2018, 16:33:56 UTC - in response to Message 36418.  

Aha! That answers vseven's "how could there be?" question. And it raises more questions (or quandaries)...

Best practice for VM-tasks is to run them in one go without suspending.
To avoid switching between BOINC tasks (even those of other projects), set "switching between tasks" in BOINC very high (days).
Then only high-priority (panic) jobs will suspend a VM. For VMs I prefer not to keep tasks in memory, so the VM will be saved to disk by creating a snapshot when suspended.
ID: 36419 · Report as offensive     Reply Quote
vseven

Send message
Joined: 22 Jan 18
Posts: 32
Credit: 2,756,359
RAC: 0
Message 36420 - Posted: 14 Aug 2018, 18:44:19 UTC - in response to Message 36419.  

Aha! That answers vseven's "how could there be?" question. And it raises more questions (or quandaries)...

Best practice for VM-tasks is to run them in one go without suspending.
To avoid switching between BOINC tasks (even those of other projects), set "switching between tasks" in BOINC very high (days).
Then only high-priority (panic) jobs will suspend a VM. For VMs I prefer not to keep tasks in memory, so the VM will be saved to disk by creating a snapshot when suspended.


Yeah....when this was running it was the only project so it didn't switch tasks. I.e. the tasks should have been running straight through. I didn't have the "Leave non-GPU tasks in memory while suspended" checked but since this machine has 64GB of RAM I went ahead and checked that just in case. I also changed the switching to 1200 minutes even though as mentioned I don't think tasks switched.
ID: 36420 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36421 - Posted: 14 Aug 2018, 19:24:34 UTC - in response to Message 36420.  

the tasks should have been running straight through
.
.
.
I don't think tasks switched.

Is this the task... https://lhcathome.cern.ch/lhcathome/result.php?resultid=204837218? If so then looking through the stderr output I see several references to stopping and starting the VM. Not sure if those stop-starts would cause a snapshot to be saved.

Also these lines look rather odd:
2018-08-13 13:29:29 (912): Guest Log: CCooppyyiinngg  iinnppuut tf iflielse si nitnot oR uRnuAntAltalsa.s.
2018-08-13 13:29:29 (912): Guest Log: CCooppyyiinngg  iinnppuut tf iflielse si nitnot oR uRnuAntAltalsa.s.
2018-08-13 13:29:29 (912): Guest Log: Copied input files into RunAtlas.
2018-08-13 13:29:40 (912): Guest Log: Copied input files into RunAtlas.
2018-08-13 13:30:53 (912): Guest Log: copcopied the webapp to /var/www
2018-08-13 13:30:53 (912): Guest Log: ied the webapp to /var/www
2018-08-13 13:31:04 (912): Guest Log: TThhisi svm v md odeose sn otn onte ende etdo  steot uspe thutptp  hptrtopx yp
2018-08-13 13:31:04 (912): Guest Log: xy
2018-08-13 13:31:04 (912): Guest Log: AATHTEHNEAN_AP_RPORCO_CN_UNMUBMEBRE=R8
2018-08-13 13:31:04 (912): Guest Log: 8
2018-08-13 13:31:04 (912): Guest Log: SStarttianrgt iAnTgL AAS TjLoAbS.  j(oPba.nd a(IPDa=n4d0a2I4D64=64804294 t6a4s6k8I4D9=1 4t8a6s7k2I7D3=)1


I have some theories as to why those lines are garbled but you won't like them :)
ID: 36421 · Report as offensive     Reply Quote
vseven

Send message
Joined: 22 Jan 18
Posts: 32
Credit: 2,756,359
RAC: 0
Message 36422 - Posted: 14 Aug 2018, 19:40:43 UTC - in response to Message 36421.  
Last modified: 14 Aug 2018, 19:44:47 UTC

I didn't even notice that. It's only that section (for about 2 minutes), all the rest is fine. What does that mean?

And yeah, if it stopped and started then that would easily be why it went over disk space. I do have the option set to suspend work if CPU usage goes above xx so it's possible that happened. Shouldn't have been anything switching tasks though. Not sure why that hard limit is even in there with today's technology and abundant disk space.
ID: 36422 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36423 - Posted: 14 Aug 2018, 20:00:43 UTC - in response to Message 36422.  

It might mean a cheater script is attempting to alter the stderr output file? Like it's parsing lines in order to sub cheater info back into the lines but concatenating the strings incorrectly resulting in the garbled lines. Just a theory. Maybe a cheater script you don't even know is on your system... maybe something somebody else snuck onto your system without your knowledge? And there are probably other theories that make just as much sense and don't involve cheating.

Anyway those garbled lines are a separate issue from the stop-starts, I think. You said the task shouldn't have paused but it seems it did. Or maybe the garbled lines and the stop-starts are all part of the same issue, I don't know, I'm no VBox expert. So with that I'll bow out for a while and make some room here for the experts, the ones with the experience I don't have, just a curious noob here.
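
As one example of a theory that doesn't involve cheating: the doubled characters are what you would get if two writers put the same message into the same log one character at a time without coordinating. This is only an illustration, assuming two such writers exist; it doesn't prove that is what happened on vseven's host.

```python
from itertools import zip_longest


def interleave(first, second):
    """Merge two strings one character at a time, as two unsynchronised writers might."""
    return "".join(a + b for a, b in zip_longest(first, second, fillvalue=""))


message = "Copying"
print(interleave(message, message))  # prints "CCooppyyiinngg"
```

The output matches the start of the first garbled line. The later parts of those lines are messier, which would fit writers that drift slightly out of step.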
ID: 36423 · Report as offensive     Reply Quote
vseven

Send message
Joined: 22 Jan 18
Posts: 32
Credit: 2,756,359
RAC: 0
Message 36424 - Posted: 14 Aug 2018, 20:09:24 UTC - in response to Message 36423.  

I'm 100% positive there isn't a script running, nor has anyone had access to the server. It's an old HyperV host, recently formatted by me when I put in new SSDs for the OS drive. It literally sits and does nothing all day other than take some copies of VM hard drives on weekends (for disaster recovery; just the HD images, HyperV isn't installed).
ID: 36424 · Report as offensive     Reply Quote
bronco

Send message
Joined: 13 Apr 18
Posts: 443
Credit: 8,196,738
RAC: 10,577
Message 36425 - Posted: 14 Aug 2018, 20:13:15 UTC - in response to Message 36422.  

Seems you added to your post while I was responding, NP

I do have the option set to suspend work if CPU usage goes above xx so it's possible that happened.
I'm thinking "Bingo"
Not sure why that hard limit is even in there with today's technology and abundant disk space.
It's a failsafe protection, a redundant protection that protects things when other protections fail. And it helps protect BOINC's reputation and reduce accusations that BOINC is malware that gobbles up all your disk space. Reputation is everything in the software business. Better to be a little overzealous in that regard.
ID: 36425 · Report as offensive     Reply Quote


©2019 CERN