1) Message boards : Number crunching : fubar host of the day (Message 36424)
Posted 14 Aug 2018 by vseven
Post:
I'm 100% positive there isn't a script running nor has anyone had access to the server. It's an old Hyper-V host, recently formatted by me when I put in new SSDs for the OS drive. It literally sits and does nothing all day other than take some copies of VM hard drives on weekends (for disaster recovery; just the HD images, Hyper-V isn't installed).
2) Message boards : Number crunching : fubar host of the day (Message 36422)
Posted 14 Aug 2018 by vseven
Post:
I didn't even notice that. It's only that section (for about 2 minutes); all the rest is fine. What does that mean?

And yeah, if it stopped and started then that would easily be why it went over disk space. I do have the option set to suspend work if CPU usage goes above xx, so it's possible that happened. There shouldn't have been anything switching tasks though. Not sure why that hard limit is even in there with today's technology and abundant disk space.
3) Message boards : Number crunching : fubar host of the day (Message 36420)
Posted 14 Aug 2018 by vseven
Post:
Aha! That answers vseven's "how could there be?" question. And it raises more questions (or quandaries)...

Best practice for VM tasks is to run them in one go without suspending.
To avoid switching between BOINC tasks (even those of other projects), set "switching between tasks" in BOINC very high (days).
Only then will high-priority jobs (panic) suspend a VM. For VMs I prefer not to keep tasks in memory, so the VM will be saved to disk by creating a snapshot when suspended.


Yeah....when this was running it was the only project, so it didn't switch tasks. I.e. the tasks should have been running straight through. I didn't have "Leave non-GPU tasks in memory while suspended" checked, but since this machine has 64 GB of RAM I went ahead and checked that just in case. I also changed the switching to 1200 minutes even though, as mentioned, I don't think tasks switched.
4) Message boards : Number crunching : fubar host of the day (Message 36415)
Posted 14 Aug 2018 by vseven
Post:
Note the disk usage for the valid 8 CPU tasks was just under that limit while usage for the invalid was over that limit.


That's exactly what I'm saying. And it sounds like that can only be increased on the server side. Which makes me wonder if I'm the only one having this issue or not. Or maybe not that many people are crunching multiple 8 CPU WUs, so it's not very noticeable.

I do know that one of my 8 CPU WUs that just downloaded was 380 MB while most are right under 350 MB. I'm assuming (maybe incorrectly) that if the file is bigger then the task will use more space. Which makes me think that 380 MB download is going to end up failing with the same error. If it does I'm switching to 4 CPU tasks and not doing any more 8 CPU tasks...especially since it won't abort until it's already wasted 6+ hours of 8 cores.
5) Message boards : Number crunching : fubar host of the day (Message 36412)
Posted 14 Aug 2018 by vseven
Post:

Hm, maybe an uninstall and re-install wouldn't have deleted / cleaned the slots folder ...


I didn't clean up the BOINC data directory when I uninstalled, but I would think with 0 tasks on the machine those slots folders wouldn't be there anyway. I can do a "no new tasks" and once everything is finished manually delete them, but I don't think that's the issue. Here is what I can tell you. Two of the 8 CPU tasks threw this error and the third task finished successfully. I switched back to 4 CPU tasks, which are all finishing successfully and producing 140 - 150 MB HITS files, so I know those are working.

- If I look at the stats on one of the 4 CPU tasks it says "Peak disk usage 2,029.85 MB" which is nothing (on this machine)
- If I look at the valid 8 CPU task it says "Peak disk usage 6,695.35 MB" which is a big difference but still barely anything.
- If I look at one of the invalid 8 CPU tasks that failed due to the disk space error it says "Peak disk usage 9,817.26 MB".


The "Exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED" comes from a situation where the disk usage of the slot directory where the task is running exceeds the limit set in init_data.xml <rsc_disk_bound>xxx</rsc_disk_bound>. Either the value was set too low for the task by the server when the task was created, or for some reason there were extra files in the slot directory left behind from previous tasks, or extra files were created during the task calculation which bloated the disk consumption over the <rsc_disk_bound> value. This is not related to any user setting of allowed disk usage for BOINC.


So assuming there were no extra files (why would there be?) is there a way to find out what the server set the disk usage limit to AFTER the task is off my host? I looked at the slot directory the two invalids ran from and both are empty....assuming it cleaned them up afterward. I looked at a couple other 8 CPU tasks that are waiting to run and this is what is set in all of them:

<rsc_disk_bound>8000000000.000000</rsc_disk_bound>

Is this 8 GB? If so, could this have caused them to abort (assuming all 8 CPU tasks are set the same and my task was above that)?
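
As a rough sanity check on that question, here is a minimal Python sketch that pulls the <rsc_disk_bound> value out of a snippet of init_data.xml and compares it with the peak disk usage figures quoted above. It assumes the value is in bytes and that the "MB" shown on the task page means 1024 x 1024 bytes; neither is confirmed in this thread, and the function name is just illustrative.

```python
import re

MIB = 1024 * 1024  # assumption: the task page's "MB" is 1024 * 1024 bytes


def disk_bound_bytes(xml_text):
    """Return the <rsc_disk_bound> value (assumed bytes), or None if the tag is absent."""
    match = re.search(
        r"<rsc_disk_bound>\s*([0-9.eE+-]+)\s*</rsc_disk_bound>", xml_text
    )
    return float(match.group(1)) if match else None


# The line found in the waiting 8 CPU tasks.
sample = "<rsc_disk_bound>8000000000.000000</rsc_disk_bound>"
limit_mib = disk_bound_bytes(sample) / MIB

# Peak disk usage figures (MB) as reported on the task pages.
peaks = {"4 CPU valid": 2029.85, "8 CPU valid": 6695.35, "8 CPU invalid": 9817.26}
for label, peak in peaks.items():
    verdict = "over" if peak > limit_mib else "under"
    print(f"{label}: {peak:,.2f} MB is {verdict} the {limit_mib:,.2f} MB limit")
```

Under those assumptions 8000000000 bytes is about 7,629 MB (or exactly 8,000 MB if decimal megabytes are meant). Either way the valid 8 CPU task sits under the limit and the invalid one sits over it.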
6) Message boards : Number crunching : fubar host of the day (Message 36408)
Posted 14 Aug 2018 by vseven
Post:
Side note - just had 2 of 3 8 CPU tasks abort due to "Aborted: task disk limit exceeded". Again...all limits are unchecked and I have plenty of free space. Waste of 5+ hours on 16 cores...

Again... that error (Aborted: task disk limit exceeded) has nothing to do with the disk limits you set in your prefs. See Harri Liljeroos's explanation upthread.


I understand that. But can it be fixed so I'm not wasting my computer on something that will fail?
7) Message boards : Number crunching : fubar host of the day (Message 36403)
Posted 14 Aug 2018 by vseven
Post:
Side note - just had 2 of 3 8 CPU tasks abort due to "Aborted: task disk limit exceeded". Again...all limits are unchecked and I have plenty of free space. Waste of 5+ hours on 16 cores...
8) Message boards : Number crunching : fubar host of the day (Message 36397)
Posted 13 Aug 2018 by vseven
Post:
And something similar this way could make the difference you saw here


Sure. But why would it only affect Atlas tasks and no other projects? I'm pretty positive VBox would know the difference since it's kept up to date.....would it be in the older VBox wrapper this project uses?

I completed another 4 CPU task without errors (around 9 hours) and with another 160 MB HITS file, so my host seems to be happy. Still going to try an 8 CPU and a 2 CPU task to make sure those both work.
9) Message boards : Number crunching : fubar host of the day (Message 36387)
Posted 13 Aug 2018 by vseven
Post:
Shame he hasn't been able to get tasks to work properly.


So this test task, after telling BOINC there were no hard drive space limits at all by unchecking all three options, was a success (marked valid) and appears to have created a HITS file. It took just under 7 hours using 4 CPUs.

I don't know what the difference is between telling BOINC it can use 100 GB and not to exceed 90% when I have 100+ GB free, and telling it there are no limits, but this time it didn't give the disk space error. Going to run some more 4 CPU tasks, then re-try some 8 CPU tasks and see what happens.
10) Message boards : Number crunching : fubar host of the day (Message 36380)
Posted 12 Aug 2018 by vseven
Post:
Currently it is set to use no more than 100 gigs of disk space, leave at least 10 gigs free, and use no more than 90%. The drive is a 200 GB SSD which currently only has 50 gigs used. Surely an Atlas task can't use 100 gigs of space, can it?

This should be okay.

Does the client show the same figures as the web? Perhaps you set a local profile in the past?


Yes, same settings on client. I unchecked all three anyway so it has no limits, removed the project completely, verified the project directory was gone, and re-added the project. It's downloading the VDI and a 4 CPU task right now....I'll know in 5 - 6 hours if it worked (maybe more if it doesn't error out).

This is one from vseven from an earlier message.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=204064087
Is this your computer?


Yup, same computer.
11) Message boards : Number crunching : fubar host of the day (Message 36377)
Posted 12 Aug 2018 by vseven
Post:
So is there anything I can do other than abandon this project on this host?

Sure!
Go to your preferences and check / adjust the following settings:



Currently it is set to use no more than 100 gigs of disk space, leave at least 10 gigs free, and use no more than 90%. The drive is a 200 GB SSD which currently only has 50 gigs used. Surely an Atlas task can't use 100 gigs of space, can it?
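
To put numbers on that, here is a simplified Python sketch of how three limits like these could combine into a single allowance. This is an assumed model for illustration only, not BOINC's actual code; the function and parameter names are hypothetical, and BOINC's own current disk usage is taken as 0 for simplicity.

```python
def disk_allowance_gb(total_gb, free_gb, boinc_used_gb,
                      max_used_gb, min_free_gb, max_used_pct):
    """Return the tightest of three disk limits, in GB (simplified, assumed model)."""
    limits = [
        max_used_gb,                            # "use no more than N GB"
        total_gb * max_used_pct / 100,          # "use no more than P% of the disk"
        free_gb + boinc_used_gb - min_free_gb,  # "leave at least M GB free"
    ]
    return min(limits)


# Figures from the post: 200 GB drive with 50 GB used, so 150 GB free.
allowance = disk_allowance_gb(
    total_gb=200, free_gb=150, boinc_used_gb=0,
    max_used_gb=100, min_free_gb=10, max_used_pct=90,
)
print(f"Effective allowance: {allowance:.0f} GB")
```

With those figures the binding limit would be the 100 GB cap; the other two work out to 180 GB and 140 GB.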
12) Message boards : Number crunching : fubar host of the day (Message 36375)
Posted 12 Aug 2018 by vseven
Post:
And did you ever go through this checklist?


Yes...multiple times. Even tried running just 1 WU by itself. Same results.

If you refuse to expose your hosts then there is no way anybody can help. I believe you're here just to yank chains.

I did expose my hosts, for a week, and one person looked at things but didn't know what was wrong. Every reply you've given me has been negative and not helpful. I'm trying to figure out what is wrong and all you do is think I'm trying to cheat. Which doesn't make any sense. Please stop replying to my posts if you don't have anything helpful to say.

It is too bad that in this case the error makes solving the problem in hand even more difficult than normal.


So is there anything I can do other than abandon this project on this host? Can I manually put a setting in the app config that tells it to use more disk space? The server has 100+ gigs of free space so it's definitely not a resource issue. I ended up doing four 4-core WUs and all ended with the same error.
13) Message boards : Number crunching : fubar host of the day (Message 36367)
Posted 10 Aug 2018 by vseven
Post:
Here is something fun. I couldn't get a Cosmology VBox WU all weekend and into this week (0 WU available), so I crunched some other stuff including Universe@Home, SRBase, Yafu, YoYo, and some massive Citizen Science Grid WUs (3+ days each). Every task was successful and validated. So I finished up all tasks, uninstalled VBox, uninstalled BOINC, rebooted, and reinstalled a fresh copy of the latest BOINC (7.12.1 + VBox 5.2.8). Added LHC again and got the exact same results with an 8 CPU WU...it finished in ~600 seconds as invalid. So I switched my preferences to 4 CPU max and it downloaded some of those. This time around it took what I consider a better time, a little over 3 hours. However, the couple I did all gave an "error while computing" and actual errors in the stderr output. Here is one of them:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=204064087

Can anyone interpret this as anything?
14) Message boards : Number crunching : fubar host of the day (Message 36253)
Posted 3 Aug 2018 by vseven
Post:
Atlas uses VBox, have you run any other projects that use it?


Since you said this I started thinking back, and I did run Cosmology@Home, which uses VBox. So I attached to that project again, it downloaded its VDI files and wrapper, and it grabbed a bunch of 24 CPU tasks. Gonna let those run over the weekend and see what happens.
15) Message boards : Number crunching : fubar host of the day (Message 36245)
Posted 3 Aug 2018 by vseven
Post:
No, if the other is working fine as a service install it won't be that.

Atlas uses VBox, have you run any other projects that use it?


No. And see my edit above.

Also just some testing to make sure it's not my host. I downloaded some yafu project tasks, 24 CPU WUs (YAFU for small composites 134.05). It crunched using all 24 threads, finished, and validated successfully. I know it's apples to oranges because of VB, but there is nothing fundamentally wrong with my machine itself.
16) Message boards : Number crunching : fubar host of the day (Message 36242)
Posted 3 Aug 2018 by vseven
Post:
Is BOINC a service install on the other one that works?


Yup. I mean I can uninstall that too and reinstall not as a service, but this is the only project I'm having an issue with so I wouldn't imagine that's it.

Edit: Here is another work unit that I just did that's listed as invalid: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=99773436 Now look at the other computer that also crunched it. It's also invalid, and it also has about the same time on it. This can't be an issue with just my host...
17) Message boards : Number crunching : fubar host of the day (Message 36239)
Posted 3 Aug 2018 by vseven
Post:
I just did another "Reset Project" on it. It's downloading everything fresh again and I'll see if that makes a difference: https://imgur.com/PNvB9sD

If it doesn't then I don't know what else to say other than the code is buggy. Unless I need to switch to older versions of VB and BOINC.

Edit: Surprise surprise....exact same results. Example WU: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=99747354 I like that this one has "Error while computing" on two other hosts also....

I'm suspending this host and will switch back to Universe@home until someone can figure it out.
18) Message boards : Number crunching : fubar host of the day (Message 36238)
Posted 3 Aug 2018 by vseven
Post:
This computer 10522400 is not running properly because of a wrong configuration with sandbox.

You need a clean installation of VirtualBox, with a reboot after uninstalling and a reboot after installing.

If sandbox is then still available, I don't know an answer.

Edit: Do you have a second hypervisor installed?


Ok, here is what I did. These are all things I've done before, but I did them again anyway:

- Shut down BOINC and stopped running tasks. Waited for task manager to show all "VirtualBox Interface" processes closed and all BOINC processes gone
- Uninstalled VirtualBox 5.2.16 through add/remove programs. (Side note: the only programs installed on this host are BOINC, Microsoft Silverlight, and VirtualBox)
- Rebooted the server
- BOINC started (service install) and all tasks reported "Postponed: Detection of VM Hypervisor failed. (8 CPUs)" which is correct since VB is gone.
- Shut down BOINC again.
- Cleaned up the machine just to make sure. (Deleted all temp files, deleted remains of c:\Program Files\Oracle, deleted HKLM\Software\Oracle registry key, etc)
- Installed VirtualBox 5.2.16. All default settings (full install).
- Installed VirtualBox Extension Pack 5.2.16
- Rebooted server

BOINC came back up and is chugging along at the same pace as before. 8 CPU WU's are taking 12 - 14 minutes each. Here is one that was waiting to run, switched to running (after the reinstall), and finished in 13:10: https://lhcathome.cern.ch/lhcathome/result.php?resultid=203538752 It is marked as a validate error.

This server was a former Hyper-V host but has been retired. So yes, it did have the Hyper-V role installed at one point. But I've removed that role. In fact it has 0 roles installed (other than Storage Services, which is required and cannot be removed).
19) Message boards : Number crunching : fubar host of the day (Message 36235)
Posted 3 Aug 2018 by vseven
Post:
So that host https://lhcathome.cern.ch/lhcathome/results.php?hostid=10522400&offset=0&show_names=0&state=4&appid= at the moment has 785 VALID tasks that used about 3 minutes of CPU time for a credit of about 300 each.

Would an admin like to confirm that they are valid results please?
I see the VMs have 8 CPUs assigned but the run time is barely enough to spin up the VM let alone do any valid work.


This is a sandbox installation from Docker in VirtualBox:
2018-08-01 05:54:27 (4100): Detected: Sandbox Configuration Enabled

vseven, and you don't know about this installation??


I'm not sure what you mean. I have BOINC 7.10.2 (x64) and VirtualBox 5.2.16 installed. Beyond that neither host has been modified in any way. And again as said I have tried a reset project on both, watched a fresh VDI file get downloaded, and both hosts run the exact same way.

I have tried older versions of VB with no change (5.1.*, 5.2.8, etc). If there is something I can try to test I will, but this is a vanilla install with no app_config or any other modifications.

Both hosts are Server 2012 R2 if that matters. I posted the actual specs a couple posts up.
20) Message boards : Number crunching : fubar host of the day (Message 36233)
Posted 3 Aug 2018 by vseven
Post:
The main reason why I accused you of cheating is the fact that you ask for help but do everything to hide relevant information.


.....and I posted links to example tasks. Maybe that's considered "doing everything to hide relevant information" :rolleyes:

As a side note I've also crunched SixTrack WU's, hundreds of them since I can run 40 at a time. 100% valid results with both hosts. These hosts were doing Universe@Home tasks for two months before switching to LHC. 100% valid results from there also. They also did SRBase for a while. 100% valid from that project. Hence me saying something is buggy in Atlas.

That is quite the machine there.
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10522402

It is running those tasks with 8 cores each


Correct...it's one of mine. A dual Xeon E5-2650 v3 @ 2.3 GHz with 128 GB of RAM and SSD hard drives. Hyperthreading is turned on so it shows as 40 threads. So it runs five 8 CPU tasks at a time. My "problem" machine is a dual Xeon X5670 @ 2.93 GHz with 64 GB of RAM and SSD hard drives. Hyperthreading is also turned on so it shows as 24 threads. It runs three 8 CPU tasks at a time but apparently not very well...

