Message boards : Number crunching : Imbalance between Subprojects

Previous · 1 · 2 · 3 · Next

AuthorMessage
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,702
RAC: 102,022
Message 30593 - Posted: 2 Jun 2017, 12:49:01 UTC - in response to Message 30592.  

Yes, if we can agree on what is wanted ...

I guess this is rather clear, isn't it?
ID: 30593
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,731,442
RAC: 233,853
Message 30598 - Posted: 2 Jun 2017, 18:12:20 UTC

I would like to do about the same for each sub-project or, as I said before, whatever priorities the scientists need.
ID: 30598
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30601 - Posted: 2 Jun 2017, 19:20:03 UTC - in response to Message 30598.  

I would like to do about the same for each sub-project or, as I said before, whatever priorities the scientists need.


Which metric should be the same: tasks, wall time, CPU time, credit, or something else? Forget about the scientists :) We have to assume that they always have work and that to them it is the most important thing ever. The person who prioritizes ATLAS tasks over CMS or vice versa is a brave (or stupid) person. :) In the cases where a machine is dedicated to one application, there is nothing to do. For machines that are shared, how they are shared should be decided by the machine owner. In the absence of anything else, BOINC just does what it does; it may not be the best, but it does work and nobody can complain that we are biased.
ID: 30601
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,928,195
RAC: 137,743
Message 30604 - Posted: 2 Jun 2017, 20:17:22 UTC - in response to Message 30601.  

I vote for this order:

1. wall time
2. # of WUs


Comments:
Wall time sometimes differs between my host's scheduler request and the value on the CERN website.
It should be investigated why.
Multicore wall time should be calculated as: raw wall time * # of used cores

# of WUs should be used if the wall-time calculation or transfer is unreliable



The following metrics shouldn't be used:
CPU time is unreliable if errors occur, e.g. repeated uploads (CMS)

Credits are a "moving target", as the calculation is based on too many parameters that change from WU to WU.


Each subproject should get its own switch for # of CPUs (to prepare for more multicore apps)
Each subproject should get its own switch for max # of WUs
Each subproject should get a priority value field (default 100; same function as users know from the BOINC client, but based on the metric above)
Each subproject should have a new switch: "normal" vs. "backup"; send backup work only if the normal ones don't have work
The standard ON/OFF switch shouldn't be removed
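A minimal sketch of the metric proposed above (the names and numbers are illustrative only, not existing LHC@home or BOINC code): effective wall time is raw wall time multiplied by the number of cores used, with a fallback to WU counts when the wall-time data is unreliable:

```python
# Sketch of the proposed subproject accounting (hypothetical names, not project code).

def effective_wall_time(raw_wall_time_s, cores_used):
    """Multicore wall time: raw wall time * # of used cores."""
    return raw_wall_time_s * cores_used

def subproject_shares(stats, wall_time_reliable=True):
    """Each subproject's share of the total, by effective wall time when
    it is trustworthy, otherwise by # of WUs (the proposed fallback)."""
    if wall_time_reliable:
        totals = {name: effective_wall_time(s["wall_time_s"], s["cores"])
                  for name, s in stats.items()}
    else:
        totals = {name: s["wus"] for name, s in stats.items()}
    grand_total = sum(totals.values()) or 1  # avoid division by zero
    return {name: t / grand_total for name, t in totals.items()}

# Example: a 4-core ATLAS hour counts four times a 1-core Theory hour.
stats = {
    "ATLAS":  {"wall_time_s": 3600, "cores": 4, "wus": 1},
    "Theory": {"wall_time_s": 7200, "cores": 1, "wus": 2},
}
print(subproject_shares(stats))  # ATLAS 14400 s vs Theory 7200 s
```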
ID: 30604
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30609 - Posted: 2 Jun 2017, 21:04:20 UTC - in response to Message 30601.  

One thing that I forgot to highlight is that the scientist (application) is concerned with the overall usage and throughput rather than an individual machine. Looking at the plot from the 8th BOINC Pentathlon, you can see the large number of SixTrack tasks and the lasagne of VM tasks at the bottom. This suggests to us that the default scheduling is sufficient. The VM applications are all at the same level, and it is possible to pump out a large number of SixTrack tasks. Our concern at the moment is that the total number of VM tasks that we run is much smaller than the number of classic tasks. Rather than optimising a few percent of difference in the sharing of resources, we should focus on being able to exploit an order of magnitude more resources.

So in summary, project-level scheduling seems to be OK for us, and what would be great is if more volunteers could run VM applications. What is the limitation? Improving the sharing of individual machines would be important for some volunteers, and we can try to make improvements in this area, but it will not necessarily make a big difference to the overall project. Of course, having happy volunteers is very important for the health of a project, so it is something that should be addressed.
ID: 30609
Harri Liljeroos

Joined: 28 Sep 04
Posts: 674
Credit: 43,151,503
RAC: 15,790
Message 30611 - Posted: 2 Jun 2017, 21:27:01 UTC - in response to Message 30604.  

computezrmle's suggestions sound very good to me.
ID: 30611
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,731,442
RAC: 233,853
Message 30612 - Posted: 2 Jun 2017, 22:51:03 UTC - in response to Message 30601.  
Last modified: 2 Jun 2017, 22:53:52 UTC

Tasks, I think.

If you think it's OK then I'm happy.

It would be interesting to know why fewer people use the VM; computer resources?
ID: 30612
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30619 - Posted: 3 Jun 2017, 14:41:41 UTC - in response to Message 30609.  

So in summary, project-level scheduling seems to be OK for us, and what would be great is if more volunteers could run VM applications. What is the limitation?

The main limitation for most crunchers in my opinion: "It's not set and forget".
ID: 30619
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 30631 - Posted: 4 Jun 2017, 20:44:36 UTC - in response to Message 30619.  

The original BOINC philosophy was to use the power of the idle cores of volunteers' computers, at low priority.
But it appears the LHC projects need more than this.
The standard public computer has about 4 CPUs and 4 GB of RAM, even if we are starting to see 8 CPUs with 6 to 8 GB of RAM.
The further the projects' requirements are from this target, the harder it is for the volunteer to accommodate and run them.
Not everyone has the money to buy a gamer configuration for his personal use, or the skill to build it and/or to add RAM.
The volunteer wants, first of all, to keep his computer responsive when he uses it.
Not everyone has a dedicated host to crunch only BOINC projects.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Maybe a public inquiry (sent by BOINC message, to be completed by volunteers) could give data and information about the people who crunch the LHC projects.
For instance:
Man, woman?
Student, worker, retired?
Where did he learn about the possibility to run the LHC projects (TV, magazines, social networks, newspapers, ...)?
Was the computer bought only for crunching?
Does he crunch at home, at work, or both?
Does he find the instructions on the site clear or not?
Does he use the forum?
How does he rate his skill (beginner, medium, pro)?
Does he install BOINC as a service?
Does he encounter trouble while running?
Was it about the OS platform, the VirtualBox manager, the internet service provider, or the app_config?
Does he crunch other projects?
Does he know VirtualBox?
Does he use it elsewhere?
And so on...
The results of the inquiry may reveal some hidden facts and give you work to address them. The more you know about your volunteers (distribution and behaviour), the more you can understand and help them.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Theory tasks should be recommended to beginners because they tolerate shutdown and reboot without any trouble and require the least RAM.
And only 1 CPU and 1 job the first time.
By the way, some improvements could be made regarding the RAM requirements.
Apparently, it is possible to run the Theory, CMS and LHCb tasks with less RAM than defined by default (with no errors, and the duration of the internal jobs was not longer). (Is 2048 MB a remnant of the XP OS?)
Reducing the default setting may prevent the beginner volunteer's host from being saturated.
But if the default values are necessary, let the well-skilled crunchers know to increase them in their app_config.xml file.
It could enable more people to feel comfortable from the first moment.
They have to trust themselves and have the good feeling that they can do it without fear. If the computer becomes unresponsive, they don't go further...
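As a concrete illustration of the app_config.xml route: the vboxwrapper accepts a memory override on its command line. The values and names below are assumptions for illustration, not a project recommendation; check client_state.xml on your own host for the exact app_name and plan_class.

```xml
<!-- app_config.xml in the LHC@home project directory (illustrative values;
     app_name and plan_class are assumptions - verify them in client_state.xml). -->
<app_config>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_theory</plan_class>
    <cmdline>--memory_size_mb 1024</cmdline>
  </app_version>
</app_config>
```

The BOINC Manager can reload this via Options → Read config files, so a changed value can be tried without detaching the project.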
ID: 30631
Dave Peachey

Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 30632 - Posted: 4 Jun 2017, 22:08:46 UTC - in response to Message 30609.  
Last modified: 4 Jun 2017, 22:13:01 UTC

So in summary, project-level scheduling seems to be OK for us and what would be great is if more volunteers could run VM applications. What is the limitation?

Laurence,

The big issue for me (which also chimes with Philippe's comment in the immediately preceding post regarding the original DC philosophy) is that running any of the LHC sub-projects bar SixTrack seems to require an investment of additional time and resources beyond what your average cruncher is prepared to commit ... albeit "average" doesn't come near to describing some of the serious crunchers on LHC@Home ;-)

Specifically, and in order to run multiple instances of one or more sub-project WUs, I've found that this means:
- significant amounts of RAM for each machine (8GB seems to be a practical minimum; 32GB is better if the machine can take it) in order to run more than one single-core VM at a time and also use each machine for anything practical at the same time
- a large monthly data download allowance (ATLAS chews through 150-200GB per WU; CMS isn't far behind with its multiple jobs per session)
- fairly significant CPU power (to get the WU results back within a reasonable timescale)
- a robust computer set-up which can be optimised and left to run of its own devices (as noted above by Crystal Pellet) without encountering a periodic dearth of work (due to connection glitches with CERN servers) or periodic failures with the WUs it receives (bad batches of work)

On this basis, none of the sub-projects could be said to supply "rock solid" and "always available" work at a low TCO! All of this costs your average cruncher in terms of money to acquire the hardware, electricity and bandwidth allowances (hence more money) to operate it and also requires a degree of monitoring (time) which is more than most people are prepared to invest.

Now I know that some of the above criticisms could be levelled at any number of other BOINC projects so it's not unique to LHC@Home. However, allowing crunchers to optimise the balance for their individual machines might go some way to encouraging more people to run the LHC VM-based sub-projects rather than just sitting around waiting for SixTrack WUs.

Dave
ID: 30632
Dave Peachey

Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 30634 - Posted: 5 Jun 2017, 0:31:38 UTC - in response to Message 30632.  
Last modified: 5 Jun 2017, 0:32:51 UTC

- a large monthly data download allowance (ATLAS chews through 150-200GB per WU; CMS isn't far behind with its multiple jobs per session)

Correction ... that should be "150-200MB per WU" (but that's still a lot given that I get through one ATLAS WU per hour per day).
ID: 30634
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30647 - Posted: 6 Jun 2017, 8:56:07 UTC - in response to Message 30619.  

The main limitation for most crunchers in my opinion: "It's not set and forget".


Why not? When do you have to intervene? We would really like this to just work out of the box so please let us know the reasons why it doesn't.
ID: 30647
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30648 - Posted: 6 Jun 2017, 9:24:57 UTC - in response to Message 30631.  

The original boinc philosophy was to use the power of the idle cores of volunteer's computer with low priority.


A computer has RAM, storage and network in addition to CPU, and we would like to use all those resources. This may differ from the original philosophy, but it is still volunteer computing, and technology changes over time. For example, today, with people watching TV online, home residential networks have the capacity to do more data-intensive computations.


The standard public computer has about 4 CPUs and 4 GB of RAM, even if we are starting to see 8 CPUs with 6 to 8 GB of RAM.
The further the projects' requirements are from this target, the harder it is for the volunteer to accommodate and run them.


The specification of the HEP jobs is 2GB per core. This is what is needed to do the science. In addition, the applications only run on Linux and require 64 bits. SixTrack is different, as its application models the accelerator itself rather than the collisions. The other constraint is that LHC@home has to fit into the overall computing infrastructure that we have. For more details, you may be interested in this presentation that I gave on the computing challenge.


Not everyone has the money to buy a gamer configuration for his personal use, or the skill to build it and/or to add RAM.
The volunteer wants, first of all, to keep his computer responsive when he uses it.
Not everyone has a dedicated host to crunch only BOINC projects.

We do not expect people to buy machines specifically for this. If we go down this route, we should have a "sponsor a machine in our data center" program :)
However, we would be happy if 1 core of that 4-core machine could be put to use between 5pm and 9am.


Maybe a public inquiry (sent by BOINC message, to be completed by volunteers) could give data and information about the people who crunch the LHC projects.


This has been studied by the Citizen Cyberlab.
ID: 30648
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30649 - Posted: 6 Jun 2017, 9:44:15 UTC - in response to Message 30632.  


The big issue for me (which also chimes with Philippe's comment in the immediately preceding post regarding the original DC philosophy) is that running any of the LHC sub-projects bar SixTrack seems to require an investment of additional time and resources beyond what your average cruncher is prepared to commit ... albeit "average" doesn't come near to describing some of the serious crunchers on LHC@Home ;-)

Agreed and we would like to make this easier.


On this basis, none of the sub-projects could be said to supply "rock solid" and "always available" work at a low TCO! All of this costs your average cruncher in terms of money to acquire the hardware, electricity and bandwidth allowances (hence more money) to operate it and also requires a degree of monitoring (time) which is more than most people are prepared to invest.


We would like to make this "rock solid" to at least reduce the time that people are investing.
ID: 30649
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 30650 - Posted: 6 Jun 2017, 10:04:18 UTC - in response to Message 30649.  
Last modified: 6 Jun 2017, 10:05:35 UTC

It has been rock solid for me for several weeks, and I have not had any work shortages. But I run it on a dedicated Ubuntu machine with 32 GB memory, which I don't reboot very often.

Any errors have cleared themselves up; no problems with VirtualBox.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10477864
ID: 30650
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30651 - Posted: 6 Jun 2017, 10:30:20 UTC - in response to Message 30647.  

The main limitation for most crunchers in my opinion: "It's not set and forget".


Why not? When do you have to intervene? We would really like this to just work out of the box so please let us know the reasons why it doesn't.

I don't have reasons of my own (I'm not the average cruncher), but here is what I've seen here and read on other fora:

    a. Crunchers have to enable VT-x in their BIOS when it is disabled by default.
    b. The VirtualBox in the BOINC package is too old, so one has to download and install VBox oneself (incl. the Extension Pack).
    A lot of crunchers are reluctant to install other software besides BOINC.
    c. BOINC sometimes sets vbox disabled and does not correct this after VBox is installed.
    d. When no jobs are available for a VM, the BOINC task produces an error even though the volunteer's host is functioning well.


The existence of Yeti's great checklist is itself also proof that it's not a set-and-forget project.

ID: 30651
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,702
RAC: 102,022
Message 30652 - Posted: 6 Jun 2017, 10:51:24 UTC - in response to Message 30650.  

It has been rock solid for me for several weeks, and I have not had any work shortages. But I run it on a dedicated Ubuntu machine with 32 GB memory, which I don't reboot very often.

Any errors have cleared themselves up; no problems with VirtualBox.


Similar situation with me: I run 2 GPUGRID tasks + 8 LHC@Home tasks on a dedicated machine (Windows) with a 6+6 (HT) core processor and 32GB of RAM.

In addition, both projects run on 3 smaller PCs as well, plus 1 CMS even on a 4GB notebook.

Of course, this dedication needs more than just "set and forget", which I would not be so interested in anyway. Everything together needs close watching from time to time, and turning some screws now and then; only this makes it interesting :-)
ID: 30652 · Report as offensive     Reply Quote
Yeti
Volunteer moderator

Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 30654 - Posted: 6 Jun 2017, 11:36:52 UTC

Laurence,

sorry that I have to say the following sentences, but you asked for it:

  1. BOINC in its current releases isn't really good at handling VMs and their memory requirements. BOINC should improve its handling of VMs so that a cruncher never sees "postponed: .....". Each postponed message shows that BOINC can't handle the current situation correctly.

  2. Why does the volunteer have to check whether his machine is capable of running VMs? I feel this should be done by BOINC, or you could create a small tool that checks a volunteer's machine and tells him exactly what has to be done.

  3. Normal crunchers do not expect to have to do more than attach the project. I remember times when people were running Theory week after week and were credited, but they never crunched anything useful because they were sitting behind firewalls.

  4. VMs are not the same as "normal" BOINC apps. Normal BOINC apps run at idle priority, and in all the years we could allow 100% use of the processors for BOINC without the machine getting sluggish.

    A running VM, however, runs at "normal" priority, so if you want to use your machines for normal work, it is necessary to find out the maximum number of cores individually for each of your hosts. On my top machine, I can only run 2x 4-core ATLAS (it is a 6-core CPU with HT). If I offer one more core to BOINC, my machine gets sluggish.

  5. Remember that your VMs download large packets after starting a WU. Not every user has enough bandwidth for this, and look at CMS: it needs so much upload speed that this seems to be a bigger problem.

  6. As far as I can see, you need enthusiastic users to run your VM projects.

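Point 2 could start very small: a pre-flight check for the hardware-virtualization CPU flags is a one-pager. The sketch below is my own illustration (Linux-only, not an existing BOINC tool); it reads /proc/cpuinfo and looks for the vmx (Intel VT-x) or svm (AMD-V) flag.

```python
# Minimal Linux-only pre-flight check for hardware virtualization support
# (an illustration of the suggestion above, not an existing BOINC tool).

def virtualization_flags(cpuinfo_text):
    """Return which virtualization flags are present in a /proc/cpuinfo
    dump: 'vmx' (Intel VT-x) and/or 'svm' (AMD-V)."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return flags & {"vmx", "svm"}

def check_host(path="/proc/cpuinfo"):
    """Read the host's cpuinfo and report whether VMs can run."""
    with open(path) as f:
        found = virtualization_flags(f.read())
    if found:
        return "Hardware virtualization available: " + ", ".join(sorted(found))
    return "No vmx/svm flag found: VT-x/AMD-V may be absent or disabled in the BIOS."

# Example (on a Linux host):
# print(check_host())
```

If the flag is missing even though the CPU supports it, the likely cause is exactly case (a) from Crystal Pellet's list: VT-x disabled in the BIOS.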

Coming to the end, I would like to point you to two important base points:

First, invest in the needed improvements I mentioned above.

Second, if the first is successfully finished, invest in some kind of marketing for running the VM projects.

Marketing need not be expensive, but it should make it more attractive to run the VMs.

Set up a challenge; let people win or earn a visit to CERN, meet key people from their favourite project, see the real LHC, ...

Ahh, I nearly forgot: LHC@Home is a project that grants very low credit compared to other projects. So a lot of crunchers who are in it for the credits don't like to crunch LHC@Home. Not all people are interested in the science at CERN.

You could change this by raising the credit for running VMs.




Supporting BOINC, a great concept !
ID: 30654
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,702
RAC: 102,022
Message 30655 - Posted: 6 Jun 2017, 11:44:41 UTC - in response to Message 30654.  

Ahh, I nearly forgot: LHC@Home is a project that grants very low credit compared to other projects ...

that's what I was wondering about to begin with. What's the reason for this?


You could change this by raising the credit for running VMs

I fully endorse this.
ID: 30655
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30656 - Posted: 6 Jun 2017, 11:46:16 UTC - in response to Message 30651.  


a. Crunchers have to enable VT-x in their BIOS when it is disabled by default.

This is something that we can't work around. We did have a look at Hyper-V for Windows but just ended up in the same situation. The only hope is that the HEP Software Foundation comes up with some new algorithms that are better for us.


b. The VirtualBox in the BOINC package is too old, so one has to download and install VBox oneself (incl. the Extension Pack).

This is something that can and should be fixed.


c. BOINC sometimes sets vbox disabled and does not correct this after VBox is installed.

Does this still happen? I thought this bug had been fixed.


d. When no jobs are available for a VM, the BOINC task produces an error even though the volunteer's host is functioning well.

This should be resolved with the auto kill switch we recently added.


The existence of Yeti's great checklist is itself also proof that it's not a set-and-forget project.

Yes and no. There will always be situations where it doesn't work the first time, and the checklist will help identify where the issue is. However, resorting to it should be the exception rather than the rule.
ID: 30656


©2024 CERN