Message boards : Number crunching : Imbalance between Subprojects

Previous · 1 · 2 · 3 · Next

AuthorMessage
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,702
RAC: 102,022
Message 30593 - Posted: 2 Jun 2017, 12:49:01 UTC - in response to Message 30592.  

Yes, if we can agree on what is wanted ...

I guess this is rather clear, isn't it?
ID: 30593
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,731,442
RAC: 233,853
Message 30598 - Posted: 2 Jun 2017, 18:12:20 UTC

I would like to do about the same for each sub-project or, as I said before, whatever priorities the scientists need.
ID: 30598
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30601 - Posted: 2 Jun 2017, 19:20:03 UTC - in response to Message 30598.  

I would like to do about the same for each sub-project or, as I said before, whatever priorities the scientists need.


Which metric should be the same: tasks, wall time, CPU time, credit, or something else? Forget about the scientists :) We have to assume that they always have work and that to them it is the most important thing ever. The person who prioritizes ATLAS tasks over CMS or vice versa is a brave (or stupid) person. :) In the cases where a machine is dedicated to one application, there is nothing to do. For machines that are shared, how they are shared should be decided by the machine owner. In the absence of anything else, BOINC just does what it does; it may not be the best, but it does work and nobody can complain that we are biased.
ID: 30601
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,928,195
RAC: 137,743
Message 30604 - Posted: 2 Jun 2017, 20:17:22 UTC - in response to Message 30601.  

I vote for this order:

1. wall time
2. # of WUs


Comments:
Wall time sometimes differs between my host's scheduler request and the value on the CERN website.
It should be investigated why.
Multicore wall time should be calculated as: raw wall time * # of used cores

# of WUs should be used if the wall-time calculation or transfer is unreliable



The following metrics shouldn't be used:
CPU time is unreliable if errors occur, e.g. repeated uploads (CMS)

Credits are a "moving target", as the calculation is based on too many parameters that change from WU to WU.


Each subproject should get its own switch for # of CPUs (to prepare for more multicore apps)
Each subproject should get its own switch for max # of WUs
Each subproject should get a priority value field (default 100; same function as users know from the BOINC client, but based on the metric above)
Each subproject should have a new switch: "normal" vs. "backup"; send backup work only if the normal ones don't have work
The standard ON/OFF switch shouldn't be removed
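A minimal sketch of the metric proposed above (the names and numbers are illustrative only, not existing LHC@home or BOINC code): effective wall time is raw wall time multiplied by the number of cores used, with a fallback to WU counts when the wall-time data is unreliable:

```python
# Sketch of the proposed subproject accounting (hypothetical names, not project code).

def effective_wall_time(raw_wall_time_s, cores_used):
    """Multicore wall time: raw wall time * # of used cores."""
    return raw_wall_time_s * cores_used

def subproject_shares(stats, wall_time_reliable=True):
    """Each subproject's share of the total, by effective wall time when
    it is trustworthy, otherwise by # of WUs (the proposed fallback)."""
    if wall_time_reliable:
        totals = {name: effective_wall_time(s["wall_time_s"], s["cores"])
                  for name, s in stats.items()}
    else:
        totals = {name: s["wus"] for name, s in stats.items()}
    grand_total = sum(totals.values()) or 1  # avoid division by zero
    return {name: t / grand_total for name, t in totals.items()}

# Example: a 4-core ATLAS hour counts four times a 1-core Theory hour.
stats = {
    "ATLAS":  {"wall_time_s": 3600, "cores": 4, "wus": 1},
    "Theory": {"wall_time_s": 7200, "cores": 1, "wus": 2},
}
print(subproject_shares(stats))  # ATLAS 14400 s vs Theory 7200 s
```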
ID: 30604
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30609 - Posted: 2 Jun 2017, 21:04:20 UTC - in response to Message 30601.  

One thing that I forgot to highlight is that the scientist (application) is concerned with the overall usage and throughput rather than an individual machine. Looking at the plot from the 8th BOINC Pentathlon, you can see the large number of SixTrack tasks and the lasagne of VM tasks at the bottom. This suggests to us that the default scheduling is sufficient. The VM applications are all at the same level, and it is possible to pump out a large number of SixTrack tasks. Our concern at the moment is that the total number of VM tasks that we run is much smaller than the number of classic tasks. Rather than optimising a few percent of difference in the sharing of resources, we should focus on being able to exploit an order of magnitude more resources.

So in summary, project-level scheduling seems to be OK for us, and what would be great is if more volunteers could run VM applications. What is the limitation? Improving the sharing of individual machines would be important for some volunteers, and we can try to make improvements in this area, but it will not necessarily make a big difference to the overall project. Of course, having happy volunteers is very important for the health of a project, so it is something that should be addressed.
ID: 30609
Harri Liljeroos

Joined: 28 Sep 04
Posts: 674
Credit: 43,151,503
RAC: 15,790
Message 30611 - Posted: 2 Jun 2017, 21:27:01 UTC - in response to Message 30604.  

computezrmle's suggestions sound very good to me.
ID: 30611
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,731,442
RAC: 233,853
Message 30612 - Posted: 2 Jun 2017, 22:51:03 UTC - in response to Message 30601.  
Last modified: 2 Jun 2017, 22:53:52 UTC

Tasks, I think.

If you think it's OK then I'm happy.

It would be interesting to know why fewer people use the VM; computer resources?
ID: 30612
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30619 - Posted: 3 Jun 2017, 14:41:41 UTC - in response to Message 30609.  

So in summary, project-level scheduling seems to be OK for us, and what would be great is if more volunteers could run VM applications. What is the limitation?

The main limitation for most crunchers in my opinion: "It's not set and forget".
ID: 30619
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 30631 - Posted: 4 Jun 2017, 20:44:36 UTC - in response to Message 30619.  

The original BOINC philosophy was to use the power of the idle cores of volunteers' computers, at low priority.
But it appears the LHC projects need more than this.
The standard public computer has about 4 CPUs and 4 GB of RAM, even if we are starting to see 8 CPUs with 6 to 8 GB of RAM.
The further the projects' requirements are from this target, the harder it is for the volunteer to accommodate and run them.
Not everyone has the money to buy a gamer configuration for his personal use, or the skill to build it and/or to add RAM.
The volunteer wants, first of all, to keep his computer responsive when he uses it.
Not everyone has a dedicated host to crunch only BOINC projects.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Maybe a public inquiry (sent by BOINC message, to be completed by volunteers) could give data and information about the people who crunch the LHC projects.
For instance:
Man, woman?
Student, worker, retired?
Where did he learn about the possibility to run the LHC projects (TV, magazines, social networks, newspapers, ...)?
Was the computer bought only for crunching?
Does he crunch at home, at work, or both?
Does he find the instructions on the site clear or not?
Does he use the forum?
How does he rate his skill (beginner, medium, pro)?
Does he install BOINC as a service?
Does he encounter trouble while running?
Was it about the OS platform, the VirtualBox manager, the internet service provider, or the app_config?
Does he crunch other projects?
Does he know VirtualBox?
Does he use it elsewhere?
And so on...
The results of the inquiry may reveal some hidden facts and give you work to address them. The more you know about your volunteers (distribution and behaviour), the more you can understand and help them.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Theory tasks should be recommended to beginners because they tolerate shutdown and reboot without any trouble and require the least RAM.
And only 1 CPU and 1 job the first time.
By the way, some improvements could be made regarding the RAM requirements.
Apparently, it is possible to run the Theory, CMS and LHCb tasks with less RAM than defined by default (with no errors, and the duration of the internal jobs was not longer). (Is 2048 MB a remnant of the XP OS?)
Reducing the default setting may prevent the beginner volunteer's host from being saturated.
But if the default values are necessary, let the well-skilled crunchers know to increase them in their app_config.xml file.
It could enable more people to feel comfortable from the first moment.
They have to trust themselves and have the good feeling that they can do it without fear. If the computer becomes unresponsive, they don't go further...
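As a concrete illustration of the app_config.xml route: the vboxwrapper accepts a memory override on its command line. The values and names below are assumptions for illustration, not a project recommendation; check client_state.xml on your own host for the exact app_name and plan_class.

```xml
<!-- app_config.xml in the LHC@home project directory (illustrative values;
     app_name and plan_class are assumptions - verify them in client_state.xml). -->
<app_config>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_theory</plan_class>
    <cmdline>--memory_size_mb 1024</cmdline>
  </app_version>
</app_config>
```

The BOINC Manager can reload this via Options → Read config files, so a changed value can be tried without detaching the project.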
ID: 30631
Dave Peachey

Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 30632 - Posted: 4 Jun 2017, 22:08:46 UTC - in response to Message 30609.  
Last modified: 4 Jun 2017, 22:13:01 UTC

So in summary, project-level scheduling seems to be OK for us and what would be great is if more volunteers could run VM applications. What is the limitation?

Laurence,

The big issue for me (which also chimes with Philippe's comment in the immediately preceding post regarding the original DC philosophy) is that running any of the LHC sub-projects bar SixTrack seems to require an investment of additional time and resources beyond what your average cruncher is prepared to commit ... albeit "average" doesn't come near to describing some of the serious crunchers on LHC@Home ;-)

Specifically, and in order to run multiple instances of one or more sub-project WUs, I've found that this means:
- significant amounts of RAM for each machine (8GB seems to be a practical minimum; 32GB is better if the machine can take it) in order to run more than one single-core VM at a time and also use each machine for anything practical at the same time
- a large monthly data download allowance (ATLAS chews through 150-200GB per WU; CMS isn't far behind with its multiple jobs per session)
- fairly significant CPU power (to get the WU results back within a reasonable timescale)
- a robust computer set-up which can be optimised and left to run of its own devices (as noted above by Crystal Pellet) without encountering a periodic dearth of work (due to connection glitches with CERN servers) or periodic failures with the WUs it receives (bad batches of work)

On this basis, none of the sub-projects could be said to supply "rock solid" and "always available" work at a low TCO! All of this costs your average cruncher in terms of money to acquire the hardware, electricity and bandwidth allowances (hence more money) to operate it and also requires a degree of monitoring (time) which is more than most people are prepared to invest.

Now I know that some of the above criticisms could be levelled at any number of other BOINC projects so it's not unique to LHC@Home. However, allowing crunchers to optimise the balance for their individual machines might go some way to encouraging more people to run the LHC VM-based sub-projects rather than just sitting around waiting for SixTrack WUs.

Dave
ID: 30632
Dave Peachey

Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 30634 - Posted: 5 Jun 2017, 0:31:38 UTC - in response to Message 30632.  
Last modified: 5 Jun 2017, 0:32:51 UTC

- a large monthly data download allowance (ATLAS chews through 150-200GB per WU; CMS isn't far behind with its multiple jobs per session)

Correction ... that should be "150-200MB per WU" (but that's still a lot given that I get through one ATLAS WU per hour per day).
ID: 30634
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30647 - Posted: 6 Jun 2017, 8:56:07 UTC - in response to Message 30619.  

The main limitation for most crunchers in my opinion: "It's not set and forget".


Why not? When do you have to intervene? We would really like this to just work out of the box so please let us know the reasons why it doesn't.
ID: 30647
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30648 - Posted: 6 Jun 2017, 9:24:57 UTC - in response to Message 30631.  

The original boinc philosophy was to use the power of the idle cores of volunteer's computer with low priority.


A computer has RAM, storage and network in addition to CPU, and we would like to use all those resources. This may differ from the original philosophy, but it is still volunteer computing, and technology changes over time. For example, today, with people watching TV online, home residential networks have the capacity to do more data-intensive computations.


The standard public computer has about 4 CPUs and 4 GB of RAM, even if we are starting to see 8 CPUs with 6 to 8 GB of RAM.
The further the projects' requirements are from this target, the harder it is for the volunteer to accommodate and run them.


The specification of the HEP jobs is 2GB per core. This is what is needed to do the science. In addition, the applications only run on Linux and require 64 bits. SixTrack is different, as its application models the accelerator itself rather than the collisions. The other constraint is that LHC@home has to fit into the overall computing infrastructure that we have. For more details, you may be interested in this presentation that I gave on the computing challenge.


Not everyone has the money to buy a gamer configuration for his personal use, or the skill to build it and/or to add RAM.
The volunteer wants, first of all, to keep his computer responsive when he uses it.
Not everyone has a dedicated host to crunch only BOINC projects.

We do not expect people to buy machines specifically for this. If we go down this route, we should have a "sponsor a machine in our data center" program :)
However, we would be happy if 1 core of that 4-core machine could be put to use between 5pm and 9am.


Maybe a public inquiry (sent by BOINC message, to be completed by volunteers) could give data and information about the people who crunch the LHC projects.


This has been studied by the Citizen Cyberlab.
ID: 30648
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30649 - Posted: 6 Jun 2017, 9:44:15 UTC - in response to Message 30632.  


The big issue for me (which also chimes with Philippe's comment in the immediately preceding post regarding the original DC philosophy) is that running any of the LHC sub-projects bar SixTrack seems to require an investment of additional time and resources beyond what your average cruncher is prepared to commit ... albeit "average" doesn't come near to describing some of the serious crunchers on LHC@Home ;-)

Agreed and we would like to make this easier.


On this basis, none of the sub-projects could be said to supply "rock solid" and "always available" work at a low TCO! All of this costs your average cruncher in terms of money to acquire the hardware, electricity and bandwidth allowances (hence more money) to operate it and also requires a degree of monitoring (time) which is more than most people are prepared to invest.


We would like to make this "rock solid" to at least reduce the time that people are investing.
ID: 30649
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 30650 - Posted: 6 Jun 2017, 10:04:18 UTC - in response to Message 30649.  
Last modified: 6 Jun 2017, 10:05:35 UTC

It has been rock solid for me for several weeks, and I have not had any work shortages. But I run it on a dedicated Ubuntu machine with 32 GB memory, which I don't reboot very often.

Any errors have cleared themselves up; no problems with VirtualBox.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10477864
ID: 30650
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30651 - Posted: 6 Jun 2017, 10:30:20 UTC - in response to Message 30647.  

The main limitation for most crunchers in my opinion: "It's not set and forget".


Why not? When do you have to intervene? We would really like this to just work out of the box so please let us know the reasons why it doesn't.

I don't have reasons of my own (I'm not the average cruncher), but here is what I've seen here and read on other fora:

    a. Crunchers have to enable VT-x in their BIOS when it is disabled by default.
    b. The VirtualBox in the BOINC package is too old, so one has to download and install VBox oneself (incl. the Extension Pack).
    A lot of crunchers are reluctant to install other software besides BOINC.
    c. BOINC sometimes sets vbox disabled and does not correct this after VBox is installed.
    d. When no jobs are available for a VM, the BOINC task produces an error even though the volunteer's host is functioning well.


The existence of Yeti's great checklist is itself also proof that it's not a set-and-forget project.

ID: 30651
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,702
RAC: 102,022
Message 30652 - Posted: 6 Jun 2017, 10:51:24 UTC - in response to Message 30650.  

It has been rock solid for me for several weeks, and I have not had any work shortages. But I run it on a dedicated Ubuntu machine with 32 GB memory, which I don't reboot very often.

Any errors have cleared themselves up; no problems with VirtualBox.


Similar situation with me: I run 2 GPUGRID tasks + 8 LHC@Home tasks on a dedicated machine (Windows) with a 6+6 (HT) core processor and 32GB of RAM.

In addition, both projects run on 3 smaller PCs as well, plus 1 CMS even on a 4GB notebook.

Of course, this dedication needs more than just "set and forget", which I would not be so interested in anyway. Everything together needs close watching from time to time, and turning some screws now and then; only this makes it interesting :-)
ID: 30652 · Report as offensive     Reply Quote
Yeti
Volunteer moderator

Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 30654 - Posted: 6 Jun 2017, 11:36:52 UTC

Laurence,

sorry that I have to say the following sentences, but you asked for it:

  1. BOINC in its current releases isn't really good at handling VMs and their memory requirements. BOINC should improve its handling of VMs so that a cruncher never sees "postponed: .....". Each postponed message shows that BOINC can't handle the current situation correctly.

  2. Why does the volunteer have to check whether his machine is capable of running VMs? I feel this should be done by BOINC, or you could create a small tool that checks a volunteer's machine and tells him exactly what has to be done.

  3. Normal crunchers do not expect to have to do more than attach the project. I remember times when people were running Theory week after week and were credited, but they never crunched anything useful because they were sitting behind firewalls.

  4. VMs are not the same as "normal" BOINC apps. Normal BOINC apps run at idle priority, and in all the years we could allow 100% use of the processors for BOINC without the machine getting sluggish.

    A running VM, however, runs at "normal" priority, so if you want to use your machines for normal work, it is necessary to find out the maximum number of cores individually for each of your hosts. On my top machine, I can only run 2x 4-core ATLAS (it is a 6-core CPU with HT). If I offer one more core to BOINC, my machine gets sluggish.

  5. Remember that your VMs download large packets after starting a WU. Not every user has enough bandwidth for this, and look at CMS: it needs so much upload speed that this seems to be a bigger problem.

  6. As far as I can see, you need enthusiastic users to run your VM projects.

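Point 2 could start very small: a pre-flight check for the hardware-virtualization CPU flags is a one-pager. The sketch below is my own illustration (Linux-only, not an existing BOINC tool); it reads /proc/cpuinfo and looks for the vmx (Intel VT-x) or svm (AMD-V) flag.

```python
# Minimal Linux-only pre-flight check for hardware virtualization support
# (an illustration of the suggestion above, not an existing BOINC tool).

def virtualization_flags(cpuinfo_text):
    """Return which virtualization flags are present in a /proc/cpuinfo
    dump: 'vmx' (Intel VT-x) and/or 'svm' (AMD-V)."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return flags & {"vmx", "svm"}

def check_host(path="/proc/cpuinfo"):
    """Read the host's cpuinfo and report whether VMs can run."""
    with open(path) as f:
        found = virtualization_flags(f.read())
    if found:
        return "Hardware virtualization available: " + ", ".join(sorted(found))
    return "No vmx/svm flag found: VT-x/AMD-V may be absent or disabled in the BIOS."

# Example (on a Linux host):
# print(check_host())
```

If the flag is missing even though the CPU supports it, the likely cause is exactly case (a) from Crystal Pellet's list: VT-x disabled in the BIOS.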

Coming to the end, I would like to point you to two important base points:

First, invest in the needed improvements I mentioned above.

Second, if the first is successfully finished, invest in some kind of marketing for running the VM projects.

Marketing need not be expensive, but it should make it more attractive to run the VMs.

Set up a challenge; let people win or earn a visit to CERN, meet key people from their favourite project, see the real LHC, ...

Ahh, I nearly forgot: LHC@Home is a project that grants very low credit compared to other projects. So a lot of crunchers who are in it for the credits don't like to crunch LHC@Home. Not all people are interested in the science at CERN.

You could change this by raising the credit for running VMs.




Supporting BOINC, a great concept !
ID: 30654
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,368,702
RAC: 102,022
Message 30655 - Posted: 6 Jun 2017, 11:44:41 UTC - in response to Message 30654.  

Ahh, I nearly forgot: LHC@Home is a project that grants very low credit compared to other projects ...

that's what I was wondering about to begin with. What's the reason for this?


You could change this by raising the credit for running VMs

I fully endorse this.
ID: 30655
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 30656 - Posted: 6 Jun 2017, 11:46:16 UTC - in response to Message 30651.  


a. Crunchers have to enable VT-x in their BIOS when it is disabled by default.

This is something that we can't work around. We did have a look at Hyper-V for Windows but just ended up in the same situation. The only hope is that the HEP Software Foundation comes up with some new algorithms that are better for us.


b. The VirtualBox in the BOINC package is too old, so one has to download and install VBox oneself (incl. the Extension Pack).

This is something that can and should be fixed.


c. BOINC sometimes sets vbox disabled and does not correct this after VBox is installed.

Does this still happen? I thought this bug had been fixed.


d. When no jobs are available for a VM, the BOINC task produces an error even though the volunteer's host is functioning well.

This should be resolved with the auto kill switch we recently added.


The existence of Yeti's great checklist is itself also proof that it's not a set-and-forget project.

Yes and no. There will always be situations where it doesn't work the first time, and the checklist will help identify where the issue is. However, resorting to it should be the exception rather than the rule.
ID: 30656


©2024 CERN