Message boards : Number crunching : How does task switching actually work?

CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,502,912
RAC: 22,234
Message 42508 - Posted: 16 May 2020, 1:58:44 UTC

I am currently running only the LHC@home project through BOINC, and I have selected to receive jobs from all applications.
However, I've noticed that it sometimes seems to get stuck running certain applications. For example, it will get a batch of ATLAS tasks and then run only ATLAS for the rest of the day; other times it will jump all over the place. What actually controls the order the tasks run in?
Is this something I can fix by adjusting my "switch between tasks" setting?
ID: 42508
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,947,568
RAC: 137,248
Message 42512 - Posted: 16 May 2020, 8:29:12 UTC - in response to Message 42508.  

... What actually controls the order the tasks run in?

Random.
Is this something I can fix by adjusting my "switch between tasks" setting?

No.

It has been explained a couple of times.
Feel free to use the forum's search form or just read recent threads.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5421&postid=42464

In addition to the post behind the link:
CERN runs more than one server, for load-balancing reasons.
If your (ATLAS) work request is answered by a server that has no ATLAS task in its queue at that moment, you will either get other work or a "no work available..." message.
ID: 42512
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,502,912
RAC: 22,234
Message 42516 - Posted: 16 May 2020, 14:01:03 UTC - in response to Message 42512.  

Sorry, I explained that poorly. Say I send a work request and get ten ATLAS tasks, ten Theory tasks, and ten CMS tasks.
It will run the ATLAS tasks for a little while, get about halfway in, then switch to Theory, run those for a while, then switch to CMS, and so on.
I'm wondering why it doesn't run tasks to completion and instead jumps all over the place between the tasks I have downloaded.
ID: 42516
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,502,912
RAC: 22,234
Message 42520 - Posted: 16 May 2020, 16:05:42 UTC

Here's an example. It was running 3 ATLAS tasks and 8 Theory tasks for a while. The ATLAS tasks finished and it started working on the rest of the Theory tasks I had downloaded.
The scheduler then got some work and downloaded more ATLAS tasks, and instead of letting the Theory tasks finish, it paused them and started right in on the ATLAS tasks it had just downloaded.
ID: 42520
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42522 - Posted: 16 May 2020, 16:26:31 UTC - in response to Message 42520.  

The problem is that the Theory tasks think they need 10 days, but the ATLAS tasks have an earlier deadline.
BOINC is in panic mode: it thinks it cannot finish all the tasks before their deadlines, so it starts the ones with the nearest deadline.
You can reduce the estimated time left for the Theory tasks by changing 864000 into 360000 in the project file Theory_2019_10_01.xml.
For newly started tasks, the estimated time left will then show as 100 hours once the task has run for a while.
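To illustrate the edit (a sketch only; the element name is an assumption based on typical vboxwrapper job description files, so check your own copy of Theory_2019_10_01.xml in the LHC@home project directory before editing):

    <!-- before: 864000 s = 10 days -->
    <job_duration>864000</job_duration>

    <!-- after: 360000 s = 100 hours -->
    <job_duration>360000</job_duration>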
ID: 42522
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,502,912
RAC: 22,234
Message 42530 - Posted: 17 May 2020, 23:08:59 UTC - in response to Message 42522.  

So it kind of worked. It will still start new ATLAS tasks as they come in, suspending already-running Theory tasks.
However, when it is done with the ATLAS tasks it has downloaded, it will let the Theory tasks run until the first one completes, and then it usually fetches new work.
Should I reduce the 360000 to something like 130000, or is this something I'm just going to have to live with?
ID: 42530
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42532 - Posted: 18 May 2020, 8:08:43 UTC - in response to Message 42530.  

Should I reduce the 360000 to something like 130000, or is this something I'm just going to have to live with?

The estimated time left is of no significance, because BOINC doesn't know how long the job inside the VM will run.
ID: 42532
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42661 - Posted: 30 May 2020, 11:13:59 UTC

I don't like this task switching; it seems unnecessary. I changed "Switch between applications" in BOINC to 100000 minutes (i.e. never), which means: once you start something, finish it!
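If you prefer a file over the Manager GUI, the same preference can be set in a global_prefs_override.xml in the BOINC data directory and then re-read from the Manager. A minimal sketch, assuming a standard client where the setting is called cpu_scheduling_period_minutes:

    <global_preferences>
        <!-- "Switch between applications every N minutes" -->
        <cpu_scheduling_period_minutes>100000</cpu_scheduling_period_minutes>
    </global_preferences>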
ID: 42661
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,502,912
RAC: 22,234
Message 42666 - Posted: 30 May 2020, 12:04:08 UTC - in response to Message 42661.  

I don't like this task switching; it seems unnecessary. I changed "Switch between applications" in BOINC to 100000 minutes (i.e. never), which means: once you start something, finish it!

So I tried messing around with that as well, and as far as I can tell it only applies when you are running multiple projects. Since I'm only running LHC@home, it lets the tasks run to completion.
The actual "issue" seems to be running all the LHC@home applications at once. Since the ATLAS tasks have a much earlier deadline than any of the others, BOINC likes to jump on them the instant another task finishes, and since ATLAS tasks are multi-core it will suspend the other jobs. The other VirtualBox applications really don't like this, and it was causing me tons of errored/stuck jobs that I had to abort manually.
ID: 42666
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42668 - Posted: 30 May 2020, 12:12:22 UTC - in response to Message 42666.  

I don't like this task switching; it seems unnecessary. I changed "Switch between applications" in BOINC to 100000 minutes (i.e. never), which means: once you start something, finish it!

So I tried messing around with that as well, and as far as I can tell it only applies when you are running multiple projects. Since I'm only running LHC@home, it lets the tasks run to completion.
The actual "issue" seems to be running all the LHC@home applications at once. Since the ATLAS tasks have a much earlier deadline than any of the others, BOINC likes to jump on them the instant another task finishes, and since ATLAS tasks are multi-core it will suspend the other jobs. The other VirtualBox applications really don't like this, and it was causing me tons of errored/stuck jobs that I had to abort manually.


Ah yes, I get that too when ATLAS appears. BOINC is really rubbish at multi-core scheduling. I'd prefer it just let the single-core ones run down a bit and then used a few too many cores for a while. I had hundreds of single-core Rosetta tasks, then it downloaded some 8-core ATLAS tasks. Once it got down to 7 Rosettas left, it paused them all and started an ATLAS, leaving them all 95% done! Then eventually it realised that the Rosettas would end up missing the deadline and panicked, but failed to return them quickly enough, since BOINC got paused for a couple of hours by an exclusive application (a game).

I saw a similar problem - I had a Theory task which was not going to meet its deadline, so instead of panicking and running that immediately, it panicked and ran a different LHC task in priority mode - completely insane!
ID: 42668
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,502,912
RAC: 22,234
Message 42671 - Posted: 30 May 2020, 12:29:19 UTC - in response to Message 42668.  

Yeah, I plan to build an ATLAS-only box at some point in the future, when I retire this computer.
That seems like the easiest way to fix the issue, as I'm not sure the ATLAS team could adjust the deadlines.
I don't want to mess up their science just so I don't have to check on my computer once a day, lol.
ID: 42671
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42673 - Posted: 30 May 2020, 12:48:07 UTC - in response to Message 42671.  

Yeah, I plan to build an ATLAS-only box at some point in the future, when I retire this computer.
That seems like the easiest way to fix the issue, as I'm not sure the ATLAS team could adjust the deadlines.
I don't want to mess up their science just so I don't have to check on my computer once a day, lol.


I have a couple of 24-core machines which run LHC (and other projects). Usually I have my LHC account set to hand out anything. For some reason this gives me virtually no ATLAS tasks; I can only assume they're popular and get used up by other people running ATLAS only. This works out well, because I only ever have 0 or 1 ATLAS tasks running, so other single-core tasks fill in the rest.

Sometimes I like to run ATLAS for a while. I did it recently to see if I had enough RAM, and because I was wondering why I never got any. So I turned the other projects off and set LHC to send only ATLAS. That fits nicely, as I get 3 x 8-core ATLAS tasks running at once. I guess if you have, for example, a 12-core machine, you should set your LHC account preferences to limit it to 6 cores per task; then you can run two 6s (see the sketch below).
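If you'd rather force the core count locally than via the website, an app_config.xml in the LHC@home project folder should do it. A sketch only; the app name and plan class below are assumptions, so check the entries for your own ATLAS tasks in client_state.xml first:

    <app_config>
        <app_version>
            <app_name>ATLAS</app_name>
            <plan_class>vbox64_mt_mcore_atlas</plan_class>
            <!-- cap each ATLAS VM at 6 cores -->
            <avg_ncpus>6</avg_ncpus>
        </app_version>
    </app_config>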
ID: 42673
Keith T.
Joined: 1 Mar 07
Posts: 47
Credit: 32,356
RAC: 0
Message 42745 - Posted: 1 Jun 2020, 17:22:23 UTC

This is more of a BOINC question than an LHC question.

The BOINC client switches between tasks every 60 minutes by default, if you have more tasks in progress than available cores.

I have seen it try to run several tasks but run out of memory or disk space. It may then switch to a different task with lower expected resource usage.

The Event Log may be helpful (see the sketch below); otherwise, you might get more information at https://boinc.berkeley.edu/forum_index.php
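If you want the Event Log to show why the client switches tasks, the documented scheduler log flags can be enabled in cc_config.xml (in the BOINC data directory; re-read config files afterwards). A minimal sketch:

    <cc_config>
        <log_flags>
            <!-- log CPU scheduler decisions: task switches and preemption -->
            <cpu_sched_debug>1</cpu_sched_debug>
            <!-- log the deadline simulation behind "panic mode" -->
            <rr_simulation>1</rr_simulation>
        </log_flags>
    </cc_config>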
ID: 42745
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42746 - Posted: 1 Jun 2020, 17:27:47 UTC - in response to Message 42745.  

This is more of a BOINC question than an LHC question.

The BOINC client switches between tasks every 60 minutes by default, if you have more tasks in progress than available cores.

I have seen it try to run several tasks but run out of memory or disk space. It may then switch to a different task with lower expected resource usage.


Sometimes it does, sometimes it doesn't. I've not spotted any sensible decisions by BOINC! For example, I had a machine with Rosetta and Universe tasks queued. The Rosettas need a lot of RAM; the Universes need hardly any. Yet it sat for hours with 3 Rosettas running and a core idle. It would not start a Universe task, which would easily have fitted into RAM. It had another Rosetta task that it wanted to run next, with an error against it saying I had run out of RAM. [Facepalm]
ID: 42746
Keith T.
Joined: 1 Mar 07
Posts: 47
Credit: 32,356
RAC: 0
Message 42747 - Posted: 1 Jun 2020, 17:34:48 UTC - in response to Message 42746.  

There is apparently a BOINC flag that can limit the number of tasks a project will run at once.
I have not managed to get it to work properly yet.
My 4-core Intel Atom machine can only manage 1 or 2 Rosetta tasks without struggling, but it also runs Einstein tasks on the GPU. It can manage 3 LHC or WCG tasks, or 4 SETI tasks, if we ever get work from SETI again :(
ID: 42747
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42748 - Posted: 1 Jun 2020, 17:43:47 UTC - in response to Message 42747.  
Last modified: 1 Jun 2020, 17:44:40 UTC

There is apparently a BOINC flag that can limit the number of tasks a project will run at once.
I have not managed to get it to work properly yet.
My 4-core Intel Atom machine can only manage 1 or 2 Rosetta tasks without struggling, but it also runs Einstein tasks on the GPU. It can manage 3 LHC or WCG tasks, or 4 SETI tasks, if we ever get work from SETI again :(


Yes, I use the flag in the Rosetta project folder, with max_concurrent set to 3 (sketch below). I'm just irritated that it can't work this out by itself. If you're moving house and have a box that you can fit only 3 large objects into, you put a smaller one in with them rather than leave empty space!
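For reference, it goes in an app_config.xml in the Rosetta project folder, followed by "Options → Read config files" in the Manager. A sketch, assuming the app's short name is rosetta (check client_state.xml for the exact name):

    <app_config>
        <app>
            <name>rosetta</name>
            <!-- never run more than 3 of these tasks at once -->
            <max_concurrent>3</max_concurrent>
        </app>
    </app_config>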

Not sure about SETI. They did say they were considering a different way of scanning for signals.
ID: 42748
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,502,912
RAC: 22,234
Message 42761 - Posted: 2 Jun 2020, 0:57:02 UTC - in response to Message 42748.  

The main issue with that for me is that ATLAS loves to give me 8-core tasks, which then kick 8 other jobs to the side and usually break them.
What I want is for the scheduler to say "hey, an 8-core ATLAS is ready", let 8 more tasks finish, and then slot the ATLAS into the free space. I could limit ATLAS to single-core tasks, but that kind of defeats the point of a Threadripper, no?
ID: 42761
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 42763 - Posted: 2 Jun 2020, 1:04:25 UTC - in response to Message 42761.  

The main issue with that for me is that ATLAS loves to give me 8-core tasks, which then kick 8 other jobs to the side and usually break them.
What I want is for the scheduler to say "hey, an 8-core ATLAS is ready", let 8 more tasks finish, and then slot the ATLAS into the free space. I could limit ATLAS to single-core tasks, but that kind of defeats the point of a Threadripper, no?


This is a major flaw in the workings of the BOINC scheduler. You could try asking them to sort it, but they're a strange bunch, and you'll have to visit them on GitHub as they don't listen to anyone in the forums.

Although, why are your single-core tasks breaking? The only problem I end up with is stuff not meeting deadlines. The BOINC scheduler is rubbish at that; it leaves things to the last minute, and if you happen to have the computer off or be playing a game etc., you're a bit late sending them back.
ID: 42763
CloverField

Joined: 17 Oct 06
Posts: 74
Credit: 51,502,912
RAC: 22,234
Message 42765 - Posted: 2 Jun 2020, 2:17:33 UTC - in response to Message 42763.  
Last modified: 2 Jun 2020, 2:18:22 UTC

The main issue with that for me is that ATLAS loves to give me 8-core tasks, which then kick 8 other jobs to the side and usually break them.
What I want is for the scheduler to say "hey, an 8-core ATLAS is ready", let 8 more tasks finish, and then slot the ATLAS into the free space. I could limit ATLAS to single-core tasks, but that kind of defeats the point of a Threadripper, no?


This is a major flaw in the workings of the BOINC scheduler. You could try asking them to sort it, but they're a strange bunch, and you'll have to visit them on GitHub as they don't listen to anyone in the forums.

Although, why are your single-core tasks breaking? The only problem I end up with is stuff not meeting deadlines. The BOINC scheduler is rubbish at that; it leaves things to the last minute, and if you happen to have the computer off or be playing a game etc., you're a bit late sending them back.


It seems to be due to network IO. It only happens if the task is switched out within roughly the first ten minutes, or however long it takes to configure itself. When ATLAS forces the task swap, the tasks are waiting to fetch something, and when they come back online they are still in that waiting state and just sit there forever.

In the case of Theory, they either do something like this or they are completely unresponsive and you can't reach them at all through the VM console.

ID: 42765
