Message boards : Number crunching : Possible explanation for "No Tasks sent ..."
Profile Gary Roberts

Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 24368 - Posted: 15 Jul 2012, 3:07:06 UTC

From time to time the server status page indicates plenty of tasks (often thousands), but repeated scheduler requests fail to get any. I see this quite a lot and others have commented on it. This particular message is a classic recent example.

I think I understand what is happening and I'm posting this in a separate thread because there are (if I'm correct) consequences that should be addressed.

When a bunch of new tasks are inserted into the DB, the 'ready to send' (RTS) number will start increasing and those hosts 'knocking on the door' will start receiving 'primary' tasks - those with the _0 or _1 extension on their name. The RTS count will fluctuate depending on the rate of creation versus the rate of distribution.

At one end of the spectrum, there will always be a proportion of hosts that fairly promptly trash new tasks; at the other end, there will be hosts that never return them at all. So there will be a steady ongoing flow of 'resend' tasks - those with an extension of _2 or greater - being added to the RTS pool. I'm assuming that a resend task is added to the pool as soon as the server becomes aware of the primary task's failure. If you 'drill down' into WUs, you will find examples of such tasks labelled as 'unsent'.

I'm guessing that they show on the server status page as RTS. However, they remain 'unsent' for what seems like an unnecessarily long period.

I've just taken a look at the tasks list for some of my hosts. Using the 'show names' link, I can find many very recent 'resend' tasks sent to a number of hosts. In every case I looked at, there was a considerable gap between the time of failure of the primary task and the time when the resend task was actually sent out. The shortest gap I saw was around 10 hours, but most were in the 12-20 hour range. The server seems to want to 'sit on them' for a while.

I'm guessing that this is why you can see RTS tasks but you can't get them immediately.

Resend tasks have a shorter deadline - the project wants the results back ASAP and people understand this. So why can't they be distributed immediately? Is there some project config flag that controls this?

If you need a shorter deadline for resends, delaying their distribution seems counter-productive.
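
(For reference, the BOINC server documentation describes an "accelerating retries" feature controlled by flags in the scheduler section of config.xml. A minimal sketch, assuming the documented flag names, with purely illustrative values:

    <reliable_on_priority>1</reliable_on_priority>
        <!-- jobs at or above this priority go only to 'reliable' hosts -->
    <reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>
        <!-- retries sent to reliable hosts get this fraction of the normal deadline -->
    <reliable_priority_on_over>5</reliable_priority_on_over>
        <!-- boost the priority of retries by this amount so they jump the queue -->

If the priority boost is unset or zero, retries would presumably just wait their turn in the queue, which would explain delays like the ones above.)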


Cheers,
Gary.
ID: 24368
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 24375 - Posted: 16 Jul 2012, 13:46:32 UTC - in response to Message 24368.  

Although Gary has explained the general principle very clearly, I don't agree with the conclusion that the project is deliberately slowing down the distribution of resent tasks.

The set of tasks 'ready to send' is less of a 'pool', more of a linear pipeline or queue. From what I've seen, this project tends to create new jobs in relatively large batches: these will be processed first. Under recent circumstances, a significant proportion failed, causing a replacement task to be tacked on to the end of the queue. On its own, I think that would be sufficient to explain the delay before the replacements were sent out to volunteers - the 'first run' _0 and _1 from the batch all needed to be given a chance first.

Then, with no new batches being created while the staff wrestled with the new server configuration, we'll have reached a section of pipeline almost completely populated with _2 and above resent tasks. It looks as if the project has the Accelerating retries configuration in effect - whether deliberately, or as a side effect of the other server configuration issues, I'll leave to them.

That has two side-effects:

1) resent tasks are only allocated to 'reliable' hosts. Quite a few machines may have lost that status because of the project's server problems, and they may not get any work until a new batch of first-run tasks is available.

2) resent tasks are given a much shorter deadline. I'm seeing deadlines of half the already-tight 6 days 15.5 hours. If users are participating in other BOINC projects, and have set even a modest cache size, the server may judge that tasks won't be turned round in the 3-day 'resend' deadline interval, and decline to send tasks to that host.
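
(A hedged aside: a factor-of-two reduction is exactly what the documented <reliable_reduced_delay_bound> flag would produce if set to 0.5. The arithmetic:

    6 d 15.5 h = 159.5 h
    159.5 h * 0.5 = 79.75 h, i.e. about 3 d 8 h

which matches the roughly-halved deadlines being reported.)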
ID: 24375
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 24387 - Posted: 17 Jul 2012, 6:14:08 UTC - in response to Message 24375.  

Thank you Richard; this seems very reasonable to me. I can confirm we submit in batches. We have been trying to reduce the famous "tail" so that we complete a study. I have proposed to Igor that I rather re-submit the last few cases as "fresh" work on my side, as we seem to be having great problems with the BOINC features. And yes, I have also been thinking about "reliable" clients (when most of the problems are in fact ours). I would rather just switch off or blacklist clients with wrong numerical results, or which fail "all the time". Eric.
ID: 24387
Profile Gary Roberts

Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 24390 - Posted: 17 Jul 2012, 11:23:20 UTC - in response to Message 24375.  

Although Gary has explained the general principle very clearly, I don't agree with the conclusion that the project is deliberately slowing down the distribution of resent tasks.

Richard, thanks very much for your response. As usual, your insights are very helpful and provide food for thought. I'm sorry if you thought I was drawing conclusions - I was actually asking questions. I wasn't accusing the project of deliberately slowing down resends. I tried to point out that I was making assumptions, and I asked if there was some configuration flag available to promote resends higher up the queue. I understand the FIFO queue, so I shouldn't have referred to it as a 'pool'.

I freely admit I have almost zero knowledge of project configuration flags, but at one point Igor referred to 'matchmaker' scheduling, so I googled it and did some reading. I think I understand the basic principle of creating a cache of jobs in shared memory, and that matchmaker scheduling is a mechanism for selecting tasks preferentially from within that job cache. What I was really asking was whether there is any mechanism for getting tasks from the general queue preferentially into the job cache, so that they would have a chance of being distributed to hosts quickly.
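
(For anyone following along: in the BOINC server design, the job cache appears to be the feeder's shared-memory segment. A hedged sketch of the knobs documented for it, with their stated defaults:

    <shmem_work_items>100</shmem_work_items>
        <!-- number of jobs held in the shared-memory cache -->
    <feeder_query_size>200</feeder_query_size>
        <!-- number of queued jobs the feeder enumerates per pass -->

The feeder also reportedly accepts a -priority_order switch so that higher-priority jobs are loaded into the cache first. Combined with a priority boost for retries, that would be exactly the preferential path I was asking about - but this is my reading of the documentation, not the project's actual setup.)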

... this project tends to create new jobs in relatively large batches: these will be processed first.

Yes, that's my observation too, but the insertion of the new jobs does take some time - the RTS count on the server status page takes a while to reach its maximum. I can't say I've ever actually timed it, though.

... It looks as if the project has the Accelerating retries configuration in effect - whether deliberately, or as a side effect of the other server configuration issues, I'll leave to them.

Yes, but the opening sentence mentions "... send timeout-generated retries ...", which is perhaps an oversight but seems to suggest it doesn't apply to error-generated retries. Reading further suggests that it can apply to all retries, depending on which of the two flags <reliable_priority_on_over> or <reliable_priority_on_over_except_error> (if any) has been set non-zero. I can understand the urgency for a job that has timed out. Perhaps there's less need for urgency for a retry resulting from a download error?
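
(To make the distinction concrete - my hedged reading of the two flags, with illustrative values:

    <reliable_priority_on_over>5</reliable_priority_on_over>
        <!-- boost the priority of ALL retries by this amount -->
    <reliable_priority_on_over_except_error>5</reliable_priority_on_over_except_error>
        <!-- boost only timeout-generated retries; error-generated retries
             keep normal priority and hence the normal deadline -->

Presumably a project would set only one of these non-zero.)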

1) resent tasks are only allocated to 'reliable' hosts. Quite a few machines may have lost that status because of the project's server problems, and they may not get any work until a new batch of first-run tasks is available.

The great majority of my hosts had many download errors and should have been 'unreliable'. Several that I've looked at have retries - a couple have large numbers of them. I guess it only takes relatively few successful tasks to restore 'reliable' status.

2) resent tasks are given a much shorter deadline. I'm seeing deadlines of half the already-tight 6 days 15.5 hours. If users are participating in other BOINC projects, and have set even a modest cache size, the server may judge that tasks won't be turned round in the 3-day 'resend' deadline interval, and decline to send tasks to that host.

I'm seeing the 3 days 7+ hours deadline on all resends, but some of my hosts don't seem to have much trouble getting them. They are not being excluded by the server even though they have a 3-day cache.

One such host has been given around 250 tasks, another was given 450 and a third was given 200. The host in most trouble is the first one mentioned, because there were around 130-150 resends with the very short deadline in the mix. The other two have a lot fewer and so have some time to attempt recovery.

There is no way the host in trouble can finish its cache in the allowed time, so I've had to abort around 60 tasks so far, with probably more to follow. I think this host was over-supplied because the time estimate for a bunch of tasks was way too short (35 mins instead of close to 4 hrs). As soon as the first of these were finished, the estimate was adjusted and the host went into panic mode - where it still is, despite the aborts. I can now let it run for a bit using 4 cores (E@H has been put on hold), so it might be able to work through the remaining tasks. There is a bit of time before I'll need to assess this again.

I've aborted more than 20 tasks on the host that had 200. I've not aborted any yet on the host with 450, as it has already worked off the resends and the next deadline is a few days away. I'll wait another 24 hours to see if it happens to get a bunch of 'shorties', but some sort of cull seems inevitable. On this host, I actually saw a whole bunch of tasks with a 30-40 min estimate change to around 4 hours when I first noticed the huge cache and started to investigate. I assume it was much the same deal on the other two. There are other hosts with around 100+ tasks in their caches, but they seem to be coping at the moment.

I agree completely that it's quite unfair to give hosts that support multiple projects resends with such a short deadline. I don't know why the server isn't excluding my hosts, which already have a 3-day E@H cache on board. I don't seem to be suffering the fate that I should be :-).

Cheers,
Gary.
ID: 24390
Profile Gary Roberts

Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 24391 - Posted: 17 Jul 2012, 12:10:26 UTC - in response to Message 24387.  

... I can confirm we submit in batches. We have been trying
to reduce the famous "tail" so that we complete a study.

Here is a suggestion for you to consider. Let's say you have a batch of 100K WUs to submit. How easy would it be for you to submit them in sub-batches of, say, 5K WUs at a time, with say 30 minutes (or even longer) between each sub-batch? Perhaps you could design a script that automates this.

There always seem to be a number of hosts that trash tasks fairly quickly. By leaving the time gap, the retries for any such trashed tasks will be inserted into the queue before the second sub-batch is added. This process should continue with each successive sub-batch. By the time the 100K WUs are all added, the retries for a large fraction of the download errors and early compute errors should already be distributed through the queue rather than clumped together at the end. The retries for these could have the normal deadline, since they would be going out concurrently with normal tasks. Also, they could be sent to all hosts rather than just reliable ones.

Much later, the 'deadline miss' retries will be needed. Maybe by using the <reliable_priority_on_over_except_error> flag (described at the link Richard provided), you could assign the higher priority (and much shorter deadline?) to just these retries, rather than to all of them as is the case now? Hopefully there will be significantly fewer of these, and they will come in over a period of time, so no one host should get a large number of them in one hit.
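
In config.xml terms, the suggestion amounts to something like this (a sketch only, assuming the documented flag names, with illustrative values):

    <reliable_priority_on_over>0</reliable_priority_on_over>
        <!-- error-generated retries: normal priority, normal deadline -->
    <reliable_priority_on_over_except_error>5</reliable_priority_on_over_except_error>
        <!-- timeout-generated retries: boosted priority, so they go out
             promptly to reliable hosts with the reduced deadline -->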

Cheers,
Gary.
ID: 24391
Profile Igor Zacharov
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 16 May 11
Posts: 79
Credit: 111,419
RAC: 0
Message 24392 - Posted: 18 Jul 2012, 0:33:02 UTC - in response to Message 24391.  

These are all very good suggestions, and thank you for the analysis.
We use the following configuration flags at the moment:

1
230400
0.100000
0.5
0
1
10
5
1
1
2
3
1
18000
0
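
(The forum rendering has stripped the XML tag names from the list above, so the name-to-value pairing is lost. Purely as a hedged reconstruction, the first few values are consistent with the standard BOINC reliable-host flags, e.g.:

    <reliable_on_priority>1</reliable_on_priority>
    <reliable_max_avg_turnaround>230400</reliable_max_avg_turnaround>
        <!-- 230400 s = 64 h maximum average turnaround to count as 'reliable' -->
    <reliable_max_error_rate>0.100000</reliable_max_error_rate>
    <reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>

but that mapping is a guess, and the remaining values are not reconstructed here.)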

I found one parameter in particular to be critical. It seems the scheduler will only consider hosts "reliable" once they have that many jobs on record. Therefore, bringing it down allows more work to flow.

Please, tell me if you have any other suggestions.

skype id: igor-zacharov
ID: 24392
[AF>FAH-Addict.net]toTOW

Joined: 9 Oct 10
Posts: 77
Credit: 3,671,357
RAC: 0
Message 24399 - Posted: 19 Jul 2012, 6:47:39 UTC

This time, we're out of work ...
ID: 24399
