21) Message boards : Number crunching : Long delays in jobs (Message 23501)
Posted 14 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
Here is an example of a workunit where two copies of the task were issued at approximately the same time and one of those copies was aborted very soon afterwards. It's now about 18 hours after the event and a resend task still remains unsent. Irrespective of the priority level or shortened deadline of a resend, I would have thought it would be good to issue resends rather promptly.

This is another example showing exactly the same (no resend after many hours) behaviour.


Hmmmm. Both of those tasks are still unsent after 48 hours. The scheduler has been running every time I've looked for the past 36 hours. It really does look like resends go to the back of the queue.

Well, except that I assumed task IDs were issued as tasks were created. Those are in the 74's and I'm now getting 78's, so if the queue is FIFO they should have gone. Even by date, those were created 11 Oct and the ones I just got were created 13 Oct. So how could it be FIFO if ones created later are being sent and ones created days ago haven't moved?

Also, it would not make sense to put higher priority tasks at the end of the queue; that defeats the purpose of priority.

It seems to me that for some reason the scheduler is totally ignoring them and not sending them anywhere.
22) Message boards : Number crunching : Long delays in jobs (Message 23482)
Posted 12 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
I still have Igor looking into this issue.

It appears all the settings we came up with are correct and working.

Tasks start at priority 0. There are priority 1's in the database; the only way they get there (at this time) is by the resend mechanism, which kicks the priority up by one for each attempted resend. There were no 2's.

The scheduling used is default job scheduling, which according to the documents handles this.

There are ample hosts to handle resends. The criterion of finishing a task in half the normal time (actually less) is met by about half the hosts; eliminate another 10% for those with too many errors and we still have some 3,300 available out of the 7,500 or so making daily credit.
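
As a rough check of those numbers, here is one way the arithmetic could go. It assumes the 10% is taken off the fast half rather than off all hosts, which is only one reading of the figures; the host count and fractions are the approximate ones quoted above.

```python
# Rough estimate of hosts eligible for resends.
# Assumption: the 10% error cut applies to the fast half, not to all hosts.
active_hosts = 7500     # hosts making daily credit (approximate)
fast_fraction = 0.5     # about half can finish in half the normal time
error_fraction = 0.10   # dropped for having too many errors

fast_hosts = active_hosts * fast_fraction
eligible_hosts = fast_hosts * (1 - error_fraction)
print(round(eligible_hosts))  # 3375, in line with "some 3,300"
```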

Don't worry if your host has errors; your host can become reliable again as it turns in good results. The above numbers will of course change every time a new host connects, someone disconnects or more good work is done.

Even counting the number of resends in the database, if every host got an equal share, they would get less than 2 each. That is hard to divide: some might get 1, some 2, if they could be handed out equally. So at this point it seems to me that there are certainly enough hosts to handle any resends quickly. I rather expect some hosts with larger CPUs might get 4, 8 or 12 and some won't get any; it all depends, I guess, on how much work they request and when. Luck of the draw.

I've checked some of the examples people posted in this thread; even my own account has some work units still pending with a third task unsent. None of those examples seem to have been sent. Also, on my account I can't see that any of the completed tasks sent to me was a resend; I was one of the two initial hosts. Not all my hosts fall under the 'reliable' term, which is a misnomer; it is more like 'fast return capable without error'. Still, I would think I would have had at least 1 in the last three days, yet I have had none since the switch to v530.10. There was one on 530.09 just as Igor made the correct settings and just before the version change.

This is the mystery: why are they still unsent? I asked Igor to see if he can figure it out.

ps,
This method of resending only to the fastest hosts is also a 'reward' for doing good work quickly and not returning errors that slow down the results for the scientists. The reward is you get an extra few tasks.
23) Message boards : Number crunching : Long delays in jobs (Message 23477)
Posted 12 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
Maybe because the scheduler and validator show "NOT RUNNING".
24) Message boards : Number crunching : Long delays in jobs (Message 23474)
Posted 12 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
I've noticed that too. Apparently resends are inserted at the back of the queue and it's a FIFO queue. If they had gone to the front of the queue they surely would have been issued by now.


Well no, not from what I read. When a task is reissued, the priority is supposed to increment by 1.

Higher priority tasks are supposed to go out first. It might depend on the shared memory queue already being full, but if the task does say reissue, then it is in the queue waiting to be sent. If it just showed the aborted task and the other task, and not a third one yet, then I'd be worried.
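
A toy Python sketch of that ordering, for illustration. It is not the scheduler's code; the class, method and task names are made up, and it only assumes what is described in this thread: tasks start at priority 0, a resend has its priority bumped by 1, and higher priority goes out first, oldest first within the same priority.

```python
import heapq

class SendQueue:
    """Toy send queue: highest priority first, then oldest first."""

    def __init__(self):
        self._heap = []

    def add(self, task_id, created, priority=0):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, created, task_id))

    def next_task(self):
        _neg_priority, _created, task_id = heapq.heappop(self._heap)
        return task_id

queue = SendQueue()
queue.add("task_78a", created=13)              # new work, priority 0
queue.add("task_74a", created=11, priority=1)  # resend, priority bumped to 1
queue.add("task_78b", created=13)

print(queue.next_task())  # task_74a: the resend goes out before newer work
```

With a plain FIFO queue the resend would also go first here, because it was created earlier. Either way, an older resend sitting unsent while newer tasks go out is not what this ordering predicts.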

I'll have Igor check and we will review the settings again.


A prompt reissue would definitely be the right thing if tasks are generated on the fly, that is, if results returned now are used to generate tasks to be sent 10 minutes from now. That doesn't appear to be the case at this project.


This would depend on a host that meets the qualifications contacting the server. Quite possibly all hosts already have a full cache of work and are busy. With some current tasks running 8 hours, it could be at least 8 hours before a reliable host contacts the server, and it also has to be a different host from the two the task was already sent to.

Remember too the one-per-CPU limit: reliable hosts have to finish at least one of what they have before getting another. Also remember resource shares; this might keep some hosts from asking for more work immediately. There are a lot of factors to consider.

I'd only get worried if the resend does not disappear within 24 to 36 hours.

We are watching the results to see what happens. Maybe some adjustments are necessary, maybe not; give it some time before jumping to any conclusions.


Is it optimal even when tasks are not generated on the fly? Well, the fast-reliable hosts that would have received those resends had they been sent immediately are probably still returning results fast and reliably so there is no time lost there. What will happen is the tail of the queue, where all the resends are presumably accumulating, will be crunched exclusively by fast-reliable hosts if they are using match maker scheduling. In that case, if there are more hi-priority resends than fast-reliables, there will be resends waiting in queue while normal hosts are denied tasks because all remaining tasks are designated for fast-reliables. That might not be optimal. Depends on the ratio of resends to fast-reliables and just how much faster and more reliable a fast-reliable is.

At the moment we picked settings that allow a large number of hosts to attempt this. If you have lots of errors you're out; also, when you abort a bunch I think your host's reliability drops. It is per host, not per user. We set it to exclude slow hosts that take longer than the 3.5 days that the 7 day deadline is going to be reduced to. As this project is known for putting up work that disappears within a day, I don't see a problem with there not being enough reliable hosts to crunch any resends, and really resends should not be that many compared to the total number of tasks in a batch, less than 1%. It is not as if 90% of results need to be sent a third time.
25) Message boards : Number crunching : Long delays in jobs (Message 23457)
Posted 11 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
I wonder how much lag time there is at the beginning of the batch. In between batches when there is no work available (except resends) hosts will go into progressive update backoff and not contact Sixtrack server for up to 24 hours. If for example most hosts are at the 12 hour backoff point (not an unreasonable scenario) when a new batch is released, it will take at least 12 hours before most hosts request work. To avoid that delay, I believe there is a setting on the server that tells hosts to update every X hours. If X were set to 2 hours then the delay at the beginning of the batch would be at most 2 hours. The downside is that such a setting would increase traffic to the server but maybe the server can handle it.

edit added:

The setting is called next_rpc_delay.
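
A quick sketch of that argument in Python. The function name is made up and the model is deliberately simple: it assumes a host next contacts the server either when its own backoff expires or when a server-imposed maximum interval runs out, whichever comes first. Hours are used only because the post does.

```python
def hours_until_next_contact(backoff_remaining_h, max_interval_h=None):
    """Hours until a host next asks the server for work.

    backoff_remaining_h: time left on the client's own backoff.
    max_interval_h: server-imposed upper bound between contacts
                    (the role next_rpc_delay plays in the post), or None.
    """
    if max_interval_h is None:
        return backoff_remaining_h
    return min(backoff_remaining_h, max_interval_h)

print(hours_until_next_contact(12))     # 12: no server-side limit
print(hours_until_next_contact(12, 2))  # 2: capped by the 2-hour setting
```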


I don't think there is that much here. Look at the server status page.

New work was added yesterday, and in less than 24 hours (more like 12) half of it is already in progress.
26) Message boards : Number crunching : Long delays in jobs (Message 23437)
Posted 10 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
It is easier to wait and see if I get work. I can't abort all tasks as some would be on my home or work systems; I can't be in two places at once. I don't have remote access to all hosts.

See here

I got another task previously sent 4 times, and it was already 6 days old before getting to me. 2 aborted, 2 inconclusive; the second person had it 5 days before returning it. My try at it has a deadline of 3.5 days. This is an accelerated retry, as the current setting is to multiply the deadline by 0.5 (7*0.5=3.5).

This one happens to be at work and I'm at home. I will at some point abort one to see if the next try gets accelerated more. I think this one will be half done before I get to work.

-

Being that some of these started on the 4th, the first 7 day deadline is tomorrow. This could drag on some, though, as any resends before yesterday would be at the old deadline with still 7 days. Let's watch the number outstanding, now dropping just under 9,000; it should begin to drop rapidly as results wrap up and any resends get accelerated status. This batch will give mixed results. We need to get them all done, then watch the next batch of work closely from when it starts to see what happens.


I have one wu on my Intel ATOM, that will be finished in 28 hours.
Is there a way to transfer that to a faster CPU or will it help the project if I kill the wu?

No, it cannot be transferred. If you kill it there is a chance it gets sent out to someone who won't complete it until day 3.5. It would be best for you to just complete it as is.

The point now is to reduce lag time near the end of the batch. We have reduced the deadline on resends from 7 days to 3.5 days, and they are only supposed to be sent to hosts with quick turnaround times, so if your average is 6 days, sorry but you will not get any resends. In the long run this will help the scientists by reducing batch wait time. In the past, one would fail at deadline, get resent, fail after another 7 days, and then get resent again; now you're into day 21, then day 28, etc. We have hopefully reduced this to 7-10 days, maybe 13; just guessing, it won't go much beyond that. The other option is to shorten that 7 day deadline, which cuts a lot of people out of getting any work at all, so we found a compromise for users and scientists. At this point this is still a trial to see how things go.
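
The worst case can be put into a few lines of Python. This is only a sketch under stated assumptions: every try runs right up to its deadline before the next one is issued, and every resend gets the same reduced deadline (whether later resends get accelerated further is not settled here). The function and parameter names are made up.

```python
def worst_case_days(attempts, first_deadline=7.0, resend_factor=1.0):
    """Worst-case days until the last of `attempts` sequential tries reports,
    assuming every try runs to its deadline before the next is issued."""
    resend_deadline = first_deadline * resend_factor
    return first_deadline + (attempts - 1) * resend_deadline

for attempts in (2, 3, 4):
    old = worst_case_days(attempts)                     # every try gets 7 days
    new = worst_case_days(attempts, resend_factor=0.5)  # resends get 3.5 days
    print(attempts, old, new)
```

For 3 and 4 tries the old scheme gives 21 and 28 days, against 14 and 17.5 with halved resend deadlines. In practice the resends should finish sooner than that, since they go to hosts that normally return work well before the deadline.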
27) Message boards : Number crunching : Long delays in jobs (Message 23435)
Posted 10 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
It is easier to wait and see if I get work. I can't abort all tasks as some would be on my home or work systems; I can't be in two places at once. I don't have remote access to all hosts.

See here

I got another task previously sent 4 times, and it was already 6 days old before getting to me. 2 aborted, 2 inconclusive; the second person had it 5 days before returning it. My try at it has a deadline of 3.5 days. This is an accelerated retry, as the current setting is to multiply the deadline by 0.5 (7*0.5=3.5).
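
As a small illustration of how that deadline works out on the calendar, here is a Python sketch. The function, its parameters and the send time are made up; only the 7 day deadline and the 0.5 multiplier come from the settings described above.

```python
from datetime import datetime, timedelta

def report_deadline(sent, delay_bound_days=7.0, is_resend=False, resend_factor=0.5):
    """Report deadline for a task sent at `sent`; resends get a shortened one."""
    days = delay_bound_days * (resend_factor if is_resend else 1.0)
    return sent + timedelta(days=days)

sent = datetime(2011, 10, 10, 12, 0)          # hypothetical send time
print(report_deadline(sent))                  # 2011-10-17 12:00:00
print(report_deadline(sent, is_resend=True))  # 2011-10-14 00:00:00
```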

This one happens to be at work and I'm at home. I will at some point abort one to see if the next try gets accelerated more. I think this one will be half done before I get to work.

-

Being that some of these started on the 4th, the first 7 day deadline is tomorrow. This could drag on some, though, as any resends before yesterday would be at the old deadline with still 7 days. Let's watch the number outstanding, now dropping just under 9,000; it should begin to drop rapidly as results wrap up and any resends get accelerated status. This batch will give mixed results. We need to get them all done, then watch the next batch of work closely from when it starts to see what happens.
28) Message boards : Number crunching : Tasks v530.09 crashing (Message 23430)
Posted 9 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
Yeah something is not working.

I checked one of the x32 hosts; it shows 530.09 as running in BOINC Manager.

The task shows a v0.00 on the website.

I also looked into the lhc folder and slots for the task, there is only a 530.9 application.

On the website, tasks also show as v530.08 and then v0.00 for the x32 hosts. No v530.09s appear in the task lists, but for the x64 hosts it only shows v530.09 as run. Very odd?
29) Message boards : Number crunching : Problem after updating computing profile (Message 23428)
Posted 9 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
Note to Igor,

Yes "computing preferences" are missing 2 items.
Since we determined your server software is now a year old, I guess there have been some improvements you are missing.

In computing preferences (missing):
Under processor usage:
Suspend work when non-BOINC CPU usage is above
0 means no restriction
Enforced by version 6.10.30+
Default is 25%

Under network preferences:
Transfer at most --- Mbytes every --- days
Enforced by version 6.10.46+

I did check and test4theory has the new items. I checked some others; some do, some don't. If anyone else has problems, try ageless' suggestion or find another project you are attached to and use that project as your master preferences as diederiks did.
30) Message boards : Number crunching : Long delays in jobs (Message 23427)
Posted 9 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
with help of Keith, I have implemented the reliable/trusted host settings correctly now.
I believe the execution turn-around will improve. Will monitor and see.

Thank you much!


Glad I could help, I learned something too. Whether I can remember it the next time the issue comes up on another project is another question.

It may be working.
See work unit
My host # 9920670 was the last one sent the task, after 2 failures and 1 completed. The one sent to me went out only seconds after the 1 completed/valid result was returned. It ran almost 8 hours and the return time is 8 hours and 13 minutes after being sent, so my host did it promptly and returned it immediately.

So apparently the retries are getting accelerated.

You'll just have to run some batches, and see if in the end they come back quicker.
31) Message boards : Number crunching : Tasks v530.09 crashing (Message 23426)
Posted 9 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
we don't have much architectural choices when specifying which app version to run.

I have now retracted 530.9 (deleted) for all generic x86 Windows and Linux,
leaving 530.9 specifically only for platforms which report with AMD_x86_64 and Intel EM64T processors back to the server.

Please, check if that works for you.



I guess it is working.

My x64's have no work, but the last ones done were 530.09

My x32 are showing v0.00

I also see "Database Error" appear now on a lot of the website task pages when viewing results. Some pages say it twice. example
32) Message boards : Number crunching : Cuda App. (Message 23414)
Posted 8 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
Read the post by Eric

Also this question was already asked 2 weeks ago in thread 3361
33) Message boards : Number crunching : Tasks v530.09 crashing (Message 23408)
Posted 7 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
I don't know if the 32-bit apps were compiled that way.


The app sent to my Linux 64 bit machine is 32 bit. Run the file command against it and see. Don't know if the Windows app for 64 bit arch is 32 or 64.
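
On Linux, part of what the file command reports can be approximated in a few lines of Python. This is a minimal sketch, not a replacement for file: it only reads the ELF identification bytes, where the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. The function name is made up.

```python
def elf_bits(path):
    """Return 32 or 64 for an ELF file, or None if it is not a recognised ELF."""
    with open(path, "rb") as f:
        ident = f.read(5)
    # Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
    if len(ident) < 5 or ident[:4] != b"\x7fELF":
        return None
    # Byte 4 is EI_CLASS: 1 = 32-bit, 2 = 64-bit.
    return {1: 32, 2: 64}.get(ident[4])
```

Calling elf_bits on the application binary in the project folder would return 32 for a 32-bit build, even on a 64-bit machine.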


In the Windows 7 x64 Task Manager, a lot of programs have *32 next to them, including sixtrack. I assume that means 32 bit.


Applications:
Microsoft Windows (98 or later) running on an Intel x86-compatible CPU 	530.09 	4 Oct 2011 15:31:31 UTC
Microsoft Windows running on an AMD x86_64 or Intel EM64T CPU 	530.09 	4 Oct 2011 15:31:31 UTC
Linux running on an Intel x86-compatible CPU 	530.09 	4 Oct 2011 15:31:31 UTC
Linux running on an AMD x86_64 or Intel EM64T CPU 	530.09 	4 Oct 2011 15:31:31 UTC


The above means they support those archs but doesn't necessarily mean they have a 64 bit app for 64 bit arch.

34) Message boards : Number crunching : Too low credits granted in LHC (Message 23407)
Posted 7 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
Please keep this discussion on topic about credits.
=
The other situation is being dealt with and does not need discussing.
=
=
Additional reminder, please everyone read forum rules, they appear to the left when you compose a post, there is also a link "more info" for you to read.
35) Message boards : Number crunching : Tasks v530.09 crashing (Message 23384)
Posted 6 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
As I see your other computers are returning 530.09 OK, let's assume it is the laptop only.

I also see the wingmen on those tasks completed OK and are awaiting validation, so that kind of eliminates bad work units.

Also, to find out what error codes mean, you can look at the result detail once it is returned to the project. Then you can go to the unofficial BOINC wiki and search for the error code; there is a list of error codes that sometimes gives more detail on a certain error.

-168 is ERR_FTOK

I don't know what that means. Searching the net, I found:

ERR_FTOK -168

BOINC cannot get file token (key) for semaphores.

I still don't know this one.

Have you restarted BOINC or Windows since the errors?

Always try the simple things first.

If so and those don't help, then try a project reset on the laptop, quite possibly the download of a new app went wrong. A reset will download again and that may have been the problem.
36) Message boards : Number crunching : Long delays in jobs (Message 23381)
Posted 6 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
Igor
I had covered this before in the quota thread; I guess you missed it.

Here it is again. I think this would help reduce the long times after the initial flow, helping those scientist types get the whole batch back sooner.

There is a section in the documentation called "Accelerating retries"
I think you should read this section and use this method too. Basically, if a host returns bad results, that host is marked unreliable. Hosts that return good results are marked reliable. A bad host can become reliable again if it stops turning in bad work and keeps returning good work. When a result comes back bad, its priority gets increased. This option resends those higher priority results to known reliable hosts. It does two things: it rewards reliable hosts with a better chance of available work, and it reduces turnaround by not constantly sending results to other bad hosts. It only affects work needing a resend, i.e. a third, fourth or later try after the initial 2 have had a go at it. You can also mark work in advance as higher priority so it gets sent only to these reliable hosts, for example if you had some small study you need quick turnaround on.
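
The idea can be sketched in Python like this. It is an illustration only, not the server's code: the names are made up, the error-rate threshold is a hypothetical value, and the 3.5 day turnaround limit is borrowed from the halved 7 day deadline discussed in this thread.

```python
from dataclasses import dataclass

@dataclass
class Host:
    host_id: int
    avg_turnaround_days: float
    error_rate: float  # fraction of recent results that were bad

# Hypothetical thresholds, for illustration only.
MAX_TURNAROUND_DAYS = 3.5
MAX_ERROR_RATE = 0.05

def is_reliable(host):
    return (host.avg_turnaround_days <= MAX_TURNAROUND_DAYS
            and host.error_rate <= MAX_ERROR_RATE)

def can_receive(host, task_priority, already_sent_to):
    if host.host_id in already_sent_to:
        return False  # must be a different host from those that already had it
    if task_priority > 0:
        return is_reliable(host)  # higher priority resends: reliable hosts only
    return True  # ordinary work can go to anyone

hosts = [Host(1, 1.0, 0.0), Host(2, 6.0, 0.0), Host(3, 0.5, 0.30), Host(4, 2.0, 0.01)]
eligible = [h.host_id for h in hosts
            if can_receive(h, task_priority=1, already_sent_to={1})]
print(eligible)  # [4]
```

Host 1 already had the work unit, host 2 is too slow and host 3 has too many errors, so only host 4 qualifies for the resend. All four could still receive ordinary priority 0 work they have not already been sent.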
37) Message boards : Number crunching : Long delays in jobs (Message 23380)
Posted 6 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
And limit 1 wu/cpu.


That limit is already in use, See the Daily Quota discussion.
38) Message boards : Number crunching : Long delays in jobs (Message 23378)
Posted 6 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
Since this would be off topic to resource zero, I started a new thread.

Igor wrote:

...

This brings me to another thought, however. Physicists complain about the long
outliers of a study. The bulk of first jobs comes quickly, then last jobs take very long, since they are sitting and waiting somewhere.
We may need to tune the system to squeeze in tighter deadlines.

I would like to collect opinions about this first.

Well you can control this.

First send work to 2 people.

If a third result is needed for verification because of an inconclusive result, or even if 1 errored out, you can set the resend to a higher priority. Not sure how you do that exactly; it might take a custom modification in your validator.

But then there is a setting, I think, for the scheduler to resend higher priorities only to trusted hosts, those hosts that have quick return rates on work.

This would reduce the lag from results being sent out again and again due to errors or overloaded caches, where hosts wait until deadline trouble to return a result. All those deadlines add up every time a task is resent, and 7 days becomes 14, 21, 28 before you have all the results. If it was 7 at most on the first try and then resent to a trusted host, most likely you would have the third match on day 8, reducing the time in those cases.

I'm sure I read that in the documentation somewhere.

Your other option would be to only send results out once and process any inconclusives or errors in house. That might be more work on your part than the first option.
39) Message boards : Number crunching : In my team, but not in my team. (Message 23320)
Posted 2 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
If I do that, will I still be the founder when I return?

Should be. It happened to me, and I was still founder after rejoining the team I was already on and founder of.

40) Message boards : Number crunching : In my team, but not in my team. (Message 23317)
Posted 2 Oct 2011 by Profile Krunchin-Keith [USA]
Post:
I appear to be in a state of flux team wise. If I look at my account, I can see my credit etc. and that I am a member, indeed, the founder of my team. Yet I do not appear in the teams stats. On my personal side, I see I am...

>>> Founder but not member of

... somewhat odd. I can see my team mates there.

There was a problem when they imported members and teams, and some entries in the database are not correct. Just rejoin the correct team and it should be OK from then on.




©2024 CERN