Long delays in jobs
Author | Message |
---|---|
Send message Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
Since this would be off topic to resource zero, i started a new thread. Igor wrote:
Well you can control this. First send work to 2 people. If a third result is needed for verification due to inconclusive or even inf 1 errored out, you can set the resend to a higher priority. Not sure how you do that exactly,might take a custom modification in your validator. But then there is a setting, i think, for the scheduler to resend higher priorities only to trusted hosts, those hosts that have quick return rates on work. This would reduce lag time on having results keep being sent due to an errors, overloaded caches where hosts wait until deadline trouble to return a result. All those deadlines add up every time a task is resent and 7 days becomes 14, 21, 28 before you have all the results. If it was 7 at most on the first and then resent to a trusted host, most likely you would have the third match on day 8, reduced time in those cases. I'm sure i read that in the documentation somewhere. Your other option would be to only send results out once and any inconclusie or errors process in house. That might be more work on your part than the first option. |
Send message Joined: 9 Aug 05 Posts: 36 Credit: 7,698,293 RAC: 0 |
Don't send jobs to the slower computers. I have an AMD Sempron which is quite slow, and there are a lot of slow Pentium 4s out there. An application for GPUs would also increase the computing power available to the project. And limit it to 1 WU per CPU. That will reduce the time the results take to get back to your servers. |
Send message Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
Igor, I covered this before in the quota thread; I guess you missed it. Here it is again. I think this would help reduce the long waits after the initial flow, helping those scientist types get the whole batch back sooner. There is a section in the documentation called "Accelerating retries". I think you should read this section and use this method as well. Basically, if a host returns bad results, that host is marked unreliable. Hosts that return good results are marked reliable. A bad host can become reliable again if it stops turning in bad work and keeps returning good work. When a result is returned bad, its priority gets increased. This option resends those higher-priority results to known reliable hosts. It does two things: it rewards reliable hosts with a better chance of getting work, and it reduces turnaround by not constantly sending results to other bad hosts. It only affects work needing a resend, i.e. a third, fourth or later try after the initial 2 have had a go at it. You can also mark work in advance as higher priority so that it gets sent only to these reliable hosts, for example if you had some small study you needed a quick turnaround on. |
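As a rough illustration of the reliable-host idea described above, the following Python sketch filters hosts by error rate and average turnaround, then restricts higher-priority tasks to the hosts that pass. It is not the BOINC scheduler's code: the class, the function names and both thresholds are hypothetical, and the turnaround criterion is taken from other posts in this thread rather than from the post above.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
MAX_ERROR_RATE = 0.05
MAX_AVG_TURNAROUND_DAYS = 3.5


@dataclass
class Host:
    host_id: int
    error_rate: float           # fraction of recent results that were bad
    avg_turnaround_days: float  # average time between send and return


def is_reliable(host: Host) -> bool:
    """A host counts as reliable if it is both accurate and quick."""
    return (host.error_rate <= MAX_ERROR_RATE
            and host.avg_turnaround_days <= MAX_AVG_TURNAROUND_DAYS)


def eligible_hosts(hosts, task_priority, reliable_priority=1):
    """Tasks at or above reliable_priority go only to reliable hosts."""
    if task_priority >= reliable_priority:
        return [h for h in hosts if is_reliable(h)]
    return list(hosts)


hosts = [Host(1, 0.01, 1.0), Host(2, 0.30, 1.0), Host(3, 0.01, 6.0)]
first_send = [h.host_id for h in eligible_hosts(hosts, task_priority=0)]
resend = [h.host_id for h in eligible_hosts(hosts, task_priority=1)]
print(first_send, resend)
```

In this toy example every host can receive a first send, but only host 1 qualifies for a resend: host 2 has too many errors and host 3 is too slow. Because the check is re-evaluated from current statistics, a host that improves would qualify again, which matches the behaviour the post describes.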
Send message Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
Keith's suggestion is an excellent idea. Please implement the reliable host idea ASAP. Before you reduce the deadline, why not send an email to every volunteer who has a host that has not contacted the new server? The email would request that they detach from the old URL and attach to the new URL. Many, many volunteers don't read the news or forums. The email would also reach users who are interested in LHC but gave up and detached because there was no work for so many months. I wouldn't mind a deadline of 5 days or even 4 but suggest trying the email first. |
Send message Joined: 2 Sep 08 Posts: 5 Credit: 121,460 RAC: 0 |
I agree with jujube. I read a message in SETI's forum, otherwise I wouldn't know that LHC was again active. |
Send message Joined: 16 May 11 Posts: 79 Credit: 111,419 RAC: 0 |
With help from Keith, I have implemented the reliable/trusted host settings correctly now. I believe the execution turnaround will improve. I will monitor and see. Thank you very much! skype id: igor-zacharov |
Send message Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
With help from Keith, I have implemented the reliable/trusted host settings correctly now. Glad I could help, I learned something too. Whether I can remember it the next time the issue comes up on another project is another question. It may be working. See this work unit. My host #9920670 was the last one sent the task, after 2 failures and 1 completed result, and the copy sent to me went out only seconds after the completed/valid one was returned. It ran almost 8 hours and was returned 8 hours and 13 minutes after being sent, so my host did it promptly and returned it immediately. So apparently the retries are getting accelerated. You'll just have to run some batches, and see if in the end they come back quicker. |
Send message Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
You'll just have to run some batches, and see if in the end they come back quicker. Another way to test to see if it's working is to get a new Sixtrack task, abort it (hopefully before you start it), then check to see whether the deadline on the resend is shorter than normal or just normal. Of course that doesn't tell if the resend is issued with a high priority flag on it, just if the deadline is shorter. Darn, they're out of work atm so can't test it now. |
Send message Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
It is easier to wait and see if I get work. I can't abort all tasks, as some would be on my home or work systems and I can't be in two places at once. I don't have remote access to all hosts. See here: I got another task that had previously been sent 4 times, and it was already 6 days old before getting to me. 2 aborted, 2 inconclusive, and the second person had it 5 days before returning it. My try at it has a deadline of 3.5 days; this is an accelerated retry, as the current setting is to multiply the deadline by 0.5 (7*0.5=3.5). This one happens to be at work, and I'm at home. I will at some point abort one to see if the next try gets accelerated more. I think this one will be half done before I get to work. - Since some of these started on the 4th, the first 7-day deadline is tomorrow. Although this could drag on some, as any resends before yesterday would still be at the old 7-day deadline. Let's watch the number outstanding, now dropping just under 9,000; it should begin to drop rapidly as results wrap up and any resends get accelerated status. This batch will give mixed results. We need to get them all done, then watch the next batch of work closely from the moment it starts to see what happens. |
Send message Joined: 5 Oct 08 Posts: 12 Credit: 1,108,455 RAC: 0 |
It is easier to wait and see if I get work. I can't abort all tasks as some would be on my home or work systems, i can't be in two places at once. I don't have remote access to all hosts. I have one wu on my Intel ATOM, that will be finished in 28 hours. Is there a way to transfer that to a faster CPU or will it help the project if I kill the wu? |
Send message Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
It is easier to wait and see if I get work. I can't abort all tasks as some would be on my home or work systems, i can't be in two places at once. I don't have remote access to all hosts. No, it cannot be transferred. If you kill it, there is a chance it gets sent out to someone who won't complete it until day 3.5. It would be best for you to just complete it as is. The point now is to reduce lag time near the end of the batch. We have reduced the deadline on resends from 7 days to 3.5 days, and they are only supposed to be sent to hosts with quick turnaround times, so if your average is 6 days, sorry, but you will not get any resends. In the long run this will help the scientists by reducing batch wait time. In the past, one would fail at the deadline and get resent, then fail after another 7 days and get resent again; now you're into day 21, then day 28, etc. We have hopefully reduced this to 7-10 days, maybe 13; I'm just guessing it won't go much beyond that. The other option is to shorten that 7-day deadline, which cuts a lot of people out of getting any work at all, so we found a compromise for users and scientists. At this point this is still a trial to see how things go. |
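To put rough numbers on the before/after comparison in the post above, here is a small Python sketch. It assumes the 0.5 deadline multiplier quoted earlier in the thread and, pessimistically, that every attempt runs right up to its deadline; the function name is made up for the example.

```python
from itertools import accumulate


def cumulative_deadlines(first_deadline, resend_factor, attempts):
    """Worst-case day on which each successive attempt reaches its deadline."""
    resend_deadline = first_deadline * resend_factor
    deadlines = [first_deadline] + [resend_deadline] * (attempts - 1)
    return list(accumulate(deadlines))


old_policy = cumulative_deadlines(7, 1.0, 4)  # resends keep the full 7 days
new_policy = cumulative_deadlines(7, 0.5, 4)  # resends get 7 * 0.5 = 3.5 days
print(old_policy)
print(new_policy)
```

Under these assumptions the old policy gives 7, 14, 21, 28 days and the new one gives 7, 10.5, 14, 17.5 days. Real turnaround should be better than this worst case, since resends are meant to go to hosts that return work well before the deadline.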
Send message Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
I wonder how much lag time there is at the beginning of the batch. In between batches when there is no work available (except resends) hosts will go into progressive update backoff and not contact Sixtrack server for up to 24 hours. If for example most hosts are at the 12 hour backoff point (not an unreasonable scenario) when a new batch is released, it will take at least 12 hours before most hosts request work. To avoid that delay, I believe there is a setting on the server that tells hosts to update every X hours. If X were set to 2 hours then the delay at the beginning of the batch would be at most 2 hours. The downside is that such a setting would increase traffic to the server but maybe the server can handle it. edit added: The setting is called next_rpc_delay. |
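The effect described above can be illustrated with a toy Python model: each host has some backoff time left before its next contact, and a server-imposed maximum contact interval caps that wait. The backoff values are invented, the model works in hours for readability, and the units that the real `next_rpc_delay` setting expects should be checked against the BOINC documentation rather than taken from this sketch.

```python
def hours_until_contact(remaining_backoff_hours, max_interval_hours=None):
    """Hours until a host next contacts the server.

    max_interval_hours models a server-side cap on the time between contacts;
    None means no cap, so the host waits out its whole backoff.
    """
    if max_interval_hours is None:
        return remaining_backoff_hours
    return min(remaining_backoff_hours, max_interval_hours)


# Invented remaining backoff times for five hosts when a new batch is released.
backoffs = [1, 6, 12, 12, 24]

without_cap = [hours_until_contact(b) for b in backoffs]
with_cap = [hours_until_contact(b, max_interval_hours=2) for b in backoffs]

print(max(without_cap), max(with_cap))
```

In this example the slowest host takes 24 hours to notice the new batch without a cap and 2 hours with one. The trade-off raised in the post still applies: every host contacts the server more often, so server traffic goes up.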
Send message Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
I wonder how much lag time there is at the beginning of the batch. In between batches when there is no work available (except resends) hosts will go into progressive update backoff and not contact Sixtrack server for up to 24 hours. If for example most hosts are at the 12 hour backoff point (not an unreasonable scenario) when a new batch is released, it will take at least 12 hours before most hosts request work. To avoid that delay, I believe there is a setting on the server that tells hosts to update every X hours. If X were set to 2 hours then the delay at the beginning of the batch would be at most 2 hours. The downside is that such a setting would increase traffic to the server but maybe the server can handle it. I don't think there is that much here. Look at the server status page. Since yesterday when new work was added, already in less than 24 hours, more like 12, 1/2 of it is in progress already. |
Send message Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
I think 75% would be in progress if they had <next_rpc_delay>2</next_rpc_delay> set. To verify this one way or the other would take a lot of runs and a lot of data points. If the server can take the extra traffic it makes sense to turn the function on if they're looking for ways to speed up batch completion without shortening the deadline. |
Send message Joined: 22 Jul 05 Posts: 72 Credit: 3,962,626 RAC: 0 |
You'll just have to run some batches, and see if in the end they come back quicker. Here is an example of a workunit where two copies of the task were issued at approximately the same time and one of those copies was aborted very soon afterwards. It's now about 18 hours after the event and a resend task still remains unsent. Irrespective of the priority level or shortened deadline of a resend, I would have thought it would be good to issue resends rather promptly. This is another example showing exactly the same (no resend after many hours) behaviour. Cheers, Gary. |
Send message Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
I've noticed that too. Apparently resends are inserted at the back of the queue and it's a FIFO queue. If they had gone to the front of the queue they surely would have been issued by now. A prompt reissue would definitely be the right thing IF tasks are generated on the fly, that is results returned now are used to generate tasks to be sent 10 minutes from now. That doesn't appear to be the case at this project. Is it optimal even when tasks are not generated on the fly? Well, the fast-reliable hosts that would have received those resends had they been sent immediately are probably still returning results fast and reliably so there is no time lost there. What will happen is the tail of the queue, where all the resends are presumably accumulating, will be crunched exclusively by fast-reliable hosts if they are using match maker scheduling. In that case, if there are more hi-priority resends than fast-reliables, there will be resends waiting in queue while normal hosts are denied tasks because all remaining tasks are designated for fast-reliables. That might not be optimal. Depends on the ratio of resends to fast-reliables and just how much faster and more reliable a fast-reliable is. |
Send message Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
I've noticed that too. Apparently resends are inserted at the back of the queue and it's a FIFO queue. If they had gone to the front of the queue they surely would have been issued by now. Well no, not from what I read. When a task is reissued, the priority is supposed to increment by 1. Higher-priority tasks are supposed to go out first. It might depend on the shared memory queue already being full, but if the task does say reissue, then it is in the queue waiting to be sent. If it just showed the aborted task and the other task, and not a third one yet, then I'd be worried. I'll have Igor check and we will review the settings again.
This would depend on a host that fits the qualifications contacting the server. Quite possibly, since all hosts might already have a full cache of work, they are busy. With some current tasks running 8 hours, it could be at least 8 hours before a reliable host contacts the server, and it also has to be a different host from the two the task was already sent to. Remember too the one-per-CPU limit: reliable hosts have to finish at least one of what they have before getting another. Also remember resource shares; this might stop some hosts from asking for more work immediately. There are a lot of factors to consider. I'd only get worried if the resend does not disappear within 24 to 36 hours. We are watching the results to see what happens. Maybe some adjustments are necessary, maybe not; give it some time to see what happens before jumping to any conclusion.
At the moment we picked settings that allow a large number of hosts to attempt this. If you have lots of errors you're out, and when you abort a bunch I think your host's reliability drops too. It is per host, not per user. We set it to exclude slow hosts that take longer than the 3.5 days that the 7-day deadline gets reduced to. As this project is known for putting up work that disappears within a day, I don't expect a shortage of reliable hosts to crunch any resends, and really resends should not be that many compared to the total number of tasks in a batch, less than 1%. It is not as if 90% of results need to be sent a third time. |
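The difference between a plain FIFO queue and the priority ordering described in the post above can be shown with a short Python sketch. It is a generic priority queue built on `heapq`, not the scheduler's actual data structure; the class and the task names are hypothetical.

```python
import heapq
from itertools import count


class SendQueue:
    """Tasks leave in order of priority (highest first), FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def add(self, task, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]


queue = SendQueue()
queue.add("wu_a_0")                    # first-time sends, default priority 0
queue.add("wu_b_0")
queue.add("wu_c_2_resend", priority=1)  # a resend, priority incremented by 1

send_order = [queue.pop() for _ in range(3)]
print(send_order)
```

Here the resend is added last but leaves the queue first. Whether it is actually picked up quickly still depends on the factors listed in the post, such as an eligible reliable host contacting the server.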
Send message Joined: 9 Aug 05 Posts: 36 Credit: 7,698,293 RAC: 0 |
I have several tasks which have been "validation inconclusive" for almost 24h now, and haven't been sent to a third host yet. |
Send message Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
Maybe because the scheduler and validator show "NOT RUNNING". |
©2025 CERN