Message boards : Number crunching : Long delays in jobs

Filipe

Joined: 9 Aug 05
Posts: 36
Credit: 7,693,055
RAC: 146
Message 23478 - Posted: 12 Oct 2011, 16:27:14 UTC - in response to Message 23477.  

The scheduler shows as "RUNNING" on the status page.
ID: 23478
KAMasud

Joined: 7 Oct 06
Posts: 114
Credit: 23,192
RAC: 0
Message 23480 - Posted: 12 Oct 2011, 18:25:17 UTC - in response to Message 23477.  

LoL. You want someone to restart crunching?
ID: 23480
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23482 - Posted: 12 Oct 2011, 20:09:12 UTC
Last modified: 13 Oct 2011, 14:01:25 UTC

I still have Igor looking into this issue.

It appears all the settings we came up with are correct and working.

Tasks start at priority 0. There are priority 1s in the database; the only way they get there (at this time) is through the resend mechanism, which kicks the priority up by one for each attempted resend. There were no 2s.
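
To put that rule in code form (just an illustrative Python sketch with made-up names, not the actual server logic):

def resend_priority(base_priority, resend_attempts):
    # Each attempted resend bumps the priority by one, so a task that
    # starts at 0 appears as priority 1 after its first resend, 2 after
    # a second, and so on.
    return base_priority + resend_attempts

print(resend_priority(0, 1))   # 1 - the priority-1 entries seen in the database
print(resend_priority(0, 2))   # 2 - none of these were present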

The scheduling used is the default job scheduling, which according to the documentation handles this.

There are ample hosts to handle resends. The criterion is finishing a task in half the normal time (actually less), which covers about half the hosts; eliminate another 10% for those with too many errors and we still have some 3,300 available out of the 7,500 or so earning daily credit.
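
Roughly, that "reliable host" filter amounts to something like this (an illustrative Python sketch of the criteria above; the threshold values and field names are assumptions, not the project's actual settings):

def eligible_for_resend(host, normal_deadline_days=7.0):
    # Must turn work around in half the normal time (or less) and not have
    # too high an error rate; the "eliminate another 10%" cut above is
    # approximated here with an assumed error-rate threshold.
    fast_enough = host["avg_turnaround_days"] <= normal_deadline_days / 2.0
    low_errors = host["error_rate"] <= 0.10
    return fast_enough and low_errors

hosts = [
    {"id": 1, "avg_turnaround_days": 0.9, "error_rate": 0.02},
    {"id": 2, "avg_turnaround_days": 5.0, "error_rate": 0.01},
]
reliable = [h for h in hosts if eligible_for_resend(h)]   # only host 1 qualifies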

Don't worry if your host has errors; it can become reliable again as it turns in good results. The above numbers will of course change every time a new host connects, someone disconnects, or more good work is done.

Even counting the number of resends in the database, if every host got an equal share they would get fewer than 2 each; some might get 1, some 2, if they could be handed out equally, that is. So at this point it seems to me there are certainly enough hosts to handle any resends quickly. More likely some hosts with larger CPUs will get 4, 8 or 12 and some won't get any; I guess it all depends on how much work they request and when - luck of the draw.

I've checked some of the examples people posted in this thread; even my own account has some work units still pending with a third copy unsent. None of those examples seem to have been sent. Also, on my account I can't see that any of the completed tasks sent to me were resends; I was one of the two initial copies. Not all my hosts fall under the "reliable" term, which is a misnomer - it is more like "capable of fast return without error". So I would think I would have had at least 1 in the last three days, not none, not since the switch to v530.10. There was one on 530.09, just as Igor made the correct settings and just before the version change.

This is the mystery - why are they still unsent? I've asked Igor to see if he can figure it out.

PS:
This method of resending only to the fastest hosts is also a "reward" for doing good work quickly and not returning errors that slow down the results for the scientists. The reward is that you get a few extra tasks.
ID: 23482
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23498 - Posted: 14 Oct 2011, 4:17:27 UTC - in response to Message 23464.  

Here is an example of a workunit where two copies of the task were issued at approximately the same time and one of those copies was aborted very soon afterwards. It's now about 18 hours after the event and a resend task still remains unsent. Irrespective of the priority level or shortened deadline of a resend, I would have thought it would be good to issue resends rather promptly.

This is another example showing exactly the same (no resend after many hours) behaviour.


Hmmmm. Both of those tasks are still unsent after 48 hours. The scheduler has been running every time I've looked for the past 36 hours. It really does look like resends go to the back of the queue.
ID: 23498
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23501 - Posted: 14 Oct 2011, 12:06:25 UTC - in response to Message 23498.  

Here is an example of a workunit where two copies of the task were issued at approximately the same time and one of those copies was aborted very soon afterwards. It's now about 18 hours after the event and a resend task still remains unsent. Irrespective of the priority level or shortened deadline of a resend, I would have thought it would be good to issue resends rather promptly.

This is another example showing exactly the same (no resend after many hours) behaviour.


Hmmmm. Both of those tasks are still unsent after 48 hours. The scheduler has been running every time I've looked for the past 36 hours. It really does look like resends go to the back of the queue.

Well, except that I assumed task IDs were issued as tasks were created. Those are in the 74s and I'm now getting 78s, so if the queue is FIFO they should have gone. Even by date, those were created 11 Oct and the ones I just got were created 13 Oct. So how could it be FIFO if ones created later are being sent while ones created days ago haven't moved?

Also, it would not make sense to put higher-priority tasks at the end of the queue; that defeats the purpose of priority.

It seems to me that for some reason the scheduler is totally ignoring them and not sending them anywhere.
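
To make the comparison concrete, here is a toy sketch (Python, purely illustrative - the real feeder/scheduler logic is more involved, and the IDs and dates are made up) of the two orderings being discussed:

tasks = [
    {"id": 74001, "priority": 1, "created": "2011-10-11"},   # waiting resend
    {"id": 78002, "priority": 0, "created": "2011-10-13"},   # freshly created task
]

fifo_order = sorted(tasks, key=lambda t: t["created"])
priority_order = sorted(tasks, key=lambda t: (-t["priority"], t["created"]))

# Either ordering - strict FIFO or priority-first - would hand out 74001
# before 78002, so the fact that newer, lower-priority tasks are going out
# while the resends sit there suggests the resends are being skipped entirely.
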
ID: 23501
Filipe

Joined: 9 Aug 05
Posts: 36
Credit: 7,693,055
RAC: 146
Message 23502 - Posted: 14 Oct 2011, 12:42:05 UTC

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=323462

Check this task. Almost 72 hours have passed without it being reissued.


Is it LIFO?
ID: 23502
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23512 - Posted: 15 Oct 2011, 1:19:49 UTC - in response to Message 23502.  

I think if it were LIFO we would receive tasks created now followed by tasks created earlier, with that pattern repeating as new work arrives. I think they're being ignored. It will be interesting to see what happens when the batch nears the end and then runs down to zero tasks.
ID: 23512
Filipe

Joined: 9 Aug 05
Posts: 36
Credit: 7,693,055
RAC: 146
Message 23516 - Posted: 15 Oct 2011, 10:02:35 UTC

It has just been reissued, now that the "tasks ready to send" are running low!

But with this LIFO configuration, that's a long and unnecessary wait for the scientists.

If these tasks had a FIFO configuration, the older batch would finish faster.

I think if it were LIFO we would receive tasks created now followed by tasks created earlier, with that pattern repeating as new work arrives. I think they're being ignored. It will be interesting to see what happens when the batch nears the end and then runs down to zero tasks.

ID: 23516
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23518 - Posted: 15 Oct 2011, 10:39:02 UTC
Last modified: 15 Oct 2011, 10:39:18 UTC

Yes, I see they are moving now. But the mystery is still why there was such a long delay.

One of the eight I got as resends started as a 530.08 on the 4th, so it took 10 days to get resent. This is not good and is certainly not "accelerated".

But looking at the deadlines of the resends, they are shortened by a factor of 0.5, which is the value we used, so the 7 days gets shortened to 3.5 days.

So part of the mechanism is working. If that part works, then the priority should also be increased.
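
In sketch form (illustrative Python only, using the 0.5 factor mentioned above; the names are made up), a resend should get both adjustments:

RESEND_DEADLINE_FACTOR = 0.5   # the value we set

def expected_resend(deadline_days, priority):
    # The shortened deadline is what we observe (7 -> 3.5 days);
    # the bumped priority is what should also be happening.
    return deadline_days * RESEND_DEADLINE_FACTOR, priority + 1

print(expected_resend(7.0, 0))   # (3.5, 1)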

I went back to see what hosts they were sent to, and 4 of them disappeared. Oh, my one host had already finished and returned them while I was writing, but it has the other four and is working on those. All 8 went to one host. I was hoping to see some on other hosts so I could look at their turnaround times; the one host has a less-than-a-day average, so it at least falls within the parameters we set.

My one host doing those now got plenty of other tasks from here in the last four days; it would have had plenty of time to do these resends had they been sent first, but it was given other work.

This still leaves the question of why the scheduler waited; if the tasks had a higher priority they should have moved first.
ID: 23518
Siegfried Niklas

Joined: 8 Oct 11
Posts: 4
Credit: 183,067
RAC: 0
Message 23519 - Posted: 15 Oct 2011, 11:45:07 UTC

Looking for a WU with "validation inconclusive" I found this host.

Looks like it became suddenly unstable.

Within the next hour the host most likely crashed all ~450 WUs immediately after download (maximum daily WU quota per CPU = 80, which for a 6-core CPU allows more than 460 per day).

Does anybody know of a way to identify this kind of rapid WU turnover automatically and stop it earlier (server side)?
ID: 23519
Filipe

Joined: 9 Aug 05
Posts: 36
Credit: 7,693,055
RAC: 146
Message 23520 - Posted: 15 Oct 2011, 13:53:05 UTC

We may need 1 or 2 days without a new batch to clear this out.
ID: 23520
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23521 - Posted: 15 Oct 2011, 18:17:01 UTC

LOL! Even my slow old Celeron received a high-priority resend (it has a 3-day deadline)!! So it doesn't take much to be deemed fast and reliable around here, just a stable machine and a very small cache. That Celeron runs Sixtrack and Test4Theory, each with a 50% resource share and a 0.1-day cache. It isn't fast, but I guess the turnaround time is good enough.

I can't explain why the resends are delayed to the end of the batch, but I really don't think it matters because they get sent to fast, reliable hosts anyway. When they get resent only matters if the project is generating WUs on the fly. They're not generating on the fly here, so there is no disadvantage in delaying them until the end of the batch.
ID: 23521
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23522 - Posted: 15 Oct 2011, 18:27:59 UTC - in response to Message 23519.  

Looking for a WU with "validation inconclusive" I found this host.

Looks like it became suddenly unstable.

Within the next hour the host most likely crashed all ~450 WUs immediately after download (maximum daily WU quota per CPU = 80, which for a 6-core CPU allows more than 460 per day).

Does anybody know of a way to identify this kind of rapid WU turnover automatically and stop it earlier (server side)?


Why stop this host? Since this host spends only 5 seconds on a task, it is efficiently converting WUs to a 3 day deadline. Therefore it is speeding up the completion of the batch. Yes, there is a small waste of bandwidth but it's very small.
ID: 23522
Siegfried Niklas

Joined: 8 Oct 11
Posts: 4
Credit: 183,067
RAC: 0
Message 23523 - Posted: 15 Oct 2011, 20:07:22 UTC - in response to Message 23522.  
Last modified: 15 Oct 2011, 20:39:26 UTC

Looking for a WU with "validation inconclusive" I found this host.

Looks like it became suddenly unstable.

Within the next hour the host most likely crashed all ~450 WUs immediately after download (maximum daily WU quota per CPU = 80, which for a 6-core CPU allows more than 460 per day).

Does anybody know of a way to identify this kind of rapid WU turnover automatically and stop it earlier (server side)?


Why stop this host? Since this host spends only 5 seconds on a task, it is efficiently converting WUs to a 3 day deadline. Therefore it is speeding up the completion of the batch. Yes, there is a small waste of bandwidth but it's very small.



I am not interested in theories - I am only oriented towards practice!

The WU 366581 is still "Unsent" - 450 others may also be unsent? WHY?? - and how can that be avoided?

AND "jujube":

"LOL!"

- what was that about??


Update: WU 366581 has been "In progress" since 15 Oct 2011 20:28:35 UTC.


Crunching Family - Forum
ID: 23523
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23524 - Posted: 15 Oct 2011, 22:42:21 UTC - in response to Message 23519.  
Last modified: 15 Oct 2011, 22:53:36 UTC

Looking for a WU with "validation inconclusive" I found this host.

Looks like it became suddenly unstable.

Within the next hour the host most likely crashed all ~450 WUs immediately after download (maximum daily WU quota per CPU = 80, which for a 6-core CPU allows more than 460 per day).

Does anybody know of a way to identify this kind of rapid WU turnover automatically and stop it earlier (server side)?

No, nothing automatic except the quota system, which is designed to do exactly that. The only way to reduce this situation is to lower the quota for everybody. But then, when a batch of short-running tasks is issued, it would be easy to meet a day's quota in minutes and not earn much credit for the rest of the day; hence the current quota of 80.

If a host is producing errors, that host's quota is automatically lowered towards 1 until it produces valid results; then it can earn a higher quota again, back up to the maximum set by the project.
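
As a rough sketch of that behaviour (illustrative Python; the exact step sizes are an assumption, not the project's actual code):

MAX_QUOTA = 80   # the current per-CPU daily quota

def update_daily_quota(quota, result_ok):
    # Errors push the quota down towards 1; valid results let it grow
    # back up to the project maximum.
    if result_ok:
        return min(MAX_QUOTA, quota * 2)
    return max(1, quota // 2)

quota = 80
for ok in [False, False, False, True, True]:
    quota = update_daily_quota(quota, ok)
# quota falls 80 -> 40 -> 20 -> 10, then recovers to 20 and 40
# once valid work starts coming back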

The only way to stop this kind of thing is for users to stop hiding their hosts; then someone could send them a PM saying "hey, check your host". Maybe they will also see that they are not earning much credit and look into it.

PS:
The latest check shows all tasks have been issued. Now let's see how long it takes to return those 24,000 still processing.
ID: 23524
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23525 - Posted: 15 Oct 2011, 23:35:21 UTC - in response to Message 23523.  

Looking for a WU with "validation inconclusive" I found this host.

Looks like it became suddenly unstable.

Within the next hour the host most likely crashed all ~450 WUs immediately after download (maximum daily WU quota per CPU = 80, which for a 6-core CPU allows more than 460 per day).

Does anybody know of a way to identify this kind of rapid WU turnover automatically and stop it earlier (server side)?


Why stop this host? Since this host spends only 5 seconds on a task, it is efficiently converting WUs to a 3 day deadline. Therefore it is speeding up the completion of the batch. Yes, there is a small waste of bandwidth but it's very small.



I am not interested in theories - I am only oriented towards practice!

The WU 366581 is still "Unsent" - 450 others may also be unsent? WHY?? - and how can that be avoided?


Oriented only towards practice? Well, it is plain to see that in practice that host is converting tasks from a 7-day deadline to a 3-day deadline rather efficiently. That is a fact. Another fact is that reducing tasks to a 3-day deadline will speed up completion of the batch. That's not theory, that's fact.

The fact that they were/are unsent is not a big concern, because they'll get resent eventually. There is no advantage in sending them immediately unless the project is generating WUs on the fly, which they are not. If they hadn't been converted to a 3-day deadline they might have waited up to 7 days to be returned. Thus the batch has been sped up. Do you have any problem with that, on either a practical or a theoretical level?


AND "jujube":

"LOL!"

- what was that about??


I was amused, in practice and not theoretically, by the fact that my slow old Celeron has been deemed fast and reliable. Anything wrong with that "Siegfried Niklas"?

ID: 23525
Zapped Sparky

Joined: 22 Oct 08
Posts: 26
Credit: 75,214
RAC: 0
Message 23527 - Posted: 16 Oct 2011, 1:28:39 UTC - in response to Message 23524.  

Looking for a WU with "validation inconclusive" I found this host.

Looks like it became suddenly unstable.

Within the next hour the host most likely crashed all ~450 WUs immediately after download (maximum daily WU quota per CPU = 80, which for a 6-core CPU allows more than 460 per day).

Does anybody know of a way to identify this kind of rapid WU turnover automatically and stop it earlier (server side)?

No, nothing automatic except the quota system, which is designed to do exactly that. The only way to reduce this situation is to lower the quota for everybody. But then, when a batch of short-running tasks is issued, it would be easy to meet a day's quota in minutes and not earn much credit for the rest of the day; hence the current quota of 80.

If a host is producing errors, that host's quota is automatically lowered towards 1 until it produces valid results; then it can earn a higher quota again, back up to the maximum set by the project.

The only way to stop this kind of thing is for users to stop hiding their hosts; then someone could send them a PM saying "hey, check your host". Maybe they will also see that they are not earning much credit and look into it.

PS:
The latest check shows all tasks have been issued. Now let's see how long it takes to return those 24,000 still processing.

Unfortunately, people on SETI have tried PMing the owners of unhidden computers in order to alert them to problems their computer may be having, and I can honestly say practically 99% don't even get a response. Reducing hosts that produce nothing but errors to one task per day (or even blacklisting them so they get no tasks at all) may help. I know it's not in the spirit of BOINC, but it seems this is something that should be taken seriously, with the necessary steps taken to reduce errors.
ID: 23527
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23528 - Posted: 16 Oct 2011, 2:42:05 UTC - in response to Message 23527.  

I've tried PMing at Test4Theory too; I got 1 response from 100 PMs. The reason is that people set it and forget it, and rarely come back to the project message boards to check anything, let alone to see if they have any PMs to read.

A year or more ago, CPDN sent emails (not PMs) to owners of hosts that were crashing task after task. They also blacklisted those hosts temporarily. The email asked the owners to post in the CPDN forum and state that they had received it. They then received instructions to remedy the situation, and when they reported back that they had carried out the instructions their computer was allowed more tasks. If it still crashed tasks it was blacklisted again until the owner followed further instructions. The emails went out about a year ago and people are still reporting in that they have just received the email. That shows how little care and attention some people pay to BOINC after they set it up.

Sixtrack should do something similar and while they're at it they should send emails to all the owners who have joined the project but haven't had a computer contact the scheduler at the new URL. There are probably a few thousand computers whose owners want to contribute but don't know the URL has changed.

If the admins want to do that, they can send me a zipped dump of the database and I'll write a script to comb it for task-crashing computers and computers that haven't contacted the scheduler for a while, and send the owners an email.
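
Something along these lines would do it (a sketch only, assuming a CSV export of the host table with hypothetical column names - a real script would have to match the actual schema):

import csv
import time

NOW = time.time()
STALE_DAYS = 30          # hosts that haven't contacted the scheduler in a month
ERROR_RATE_LIMIT = 0.9   # hosts crashing nearly every task they get

def flag_hosts(path):
    # Yield (host id, reason, owner email) for hosts whose owners are worth emailing.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            error_rate = float(row["error_rate"])
            last_contact = float(row["rpc_time"])   # unix time of last scheduler contact
            if error_rate >= ERROR_RATE_LIMIT:
                yield row["id"], "crashing most tasks", row["owner_email"]
            elif NOW - last_contact > STALE_DAYS * 86400:
                yield row["id"], "no contact at the new URL", row["owner_email"]
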
ID: 23528
Richard Mitnick
Joined: 20 Dec 07
Posts: 69
Credit: 599,151
RAC: 0
Message 23529 - Posted: 16 Oct 2011, 11:43:31 UTC

I recently added a sixth machine (ID 9937283) to work on SixTrack. I got one WU and it finished successfully. Then I got the "Project has no work available" messages.

So I came here looking for some sort of outlook on general job availability, and I found this thread.

Everyone who knows me knows that I am not technically adept, just zealous for the LHC.

First, while I do know that there are people who build computers, maybe cheap Linux boxes or whatever, and use BOINC projects to "race" them, I have never before seen the term "crunching farms". That said, every BOINC project needs to deal with this phenomenon, and they do.

If there is something unique about SixTrack, then SixTrack needs to find a way of eliminating the crunching farms. The strength of any project is not measured by the number of machines; it is measured by the number of users. If I am one user with 100 machines, any time I like I can simply up and quit a project, and poof, 100 machines are gone.

And, I know from my T4T experience that you guys can see who is doing what (jujube).

In its origins, LHC@home managed to squander almost all of its 90,000 users (BOINCstats). I suggest this was because "try it, you'll like it" failed for lack of work. You need to keep crunchers busy. If that means you somehow eliminate crunching farms, then that is what you need to do.

We have a new opportunity to do something great. My personal wish is to see the two LHC@home projects each vie with SETI@home in TeraFLOPS. That is a tall order, and maybe we will never get there. But, we need to try to be the best that we can.

If there is any error in my logic, I apologize and will certainly read any rebuttal.
Please check out my blog
http://sciencesprings.wordpress.com
http://facebook.com/sciencesprings
ID: 23529
Profile Ageless
Joined: 18 Sep 04
Posts: 143
Credit: 27,645
RAC: 0
Message 23530 - Posted: 16 Oct 2011, 11:43:47 UTC - in response to Message 23528.  

If the admins want to do that, they can send me a zipped dump of the database and I'll write a script to comb it for task-crashing computers and computers that haven't contacted the scheduler for a while, and send the owners an email.

And you can be trusted, how? Just on your pretty face? ;-)
Seriously, the admins shouldn't send anyone a dump of the database. It makes the project not very trustworthy if they do.
Jord

BOINC FAQ Service
ID: 23530