Message boards : Number crunching : Long delays in jobs
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 23532 - Posted: 16 Oct 2011, 15:41:03 UTC - in response to Message 23529.  
Last modified: 16 Oct 2011, 15:47:22 UTC

Maybe you haven't come across the term "crunching farm", but you certainly know what a cluster is. All the top participants in Einstein@home are clusters of computers doing excellent work. That is not a problem for a scientific project where people are more interested in producing results than in credits. As you know, I am presently running CERN jobs in T4T without getting any credit, because of a bug in the wrapper.
Tullio
Also, Linux boxes are not necessarily cheap. I use a SUN M20 WS which cost me more than a standard Windows PC and I am constantly upgrading it.
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23533 - Posted: 16 Oct 2011, 16:39:07 UTC - in response to Message 23529.  

If there is something unique about SixTrack, then SixTrack needs to find a way of eliminating the crunching farms. The strength of any project is not measured by numbers of machines, it is measured by numbers of users.


If by strength you mean computing power then the number and type of machines is the measure of strength, IMHO. 100 users all with old P4 CPUs can't match the computing power of 1 user with 50 i7s.

And, I know from my T4T experience that you guys can see who is doing what (jujube).


Anybody can see what I can see if they know where and how to look. It's no secret. Click on any username and drill down through the links, unless they have their computers hidden.

In its origins, LHC@home managed to squander almost all of its 90,000 users (BOINCstats). I suggest that this was because "try it, you'll like it" failed for lack of work. You need to keep crunchers busy. If that means that you somehow eliminate crunching farms, then that is what you need to do.


Why? They do real work, and the 1-task-per-core-at-a-time quota spreads the work around. Sixtrack will always be an on-and-off project. There will always be long spells of no work. The reason is that this project finishes the computations needed for a particular LHC configuration and then there is nothing to do until LHC scientists change the configuration. One of the reasons this project has work now is that in the not-too-distant future they are going to upgrade the LHC and change the configuration. The calculations we do now are in preparation for that and possibly other configuration changes. When that's done there might be another long dry spell. I don't think stretching out the length of time we take to do the calculations is what the scientists want. Anyway, during those dry spells BOINC can keep crunchers busy with other projects.

When a user pulls 100 machines from a project it's always a sad thing, but I don't think it's a valid reason to disallow large farms. Every project is grateful for whatever crunching power it can get, and they all get used to the fact that crunchers come and crunchers go.
Filipe

Joined: 9 Aug 05
Posts: 36
Credit: 7,693,055
RAC: 146
Message 23534 - Posted: 16 Oct 2011, 17:20:13 UTC

Sixtrack will always be an on-and-off project. There will always be long spells of no work. The reason is that this project finishes the computations needed for a particular LHC configuration and then there is nothing to do until LHC scientists change the configuration. One of the reasons this project has work now is that in the not-too-distant future they are going to upgrade the LHC and change the configuration.


Well, Eric has said he is working on a Mac application, and maybe a GPU application, for next year.

So I hope/believe we have work for a few months at least.


Since the project moved to CERN, haven't we already accomplished more work than we did in the last few years?
Richard Mitnick

Joined: 20 Dec 07
Posts: 69
Credit: 599,151
RAC: 0
Message 23535 - Posted: 16 Oct 2011, 17:51:16 UTC

tullio - Einstein never seems to be in trouble for a dearth of work, so clusters are not a problem for Einstein.

jujube-

Thanks for your response. I fully understand the reasoning of the importance of farms, clusters, whatever they are called.

For me, the signal question is why, out of 92,000 total users, the project was down to about 2,500 users and, I think, 0.07 TeraFLOPS prior to this latest burst of energy (that last number may be incorrect; someone might be able to correct it). But now we are at 4,900 users and 2.155 TeraFLOPS from almost 9,000 machines.

It has always been true that people could attach to multiple projects and thus always have filled cores. I am attached to about 12 different projects in varied groups based on the varying power of my machines and the needs of the projects. I am never out of work. But instead of just letting cores do other work, users left in droves. That is a problem with which the project must deal.

Those current numbers work out to an average of 1.84 machines/user. I myself now have 6 machines attached. I do not think that I qualify as a farm or cluster. Nor does 1.84 imply anything like that. Sure, there might be a small number of users with a large number of machines.

And the 92,000 is a total of different users over a period of years, not all at any one time. But it is not a lot of years.

The question is why so many left and how we keep going in our current direction.

I have nothing but the greatest respect for this project and its people, and great hopes for the project's continued success.
Please check out my blog
http://sciencesprings.wordpress.com
http://facebook.com/sciencesprings
Siegfried Niklas

Joined: 8 Oct 11
Posts: 4
Credit: 183,067
RAC: 0
Message 23536 - Posted: 16 Oct 2011, 18:51:18 UTC - in response to Message 23524.  

Looking for a WU with "validation inconclusive", I found this host.

Looks like it suddenly became unstable.

Within the next hour the host most likely crashed all ~450 WUs immediately after the download (maximum daily WU quota per CPU = 80; 6-core CPU > 460).

Does anybody know a way to identify this kind of rapid WU transfer automatically and stop it earlier (server side)?

No, nothing automatic except the quota system, which is designed to do exactly that. The only way to reduce this situation is to lower the quota for everybody. But then, when a batch of short-run tasks is issued, it would be easy to meet the quota for a day in minutes and not earn much credit for the rest of the day; thus the current quota of 80.

If the host is producing errors, then that host's quota is automatically lowered towards 1 until it produces valid results again; it can then earn its quota back up to the max set by the project.
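
For illustration, here is a minimal sketch of how such an adaptive per-host quota can work. This is not the actual BOINC server code; the class, the halving/increment rule and the numbers are illustrative assumptions matching the behaviour described above (errors drive the quota towards 1, valid results let it climb back to the project maximum of 80 per core).

    # Illustrative sketch, not BOINC server code.
    PROJECT_MAX_QUOTA = 80  # max daily tasks per CPU core, as mentioned above

    class Host:
        def __init__(self, cores):
            self.cores = cores
            self.quota_per_core = PROJECT_MAX_QUOTA

        def record_result(self, valid):
            if valid:
                # a good result lets the quota creep back towards the max
                self.quota_per_core = min(PROJECT_MAX_QUOTA, self.quota_per_core + 1)
            else:
                # an error halves the quota, but never below 1
                self.quota_per_core = max(1, self.quota_per_core // 2)

        def daily_quota(self):
            return self.quota_per_core * self.cores

    # A 6-core host that trashes every task drops from 480 tasks/day
    # towards 6/day, which limits the damage from a rapid-fire failing host.
    host = Host(cores=6)
    for _ in range(10):
        host.record_result(valid=False)
    print(host.daily_quota())  # 6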

The only way to stop this kind of thing is for users to stop hiding their hosts; then someone could send them a PM saying, hey, check your host. Maybe they will also see they are not earning much credit and look into it.

PS: the latest check shows all tasks have been issued. Now let us see how long it takes to return those 24,000 still in progress.


Thanks for the answer

The only way to stop this kind of thing is for users to stop hiding their hosts, then someone could send them a PM saying, hey check your host


Mea culpa - My hosts are now visible.

(I decided years ago to hide my hosts - don't ask me why - Err.. I don't remember...)


http://lhcathomeclassic.cern.ch/sixtrack/hosts_user.php?userid=221086


http://de.boincstats.com/stats/boinc_user_graph.php?pr=bo&id=d36c2879c46c05cab64b721e1710725e


Crunching Family - Forum
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23537 - Posted: 16 Oct 2011, 20:28:14 UTC - in response to Message 23535.  

The question is why so many left and how we keep going in our current direction.


This project was one of the first. As other projects came on stream, perhaps some of the early volunteers here became more interested in those. The long dry spells didn't help either, and I have heard the application used to crash a lot a few years back and the website was down a lot. I'm sure a lot of interested volunteers just don't know the URL has changed.

I think it would be a very good idea to send an email to those who haven't contacted Sixtrack since it re-opened and explain that the app is very stable, the website is stable, and that they have to attach to the new URL if they want work. I bet that would bring back 4,000 users or more, which would speed up the completion of a batch considerably. They also need to get in touch with David Anderson and ask him to change the URL in all_projects_list.xml so that when people bring up the Add Project Wizard it will show the correct URL for Sixtrack. Right now the wizard has the old URL from the old project. One email from an admin is all it takes.
Richard Mitnick

Joined: 20 Dec 07
Posts: 69
Credit: 599,151
RAC: 0
Message 23538 - Posted: 16 Oct 2011, 21:27:56 UTC

I am going to stick my nose in here just one more time.

The first and most important responsibility of any organization or institution is to survive. This project has been given a second life.

There are projects running on BOINC software which have more than one research target. One example is Rosetta@home. Another is SETI@home. Both of these projects look at more than one kind of problem.

Rosetta not only designs software used by other research projects (HPF in WCG); it also works on malaria, anthrax, and herpes. In protein folding it works on cancer, AIDS, and Alzheimer's.

SETI@home does not only classic SETI, based on narrowband signals; it also does Astropulse, based on "broader-band short time pulses".

The above-mentioned WCG-based HPF project, at the Bonneau Lab at New York University, did not only Human Proteome Folding but also the Human Microbiome Project.

Also at WCG, the Discovering Dengue Drugs - Together project has worked on Dengue Fever, West Nile Virus, and Yellow Fever.

The point is that SixTrack can be a new beginning, but it need not be any kind of end point. LHC@home 2.0, suffering its own growing pains, is not covering all of the opportunities for work at CERN. I looked high and low at their site and can't find which experiments they are working for. But there are four, and they cannot be working for all of them.

Also, at CERN, the LHC is not the end of the line. Next up will be the Compact Linear Collider (CLIC) or, just maybe, the International Linear Collider. These are more places to look for work to keep this project up and running. There is nothing sacrosanct about the name "SixTrack". It's almost brand new. Maybe a better name would be LHC@home/subatomics or LHC@home/quanta.

A lot of people thought when the Tevatron was done that Fermilab was done. Fermilab is alive and kicking, with a whole bunch of new projects. And that means real money, winched with a crowbar from a U.S. Congress filled with mediocrity. So, Fermilab did a really good job of staying in the game. And, last, we are going to save the James Webb Space Telescope, late and over budget and doomed, but oh so necessary.

I personally will never detach from this project, unless I am kicked off. We need to attract and keep people and machines, that is my interest here.


Please check out my blog
http://sciencesprings.wordpress.com
http://facebook.com/sciencesprings
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 23539 - Posted: 17 Oct 2011, 2:03:11 UTC

I think that the most attractive feature for a project is to provide a stable flow of work. LHC@home has given little or no work for years, and people have left the project. Projects like Einstein@home provide work even if gravitational-wave data are not available. So Dr. Bruce Allen has given us a way to look for binary pulsars in Arecibo and Parkes data and also, in data from the Fermi gamma-ray space telescope, for gamma-ray pulsars, just to keep us working until the LIGO interferometers are upgraded to Advanced LIGO. Same for QMC@home. Some projects have disappeared, like AQUA@home and QuantumFIRE@home. But we still have a menu to choose from in BOINC.
Tullio
Richard Mitnick

Joined: 20 Dec 07
Posts: 69
Credit: 599,151
RAC: 0
Message 23540 - Posted: 17 Oct 2011, 15:52:16 UTC - in response to Message 23539.  

Tullio-

That is very much the point. Keep work flowing. Some people are interested in what it is, some are not, but Einstein is keeping its users happy. That is exactly what I am looking for.
Please check out my blog
http://sciencesprings.wordpress.com
http://facebook.com/sciencesprings
Siegfried Niklas

Joined: 8 Oct 11
Posts: 4
Credit: 183,067
RAC: 0
Message 23541 - Posted: 17 Oct 2011, 19:33:28 UTC - in response to Message 23525.  
Last modified: 17 Oct 2011, 19:40:39 UTC



AND "jujube"

LOL!

- what about??


I was amused, in practice and not theoretically, by the fact that my slow old Celeron has been deemed fast and reliable. Anything wrong with that "Siegfried Niklas"?




No, nothing wrong with that!

I am a German cruncher with limited knowledge of the English language.

I misunderstood your words - sorry!


Crunching Family - Forum
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23542 - Posted: 17 Oct 2011, 23:35:03 UTC - in response to Message 23541.  

Niklas,

It's OK. We all make mistakes occasionally :-)
Profile Gary Roberts

Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 23543 - Posted: 18 Oct 2011, 1:41:45 UTC - in response to Message 23538.  

I am going to stick my nose in here just one more time....

I wish you wouldn't :-). You started spruiking the project in the Cafe and that's all fine and laudable there but rather out of place here. It seems you even realise you are continuing to hijack this thread and yet that doesn't seem to bother you.

This thread was specifically started by a mod to air thoughts and ideas about how to deal with the slow returning 'tail' of tasks that always seems to slow down the completion of a run. By users making observations, we now know that 'resend' tasks are only sent out after the main body of work is exhausted. Initially this seemed to be a curious policy but the fact that resends are sent out with shortened deadlines to supposedly trusted and fast hosts seems to make up at least partially for the delay in issuing them. Do you have any thoughts about that? It would certainly be appropriate to post them here.

As part of your evangelism, you have made a couple of curious statements. In your first post in this thread, you stated:
....SixTrack needs to find a way of eliminating the crunching farms.

You implied that crunching farms were responsible for the lack of work for your recently added sixth machine. And yet, with six machines, you're really a farmer yourself :-)! You mightn't actually have a cattle ranch (yet) but you're at least on the way :-). Let me let you in on a little secret! All projects love farms for all the obvious reasons. You are never going to convince any project that farms should be eliminated.

You also stated:
The strength of any project is not measured by numbers of machines, it is measured by numbers of users. If I am one user with 100 machines, any time I like I can simply up and quit any project, and poof, 100 machines are gone.

There are two things wrong with that. Firstly, the strength of a project is entirely dependent on machines and not users. A project doesn't really care how many people actually own the attached computers. If a project had the option of accepting 1 new user with 1000 computers OR 500 new users each with a single machine, which offer do you think they would take? :-)

Secondly, you vastly overrate the effect of 1 user (100 computers) leaving. It would actually be like a drop in the ocean even for this project. Take a look at the server status page. At this point it shows that 240 new computers came on line in the last 24 hours. So 100 computers leaving wouldn't even be noticed. And the other thing you are ignoring is that the owners of big farms don't move their machines for capricious reasons. They have very clear reasons for choosing a project in the first place and they only tend to leave for a limited number of reasons, mostly to do with work supply. Of course there are other reasons like large teams having races using a particular project for a limited period. Do you really think that projects would be upset about this simply because the large influx is going to vanish at the end of the race period? Of course not! They view the work done during the race as an added bonus.

In a followup post you stated:
For me, the signal question is why out of 92,000 total users, the project was down prior to this latest burst of energy to about 2500 users
....

The question is why so many left and how we keep going in our current direction.

That's really two separate questions and the answer to both seems rather obvious. People left because there was essentially no work (or even signs of life) for many, many months. They became disenchanted with the lack of even basic snippets of information that might have kept up their interest. They took the view that they were being ignored. Secondly, we keep going in the current direction simply by maintaining ongoing work if possible and advising users in advance if there are likely to be periods of no work. The current rapid progress shows one thing clearly - people are interested in the project and are willing to forgive past poor behaviour. We don't (yet) need a PR campaign to get users - people are rapidly returning of their own accord. In fact, a PR campaign could easily be counterproductive if the project is unable to keep supplying work as it is currently doing. Far better to think about a campaign when (if) the project gets to the stage where the ready-to-send tasks obviously can't be handled by the attached hosts. We are quite a way from that point at the moment.

Hopefully we will get there at some stage and then you can start your own PR thread in the cafe. I suspect that the interest in the LHC will generate the extra users anyway without too much PR being needed.

Cheers,
Gary.
Profile geo...

Joined: 18 Sep 04
Posts: 7
Credit: 2,320,210
RAC: 0
Message 23544 - Posted: 18 Oct 2011, 1:57:35 UTC - in response to Message 23543.  

Thank you Gary...
Profile Gary Roberts

Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 23545 - Posted: 18 Oct 2011, 2:47:37 UTC

Earlier in this thread, I linked to two examples where aborted tasks didn't immediately result in resends being issued. They were prepared but remained unsent until all primary tasks were issued. In both cases there was a 4-day delay in issuing the resends. Initially I thought this would hamper the cleanup of the dregs of the run.

When there were no primary tasks left, there was quite a spike of resends then available, presumably to those hosts regarded as "fast and reliable", and this seemed to go rather well. My own hosts got quite a few with the reduced deadline and they were completed quite quickly. The two examples I linked to were also done quite quickly, even though one was by a 'slow' host (by modern standards) :-).

So the policy seems to have worked well but it would be good to get some admin feedback on how much of an improvement there actually was. With the extra small run of new tasks injected several hours ago but now (apparently) exhausted, it's a bit hard to use the status page as an indication.

Cheers,
Gary.
Profile Gary Roberts

Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 23546 - Posted: 18 Oct 2011, 4:28:12 UTC

This quorum represents an interesting example of what appears to be a bug in the BOINC server software. You can see that 4 copies were sent out before 2 were selected as agreeing sufficiently to allow validation. By tracing the issue and subsequent return times, we can see that, from the initial pair, there was no agreement and both would have been marked as "inconclusive" when first returned. The first resend was issued more than 4 days later and it was fairly quickly aborted. About 12 hours after the aborted task was returned, it was reissued, this time to one of my hosts which promptly completed and returned it.

What should have happened at this point is that the first two "inconclusives", together with the freshly completed 4th task, should have been re-assessed, and the non-agreeing one marked as invalid rather than being left as a "pending inconclusive". The project doesn't seem to be using fully up-to-date server software so, if this is indeed a bug, it may well have been corrected in a later version. If not, it should be reported to the BOINC developers.
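
For what it's worth, here is a minimal sketch of the re-assessment step described above. It is not the actual BOINC validator code; the result states, the compare function and the numbers are illustrative assumptions. The idea is that once two results agree and a canonical output exists, every returned result is compared against it and marked valid or invalid instead of being left as a pending inconclusive.

    # Illustrative sketch, not BOINC validator code.
    VALID, INVALID, INCONCLUSIVE = "valid", "invalid", "inconclusive"

    def results_agree(a, b, tolerance=1e-9):
        # stand-in for the project's real comparison of output files
        return abs(a - b) <= tolerance

    def revalidate_quorum(results):
        """results: list of (result_id, output_value, state) tuples."""
        # find a canonical pair, i.e. two results that agree
        canonical = None
        for i, (_, out_i, _) in enumerate(results):
            for _, out_j, _ in results[i + 1:]:
                if results_agree(out_i, out_j):
                    canonical = out_i
                    break
            if canonical is not None:
                break
        if canonical is None:
            return results  # still no quorum; leave everything pending

        # re-assess every returned result against the canonical output
        return [(rid, out, VALID if results_agree(out, canonical) else INVALID)
                for rid, out, _ in results]

    # the 4-copy quorum from the post: two early non-agreeing results,
    # then later results that agree with one of them
    quorum = [(1, 0.51, INCONCLUSIVE), (2, 0.49, INCONCLUSIVE),
              (3, 0.51, INCONCLUSIVE), (4, 0.51, INCONCLUSIVE)]
    print(revalidate_quorum(quorum))  # result 2 ends up INVALID, not pending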

Cheers,
Gary.
Zapped Sparky

Joined: 22 Oct 08
Posts: 26
Credit: 75,214
RAC: 0
Message 23549 - Posted: 18 Oct 2011, 17:37:56 UTC - in response to Message 23417.  

@Gary Roberts, from Igor's message 9th Oct

with help of Keith, I have implemented the reliable/trusted host settings correctly now.
I believe the execution turn-around will improve. Will monitor and see.

Thank you much!


Looks like from that quorum of yours and a couple of tasks sent my way that the reliable/trusted host update kicked in (or finally got around to the tasks) on the 15th.
wu 279308
wu 279207
In both cases I'm the third host. I can't say it's confirmation (being only a few tasks) but it looks like it's starting to work as planned.
Profile Gary Roberts

Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 23552 - Posted: 19 Oct 2011, 4:44:38 UTC - in response to Message 23549.  

@Gary Roberts, from Igor's message 9th Oct

Thanks for taking the trouble but I'm not sure why you felt you needed to quote that message at me. I was fully aware of it and had digested it when I first read it some time ago. I'd stated earlier in this thread that I'd seen plenty of examples of 'shortened deadline' resends.

Looks like from that quorum of yours ....

Well, no, that quorum I linked to has nothing to do with the fast reliable hosts experiment. I highlighted it to draw attention to what appears to be a BOINC server bug. That quorum has been completed but one of its members has been left in an inconsistent state. The task deemed to be invalid has been left as 'pending' and I imagine this might prevent the quorum from being deleted at the appropriate time. In the past there were plenty of examples of pending tasks left cluttering up the database long after the main body of tasks had been deleted. I don't want to see that happening again.

... and a couple of tasks sent my way that the reliable/trusted host update kicked in (or finally got around to the tasks) on the 15th.

I'm sure you are correct and those quorums you linked to had short-deadline resends which were sent to you. However, since they are now completed and the green deadline information has been replaced with the actual reporting times in black, it is no longer possible for us to see what the deadline was at the time the resends were first issued. That's not a problem of course; it just means you can only observe it at the time and not later.

The criteria for a fast reliable host might need a bit of tweaking if this and this occur with any frequency. In these two examples, a short deadline resend was issued when a primary task failed. In both those cases, the first resend timed out and that triggered a second short deadline resend. It's quite possible that the first resend might get returned late and so complete the quorum. If that happens, I trust the second resends will still be awarded credit if they get completed before their own shortened deadlines. I was interested to note that the computer used for the first resend in the first of the examples given had a turnaround time of close to 4 days - hardly what you would call 'fast and reliable' :-).

Cheers,
Gary.
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23553 - Posted: 19 Oct 2011, 6:14:27 UTC - in response to Message 23552.  
Last modified: 19 Oct 2011, 6:21:07 UTC

The criteria for a fast reliable host might need a bit of tweaking if this and this occur with any frequency. In these two examples, a short deadline resend was issued when a primary task failed. In both those cases, the first resend timed out and that triggered a second short deadline resend. It's quite possible that the first resend might get returned late and so complete the quorum. If that happens, I trust the second resends will still be awarded credit if they get completed before their own shortened deadlines. I was interested to note that the computer used for the first resend in the first of the examples given had a turnaround time of close to 4 days - hardly what you would call 'fast and reliable' :-).


Yes, your 2 examples prove the criteria definitely need tweaking. The current settings pretty much guarantee some resends will time out and require yet another resend, as is evident in your two examples. In the first example the current settings allowed a task with a 3-day deadline to be sent to a host with a 4-day average turnaround. That doesn't make any sense. If the shortened deadline is 3 days then resends should be issued only to hosts with an average turnaround time of somewhat less than 3 days. Since the host's turnaround time is an average rather than a maximum, you can expect the host to sometimes take more than its average turnaround time. That is exactly what happened in the second of your examples, where the resend went to a host with a 2.75-day average turnaround and timed out. Thus for a 3-day deadline the <reliable_max_avg_turnaround> setting should probably be no greater than 216,000 seconds (2.5 days). Maybe 2 days would be even better. Though decreasing the turnaround-time criterion eliminates more hosts from the fast-reliable pool, the probability that a resend will time out decreases. If you're trying to speed up the batch you definitely don't want to cause another resend. It's probably better to issue 3 resends (1 per core at a time) to a host with a 1-day average turnaround than to send any of those 3 to a host with a 3-day turnaround time, if the shortened deadline is 3 days.
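
For illustration, here is a minimal sketch of the selection rule argued for above. It is not the BOINC scheduler itself; the function name and the margin value are illustrative assumptions, with the margin chosen so that, for a 3-day shortened deadline, the cutoff works out to the suggested 216,000 seconds (2.5 days).

    # Illustrative sketch, not the BOINC scheduler.
    DAY = 86_400  # seconds

    def eligible_for_resend(avg_turnaround_s, shortened_deadline_s, margin=0.5 * DAY):
        """True if the host's average turnaround leaves enough headroom
        below the shortened deadline to make a timeout unlikely."""
        return avg_turnaround_s <= shortened_deadline_s - margin

    deadline = 3 * DAY  # the shortened resend deadline discussed above
    for days in (4.0, 2.75, 1.0):
        ok = eligible_for_resend(days * DAY, deadline)
        print(f"{days}-day average turnaround: {'send' if ok else 'skip'}")
    # 4.0  : skip (the host that caused the first timeout)
    # 2.75 : skip (below the 3-day deadline, but with no headroom it timed out anyway)
    # 1.0  : send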
VALDIS

Joined: 22 May 11
Posts: 2
Credit: 132,444
RAC: 0
Message 23554 - Posted: 19 Oct 2011, 9:49:25 UTC

If you guys crunched as much as you chatted, you would all be up in the hundreds of millions of credits. I myself have only one machine that runs 24/7 at 100% across the board, and I still manage to do work in 3D Studio Max and Maya. So why don't you all just crank up your workload percentage and REALLY contribute to the scientific community? By Dec. 15th I will have finished building my new machine, which should be capable of at least a million credits/day. Come on guys, I know your machines can do better..... JUST LET THEM RIP; yes, you can still play your silly games without missing a beat.
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 23556 - Posted: 19 Oct 2011, 13:09:57 UTC
Last modified: 19 Oct 2011, 13:10:19 UTC

I don't care about credits. I've been running CERN jobs in BOINC_VM for days without getting a single credit because of a bug in the Test4Theory wrapper in version 6.05. Now it is in version 7.01 and I am slowly going up the credits ladder. I have been running it since November 28 and my user number is 10.
Tullio