Message boards :
Number crunching :
no more work?
Author | Message |
---|---|
Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
If you're not receiving work, it's probably because the only tasks left in the queue are resends (tasks that failed to verify or returned late). The Sixtrack server is configured to issue resends only to computers that are rated fast and reliable. Fast means the computer has a short task turnaround time. Reliable means a high percentage of its results verify. If you're not receiving work, your computer probably isn't on the list of fast, reliable hosts. What can you do to get on that list? 1) Reduce your computer's task turnaround time by keeping a small cache. My computer's turnaround time is 0.43 days and I am receiving work. 2) Make sure your computer doesn't crash tasks. So look at your computer's details here on the website, check its turnaround time, and see whether its results are validating. |
Joined: 18 Sep 04 Posts: 163 Credit: 1,682,370 RAC: 0 |
If you're not receiving work it's probably because the only tasks left in the queue are resends (tasks that failed to verify or returned late). Guess, or verified info? Michael Team Linux Users Everywhere |
Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
Discussion in the "Long delays in jobs" thread confirms that resends are being sent to the tail end of the queue, so we can infer that at some point there will be only resends in the queue (unless they create more new WUs). Since 29 Oct 2011 21:14:21 UTC I've received nothing but resends, so I am very sure there is nothing but resends left in the queue. The turnaround times for your 4 computers range from 2 days to 2.9 days. I don't know what the turnaround time limit is set to on the server, but from your turnaround times and the fact that you're receiving no work, it's plausible that turnaround time must be lower than 2 days to qualify. On the other hand, maybe your client just doesn't want any Sixtrack work at this time. Does the log say "not requesting work" when you update Sixtrack? If it says it's requesting work but doesn't receive any, then it must be that your turnaround times aren't fast enough. You may have noticed Brian Alexander's turnaround time is 0. Notice also that his computer was attached on Oct. 27 and has completed only 1 task. From that I assume the server thinks his computer hasn't turned in enough results to determine whether it's fast, so it doesn't get any work either. BTW, I think the acceptable turnaround time should be 3 days. 2 seems a little low, but that's for the admins to decide, not me. |
Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
I keep getting: The problem is your computer has done only 1 task and it has not been validated yet. You have a "0 turn around time". I think until that task validates this will not be increased, but I'm not sure about that. You'll either have to wait it out or try a project reset, but I do not know what effect a reset will have (I make no guarantee it will fix anything or not mess anything up). |
Joined: 23 Oct 04 Posts: 358 Credit: 1,439,205 RAC: 0 |
|
Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
Why my computers get no work, even there was no invalid result returned, and the server status is 5'006 ready to send? What messages are you getting? It is hard to answer without more details. It could be something like there is only Linux work left and you have Windows, or the other way around. Have you reached your quota for the day? What version of the client are you running, and are you pressing update or waiting for the client to request work naturally? It makes a difference. |
Joined: 23 Oct 04 Posts: 358 Credit: 1,439,205 RAC: 0 |
Why my computers get no work, even there was no invalid result returned, and the server status is 5'006 ready to send? Oops, sorry, but all the info can be seen on the computers page (they are not anonymous)^^ , OS = Windows 7. Only this message when I start BM...: Time: UTC+1 01.11.2011 02:58:58 LHC@home 1.0 Sending scheduler request: To fetch work. 01.11.2011 02:58:58 LHC@home 1.0 Requesting new tasks 01.11.2011 02:58:59 LHC@home 1.0 Scheduler request completed: got 0 new tasks 01.11.2011 02:58:59 LHC@home 1.0 Message from server: No work sent greetz littleBouncer |
Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
Why my computers get no work, even there was no invalid result returned, and the server status is 5'006 ready to send? Not sure why you linked to my computers? That usually means the server queue is empty. Remember the server status page is cached; the queue can be empty within minutes of showing lots of work, especially now that there are many more users and active hosts. |
Joined: 18 Sep 04 Posts: 143 Credit: 27,645 RAC: 0 |
Not sure why you linked to my computers ? He didn't. http://lhcathomeclassic.cern.ch/sixtrack/hosts_user.php is the general link to "your" computers. When I click on it, I see mine. ;) Jord BOINC FAQ Service |
Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
When Sixtrack is out of work (as it is now) it says: Tue 01 Nov 2011 09:46:10 AM MDT | LHC@home 1.0 | (Project has no jobs available) That proves there was work when littleBouncer requested work. The reason he didn't get any is because his turnaround time is too high. The same has happened to several posters in this thread. It has nothing to do with Linux vs. Windows because my Linux box gets tasks that have been sent to Windows hosts. They do not use homogeneous redundancy here, Sixtrack tasks can and do go to either OS. little Bouncer's turnaround time is 2.1 days on one of his hosts and 1.82 on his other host. IMHO, the turnaround time requirement seems a little low. |
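
The distinction drawn in this post between the two server replies can be sketched as a small log classifier. This is only an illustration of the poster's reading of the messages, not documented BOINC behaviour: the function name `classify_reply` and the category labels are hypothetical, and the only message strings used are the ones quoted in this thread.

```python
def classify_reply(log_line: str) -> str:
    """Classify a client log line about a work request.

    Hypothetical helper: the categories follow the interpretation
    given in this thread, not an official BOINC specification.
    """
    text = log_line.lower()
    if "project has no jobs available" in text:
        # Per the post above: the project queue itself is empty.
        return "queue_empty"
    if "no work sent" in text:
        # Per the post above: work may exist but was not given to this host.
        return "work_withheld"
    return "other"


if __name__ == "__main__":
    lines = [
        "01.11.2011 02:58:59 LHC@home 1.0 Message from server: No work sent",
        "Tue 01 Nov 2011 09:46:10 AM MDT | LHC@home 1.0 | (Project has no jobs available)",
    ]
    for line in lines:
        print(classify_reply(line), "<-", line)
```

A real client may print more than one of these messages for a single request, so the classifier is only a starting point for reading one's own log.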
Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
Oops sorry, I was up til 3am last night and had to come in early to work today, my brain is fried. The current turn around limit is 129600 seconds (1.5 days). This is what the scientists want, which would not be a problem if the mechanism functioned correctly. And considering over 5,000 tasks disappeared in less than a day, there are ample computers to handle resends. Normally these should be sent out earlier as needed and not all be saved up, so this problem at the end would not be noticed. What would happen if things worked correctly is that the quicker hosts would get resends instead of normal work, and normal work would be sent to the slower hosts, so more hosts get work. I think new work was also held up so the queue could empty out all the backlogged tasks that have been waiting over 10 days to resend (they are now at least 17 days old from when they started) because the scheduler mechanism is not working. We think it is an older scheduler and that the way resends are handled was changed in some later version, so the options in the docs are for a newer scheduler; the one in use does not recognize them, so it malfunctions. Igor plans to do an update when he can find time in his schedule. There was some reason this was being held off, but since T4T did a successful one, that reason may no longer apply and an update can proceed, time permitting. |
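
As a rough check of the 129600-second limit against the turnaround times mentioned in this thread, here is a minimal sketch. The function name `qualifies_for_resends` is hypothetical, and two details are assumptions rather than known server behaviour: that the comparison is "less than or equal", and that a turnaround of 0 (no validated results yet) does not qualify, as suggested earlier in the thread.

```python
SECONDS_PER_DAY = 86400
TURNAROUND_LIMIT_SECONDS = 129600  # 1.5 days, the limit quoted in this thread


def qualifies_for_resends(turnaround_days: float,
                          limit_seconds: int = TURNAROUND_LIMIT_SECONDS) -> bool:
    """Return True if a host's average turnaround is within the limit.

    Assumptions (not confirmed by the project): the test is <=, and a
    turnaround of 0 means "no history yet" and therefore does not qualify.
    """
    if turnaround_days <= 0:
        return False
    return turnaround_days * SECONDS_PER_DAY <= limit_seconds


if __name__ == "__main__":
    # Turnaround times (in days) taken from posts in this thread.
    for days in (0.43, 1.82, 2.1, 0):
        print(days, qualifies_for_resends(days))
```

Under these assumptions only the 0.43-day host qualifies, which matches what posters reported.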
Joined: 23 Oct 04 Posts: 358 Credit: 1,439,205 RAC: 0 |
Thanks guys for the replies^^ I must say LHC is not the only project my machines are crunching. They also work for Einstein and pirates. For that I set "connect to internet every" to 2 days and the cache to 0.25, which will surely return the work within 2.1 days. Before I wrote my first post here, I tried (and am still trying) to get work with the following preferences: "connect to internet every" 0 days and cache at 0.25, and I suspended all other projects except LHC 1.0. Then I updated my BM client, with the result: message from server: no work sent. It seems the old turnaround data is still in effect, so I have to detach and reattach when LHC 1.0 has new work. Sorry for my English, I normally speak German^^ greetz littleBouncer BTW: sorry for the 'bad link', I wanted to show my computers' data |
Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
The current turn around limit is 129600 seconds (1.5 days). This is what the scientists want, which would not be a problem if the mechanism functioned correctly. And considering over 5,000 tasks disappeared in less than a day, there are ample computers to handle resends. Over 5,000 in less than a day is impressive. If anybody wants to get in on the resends, all they need to do is decrease their cache and get their turnaround time below 1.5 days, which is easily done. |
Joined: 23 Oct 04 Posts: 358 Credit: 1,439,205 RAC: 0 |
The current turn around limit is 129600 seconds (1.5 days). This is what the scientists want, which would not be a problem if the mechanism functioned correctly. And considering over 5,000 tasks disappeared in less than a day, there are ample computers to handle resends. Decreasing the cache alone doesn't help. You have to detach and then reattach after decreasing the cache, because the former turnaround time is the parameter in use! |
Joined: 2 Sep 04 Posts: 209 Credit: 1,482,496 RAC: 0 |
The current turn around limit is 129600 seconds (1.5 days). This is what the scientists want, which would not be a problem if the mechanism functioned correctly. And considering over 5,000 tasks disappeared in less than a day, there are ample computers to handle resends. It was more like less than 14 hours. Considering that, and that we now have over 7000 users and 11,000 hosts (with recent credit), I think there are enough. What we need to do first is get the entire mechanism functioning properly before trying to adjust the time/requirements for the "reliable hosts" any further. Consider too the 1/cpu limit and the short compute time: 8 hours for average work, if it is started promptly, that is. The resends are going to come from several sources (abort, detach, inconclusive, timeout), but the longest is the timeout, because a host didn't start the task within the 7 days. Most likely the 2nd result is already done, so waiting any longer for the third to complete is what the scientists want to avoid. We have allowed another 3.5 days of deadline, but are hoping that most of the "reliable hosts" will return it faster. So under normal circumstances most work should be completed within 8 days, reducing the time batches go on. As it is now, some are older than that because of the 10 day delay between the timeout and the resend, which is not good; that makes some of the tasks over 18 days old from when they started. I believe that once this delay problem is solved, the "no more work" issue will decrease too. There will be no need to hold up submitting new work. The old work will still clear out first, with resends mixed in along the way. |
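
The timeline described in this post can be worked through as simple arithmetic. The sketch below uses only the figures quoted above (a 7 day primary deadline, a 3.5 day resend deadline, and the 10 day delay before a timed-out task is resent); the function name `resend_timeline` is hypothetical and the numbers are worst cases, not measurements.

```python
PRIMARY_DEADLINE_DAYS = 7.0   # primary task deadline quoted in the thread
RESEND_DEADLINE_DAYS = 3.5    # extra deadline allowed for a resend
OBSERVED_DELAY_DAYS = 10.0    # delay between timeout and resend reported above


def resend_timeline(resend_delay_days: float) -> tuple[float, float]:
    """Return (age when the resend is issued, worst-case age when it is due).

    Ages are in days, counted from when the primary task was first sent,
    and assume the primary task ran all the way to its deadline.
    """
    issued_at = PRIMARY_DEADLINE_DAYS + resend_delay_days
    due_at = issued_at + RESEND_DEADLINE_DAYS
    return issued_at, due_at


if __name__ == "__main__":
    print("no delay:     ", resend_timeline(0.0))
    print("10 day delay: ", resend_timeline(OBSERVED_DELAY_DAYS))
```

With no delay the worst case is 10.5 days; with the 10 day delay the resend is not even issued until day 17, which is consistent with the "over 17" and "over 18" day figures mentioned in the thread.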
Joined: 6 Dec 10 Posts: 9 Credit: 1,000,912 RAC: 0 |
When do you think we will have more work? thanks |
Joined: 22 Jul 05 Posts: 72 Credit: 3,962,626 RAC: 0 |
... This is what the scientists want, which would not be a problem if the mechanism functioned correctly.... When the Devs are looking at whether or not things are functioning correctly, perhaps they might like to consider examples like this quorum, from which a number of interesting observations can be made. At first glance it looks quite straightforward: two deadline misses, one of which was completed 3 hours after the deadline, so one of the two resends that were created wasn't actually needed. But take a closer look. The two deadline misses were created around 6:12 UTC on 27 October. At this point, two 'unsent' resend tasks would have been added to the quorum. Three hours later at 9:11 UTC, one of the deadline misses was then returned. At that point one of the 'unsent' resends should have been cancelled with a status of 'didn't need'. I've actually seen this happen at Einstein so I know BOINC can do this. Those two 'unsents' sat around for 4 days and were finally sent out on 31 October. It was nearly a day later on 1 November before the extra task was actually cancelled with a status of 'redundant result'. So why did it take so long for the system to realise that only one resend would actually be needed? It should have been able to make that decision back on 27 October at 9:11 UTC. A BOINC bug, I guess. On the more general question of how to achieve the more rapid finalisation of batches of work, I think it would be beneficial to 'educate' the volunteers about how best to contribute if they really want to. Just take a look at the quorum I linked to above. Drill down to the details page for the two hosts who received the primary tasks. Notice that those hosts have turnaround times of 4.37 and 6.65 days respectively. So both those hosts are rather unsuited to the needs of the scientists. If you can put a 1.5 day turnaround limit on resends, why not put, say, a 4 day limit on primary tasks? If people insist on having 7 day work caches then the project should decline their offer to do work under such unsuitable conditions. There could be a big, obvious notice about the limitation on the home page and the message in BOINC Manager could say: "No work sent - your turnaround time is longer than 4 days. To get work, please reduce your work cache size and detach/reattach." People could then decide if they really wanted to support the project or not. The actual primary deadline puzzles me. It's often quoted as 7 days but the deadline miss task in the above quorum gives a precise figure of 6 days 15 hours 32 minutes and 14 seconds (or 574334 seconds). Why such odd-ball numbers? If the scientists really need to finalise a run quickly, why not reduce the deadline a little (to, say, 5 days) to prevent the last primary tasks issued from potentially hanging around for up to 7 days before any 'deadline miss' resends for them could be sent out? Sure, some people will complain, but probably will get used to it when they understand the needs of the project. Cheers, Gary. |
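
The deadline figure quoted in this post can be checked with a few lines of arithmetic. This is just a conversion sketch; the function name `split_seconds` is hypothetical and nothing here explains why the server chose that value.

```python
def split_seconds(total_seconds: int) -> tuple[int, int, int, int]:
    """Break a duration in seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return days, hours, minutes, seconds


if __name__ == "__main__":
    deadline = 574334                  # seconds, as quoted in the post
    seven_days = 7 * 86400             # the commonly quoted 7 day deadline
    print(split_seconds(deadline))               # (6, 15, 32, 14)
    print(split_seconds(seven_days - deadline))  # (0, 8, 27, 46)
```

So 574334 seconds is indeed 6 days 15 hours 32 minutes 14 seconds, which is 8 hours 27 minutes 46 seconds short of a full 7 days.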
Joined: 2 Sep 08 Posts: 5 Credit: 121,460 RAC: 0 |
If the scientists really need to finalise a run quickly, why not reduce the deadline a little (to, say, 5 days) to prevent the last primary tasks issued from potentially hanging around for up to 7 days before any 'deadline miss' resends for them could be sent out? Sure, some people will complain, but probably will get used to it when they understand the needs of the project. Why not send each WU to 3 computers instead of 2, with a minimum quorum of 2, keeping the deadline at 7 days? Yes, there will be some wasting of computational resources, but with less resends and faster finalisation. |
Joined: 25 Jan 11 Posts: 179 Credit: 83,858 RAC: 0 |
Why not send each WU to 3 computers instead of 2, with a minimum quorum of 2, keeping the deadline at 7 days? Yes, there will be some wasting of computational resources, but with less resends and faster finalisation. There would be more than "some" waste of computational resources, there would be a huge waste of resources. Also, there is absolutely no way you can get a big job done faster by unnecessarily repeating parts of the job. Your plan would not only waste a huge amount of CPU time, it would actually make the job take longer. The best way to get the job done quicker is to wait and see if a WU needs a resend and then issue the resend to a fast reliable host and/or reduce the deadline. Reducing the deadline to 5 days would not eliminate hosts with a 6.6 day turnaround time as Gary R. seems to think. Those hosts would still receive tasks if the scheduler thinks they are able to complete the tasks before the deadline. I doubt that any of the hosts attached to this project needs more than 4 days to crunch a Sixtrack task so there is no reason a 5 day deadline wouldn't work for everybody. Until the admins/scientists decide whether or not the last batch of WUs completed fast enough there isn't much point in worrying about shorter deadlines. |
Joined: 2 Sep 08 Posts: 5 Credit: 121,460 RAC: 0 |
Reducing the deadline to 5 days would not eliminate hosts with a 6.6 day turnaround time as Gary R. seems to think. Those hosts would still receive tasks if the scheduler thinks they are able to complete the tasks before the deadline. I doubt that any of the hosts attached to this project needs more than 4 days to crunch a Sixtrack task so there is no reason a 5 day deadline wouldn't work for everybody. You'd be correct if all computers ran 24/7... With a deadline of 5 days and a turnaround of 6.6 days, you'll have a lot more tasks to resend, and the scheduler won't send any new tasks to computers that run, e.g., 12 hours a day, 5 days a week. Someone who downloads a task on a Friday and doesn't run it on Saturday and Sunday will have only 3 days to the deadline. That means a bigger chance of not completing it in time... and one more task to resend. And another user excluded. |
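
The Friday example in this post can be counted out explicitly. The sketch below is an illustration only: the function name `running_days_before_deadline` is hypothetical, it counts whole calendar days, and it assumes a host that runs Monday to Friday and is off at weekends.

```python
from datetime import date, timedelta

WEEKDAYS = frozenset({0, 1, 2, 3, 4})  # Monday..Friday in date.weekday() numbering


def running_days_before_deadline(download_day: date,
                                 deadline_days: int,
                                 runs_on: frozenset = WEEKDAYS) -> int:
    """Count the calendar days, starting on the download day, on which the
    host actually runs before a deadline `deadline_days` days later."""
    return sum(
        1
        for offset in range(deadline_days)
        if (download_day + timedelta(days=offset)).weekday() in runs_on
    )


if __name__ == "__main__":
    friday = date(2011, 11, 4)  # a Friday
    print(running_days_before_deadline(friday, 5))  # 3
    print(running_days_before_deadline(friday, 7))  # 5
```

With a 5 day deadline the weekend host gets 3 running days (Friday, Monday, Tuesday), as the post says; with the 7 day deadline it gets 5.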
©2024 CERN