Message boards : Number crunching : no more work?
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23609 - Posted: 31 Oct 2011, 5:44:49 UTC - in response to Message 23608.  
Last modified: 31 Oct 2011, 5:52:09 UTC

If you're not receiving work it's probably because the only tasks left in the queue are resends (tasks that failed to verify or returned late). The Sixtrack server is configured to issue resends only to computers that are rated fast and reliable. Fast means the computer has a short task turnaround time. Reliable means a high percentage of its results verify. If you're not receiving work then your computer probably isn't on the list of fast reliable hosts.

What can you do to get on that list?

1) Reduce your computer's task turnaround time by keeping a small cache. My computer's turnaround time is 0.43 days and I am receiving work.

2) Make sure your computer doesn't crash tasks.

So look at your computer's details here on the website, check its turnaround time, and see if its results are validating.
ID: 23609
Michael Karlinsky
Joined: 18 Sep 04
Posts: 163
Credit: 1,192,543
RAC: 0
Message 23610 - Posted: 31 Oct 2011, 8:31:23 UTC - in response to Message 23609.  

If you're not receiving work it's probably because the only tasks left in the queue are resends (tasks that failed to verify or returned late).


Guess, or verified info?

Michael

Team Linux Users Everywhere
ID: 23610
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23611 - Posted: 31 Oct 2011, 17:22:17 UTC - in response to Message 23610.  

Discussion in the "Long delays in jobs" thread confirms the resends are being sent to the tail end of the queue, so we can infer that at some point there will be only resends in the queue (unless they create more new WUs).

Since 29 Oct 2011 21:14:21 UTC, I've received nothing but resends so I am very sure there is nothing but resends left in the queue.

The turnaround times for your 4 computers range from 2 days to 2.9 days. I don't know what the turnaround limit is set to on the server, but from your turnaround times and the fact that you're receiving no work, it's plausible that turnaround time must be lower than 2 days to qualify. On the other hand, maybe your client just doesn't want any Sixtrack work at this time. Does the log say "not requesting work" when you update Sixtrack? If it says it's requesting work but doesn't receive any, then it must be that your turnaround times aren't fast enough.

You may have noticed Brian Alexander's turnaround time is 0. Notice also that his computer was attached on Oct. 27 and it has completed only 1 task. From that I assume the server thinks his computer hasn't turned in enough results to determine if it's fast so it doesn't get any work either.

BTW, I think the acceptable turnaround time should be 3 days. 2 seems a little low, but that's for the admins to decide, not me.
ID: 23611
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23612 - Posted: 31 Oct 2011, 17:29:05 UTC - in response to Message 23608.  

I keep getting:
Sun 30 Oct 2011 10:04:21 PM CDT LHC@home 1.0 Message from server: (won't finish in time) BOINC runs 98.9% of time, computation enabled 99.6% of that


I have an AMD Phenom II x4 965 Liquid cooled - 8 Gig of ram Running Linux on SATA II raid arrays... What do you mean it won't finish in time??? It doesn't get too much faster than that


The problem is your computer has only done 1 task and it has not been validated yet. You have a "0 turn around time". I think this will not be increased until that task validates, but I'm not sure about that. You'll either have to wait it out or try a project reset, but I do not know what effect a reset will have (I make no guarantee it will fix anything or not mess anything up).
ID: 23612
Profile littleBouncer
Joined: 23 Oct 04
Posts: 358
Credit: 1,439,205
RAC: 0
Message 23613 - Posted: 31 Oct 2011, 17:49:08 UTC

Why do my computers get no work, even though no invalid result was returned and the server status shows 5'006 ready to send?

Even when I suspend all other projects, I get no work.....

greetz littleBouncer
ID: 23613
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23614 - Posted: 31 Oct 2011, 19:27:50 UTC - in response to Message 23613.  

Why do my computers get no work, even though no invalid result was returned and the server status shows 5'006 ready to send?

Even when I suspend all other projects, I get no work.....

greetz littleBouncer

What messages are you getting? It is hard to answer without more details. It could be something like there being only Linux work left while you have Windows, or the other way around. Have you reached your quota for the day? What client version are you running, and are you pressing Update or waiting for the client to request work naturally? It makes a difference.
ID: 23614
Profile littleBouncer
Joined: 23 Oct 04
Posts: 358
Credit: 1,439,205
RAC: 0
Message 23617 - Posted: 1 Nov 2011, 2:09:17 UTC - in response to Message 23614.  

Why do my computers get no work, even though no invalid result was returned and the server status shows 5'006 ready to send?

Even when I suspend all other projects, I get no work.....

greetz littleBouncer

What messages are you getting? It is hard to answer without more details. It could be something like there being only Linux work left while you have Windows, or the other way around. Have you reached your quota for the day? What client version are you running, and are you pressing Update or waiting for the client to request work naturally? It makes a difference.


Oops, sorry, but you can see all the info on the computers page (they are not anonymous)^^ , OS = Windows 7
Only this message when I start BM...:

Time: UTC+1
01.11.2011 02:58:58 LHC@home 1.0 Sending scheduler request: To fetch work.
01.11.2011 02:58:58 LHC@home 1.0 Requesting new tasks
01.11.2011 02:58:59 LHC@home 1.0 Scheduler request completed: got 0 new tasks
01.11.2011 02:58:59 LHC@home 1.0 Message from server: No work sent

greetz littleBouncer
ID: 23617
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23618 - Posted: 1 Nov 2011, 15:13:58 UTC - in response to Message 23617.  

Why do my computers get no work, even though no invalid result was returned and the server status shows 5'006 ready to send?

Even when I suspend all other projects, I get no work.....

greetz littleBouncer

What messages are you getting? It is hard to answer without more details. It could be something like there being only Linux work left while you have Windows, or the other way around. Have you reached your quota for the day? What client version are you running, and are you pressing Update or waiting for the client to request work naturally? It makes a difference.


Oops, sorry, but you can see all the info on the computers page (they are not anonymous)^^ , OS = Windows 7
Only this message when I start BM...:


Not sure why you linked to my computers?


Time: UTC+1
01.11.2011 02:58:58 LHC@home 1.0 Sending scheduler request: To fetch work.
01.11.2011 02:58:58 LHC@home 1.0 Requesting new tasks
01.11.2011 02:58:59 LHC@home 1.0 Scheduler request completed: got 0 new tasks
01.11.2011 02:58:59 LHC@home 1.0 Message from server: No work sent

greetz littleBouncer


That usually means the server queue is empty. Remember the server status page is cached; the queue can be empty within minutes of showing lots of work, especially now that there are many more users and active hosts.
ID: 23618
Profile Ageless
Joined: 18 Sep 04
Posts: 143
Credit: 27,645
RAC: 0
Message 23619 - Posted: 1 Nov 2011, 15:37:45 UTC - in response to Message 23618.  

Not sure why you linked to my computers?

He didn't. http://lhcathomeclassic.cern.ch/sixtrack/hosts_user.php is the general link to "your" computers. When I click on it, I see mine. ;)


Jord

BOINC FAQ Service
ID: 23619
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23620 - Posted: 1 Nov 2011, 16:01:19 UTC - in response to Message 23618.  
Last modified: 1 Nov 2011, 16:03:49 UTC


Time: UTC+1
01.11.2011 02:58:58 LHC@home 1.0 Sending scheduler request: To fetch work.
01.11.2011 02:58:58 LHC@home 1.0 Requesting new tasks
01.11.2011 02:58:59 LHC@home 1.0 Scheduler request completed: got 0 new tasks
01.11.2011 02:58:59 LHC@home 1.0 Message from server: No work sent

greetz littleBouncer


That usually means the server queue is empty. Remember the server status page is cached; the queue can be empty within minutes of showing lots of work, especially now that there are many more users and active hosts.


When Sixtrack is out of work (as it is now) it says:

Tue 01 Nov 2011 09:46:10 AM MDT | LHC@home 1.0 | (Project has no jobs available)


That proves there was work when littleBouncer requested work. The reason he didn't get any is that his turnaround time is too high. The same has happened to several posters in this thread. It has nothing to do with Linux vs. Windows because my Linux box gets tasks that have been sent to Windows hosts. They do not use homogeneous redundancy here; Sixtrack tasks can and do go to either OS.

littleBouncer's turnaround time is 2.1 days on one of his hosts and 1.82 on his other host. IMHO, the turnaround time requirement seems a little low.
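
To make the distinction between the two server replies concrete, here is a minimal Python sketch. It is not BOINC code: the function name `classify_reply` and the returned labels are invented for illustration, the message strings are copied from the log lines quoted in this thread, and the meaning attached to "No work sent" is only the reading argued in this post.

```python
def classify_reply(log_line: str) -> str:
    """Guess why no task arrived, based on the two replies quoted in this thread."""
    text = log_line.lower()
    if "project has no jobs available" in text:
        # Reply seen when the Sixtrack queue is really empty.
        return "queue_empty"
    if "no work sent" in text:
        # Assumption from this thread: work exists, but this host did not get any.
        return "work_exists_but_none_for_this_host"
    return "unknown"


print(classify_reply("01.11.2011 02:58:59 LHC@home 1.0 Message from server: No work sent"))
print(classify_reply("Tue 01 Nov 2011 09:46:10 AM MDT | LHC@home 1.0 | (Project has no jobs available)"))
```

A real client log can contain other replies (daily quota reached, "won't finish in time", and so on), which this sketch simply reports as "unknown".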
ID: 23620
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23621 - Posted: 1 Nov 2011, 16:54:02 UTC - in response to Message 23620.  


Time: UTC+1
01.11.2011 02:58:58 LHC@home 1.0 Sending scheduler request: To fetch work.
01.11.2011 02:58:58 LHC@home 1.0 Requesting new tasks
01.11.2011 02:58:59 LHC@home 1.0 Scheduler request completed: got 0 new tasks
01.11.2011 02:58:59 LHC@home 1.0 Message from server: No work sent

greetz littleBouncer


That usually means the server queue is empty. Remember the server status page is cached; the queue can be empty within minutes of showing lots of work, especially now that there are many more users and active hosts.


When Sixtrack is out of work (as it is now) it says:

Tue 01 Nov 2011 09:46:10 AM MDT | LHC@home 1.0 | (Project has no jobs available)


That proves there was work when littleBouncer requested work. The reason he didn't get any is that his turnaround time is too high. The same has happened to several posters in this thread. It has nothing to do with Linux vs. Windows because my Linux box gets tasks that have been sent to Windows hosts. They do not use homogeneous redundancy here; Sixtrack tasks can and do go to either OS.

littleBouncer's turnaround time is 2.1 days on one of his hosts and 1.82 on his other host. IMHO, the turnaround time requirement seems a little low.

Oops, sorry, I was up till 3am last night and had to come in early to work today; my brain is fried.

The current turnaround limit is 129600 seconds (1.5 days). This is what the scientists want, which would not be a problem if the mechanism functioned correctly. And considering over 5,000 tasks disappeared in less than a day, there are ample computers to handle resends. Normally these should be sent out earlier as needed and not all be saved up, so this problem at the end would not be noticed. If things worked correctly, the quicker hosts would get resends instead of normal work, and normal work would be sent to the slower hosts, so more hosts would get work.

I think also new work was held up so the queue could empty out all the tasks backlogged that have been waiting over 10 days to resend (they are now over 17 days old at least from when they started) because the scheduler mechanism is not working. We think it is an older scheduler and somewhere after that the way it handles resends was changed, so the options in the docs are for a newer scheduler and the one in use does not recognize them, so it malfunctions.

Igor has plans to do an update when he can find time in his schedule. There was some reason this was being held off, but since T4T did a successful one, that reason may no longer apply and an update can proceed, time permitting.
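
As a rough illustration of the 1.5 day limit, the check below compares the turnaround times mentioned in this thread against 129600 seconds. It is a sketch, not the scheduler's logic: the names are hypothetical, the choice of "at or below the limit" is an assumption, and the separate reliability (validation rate) requirement is not modelled.

```python
RESEND_TURNAROUND_LIMIT_S = 129600  # 1.5 days, the limit quoted in this post
SECONDS_PER_DAY = 86400


def fast_enough_for_resends(avg_turnaround_days: float) -> bool:
    """Assumed rule: average turnaround must be at or below the limit."""
    return avg_turnaround_days * SECONDS_PER_DAY <= RESEND_TURNAROUND_LIMIT_S


# Turnaround times (in days) reported earlier in this thread.
for days in (0.43, 1.82, 2.1):
    print(days, fast_enough_for_resends(days))
```

Under that assumption only the 0.43 day host qualifies, which matches which posters report receiving resends.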
ID: 23621
Profile littleBouncer
Joined: 23 Oct 04
Posts: 358
Credit: 1,439,205
RAC: 0
Message 23622 - Posted: 1 Nov 2011, 17:29:11 UTC
Last modified: 1 Nov 2011, 17:33:47 UTC

Thanks guys for any reply^^

I must say LHC is not the only project my machines are crunching. They also work for Einstein and pirates. For that I set "connect to internet every" to 2 days and the cache to 0.25 days -> that will surely return the work within 2.1 days.

Before I wrote my first post here, I tried (and still try) to get work with the following preferences: "connect to internet every" 0 days and cache at 0.25. I suspended all other projects except LHC 1.0, then I updated my BM client, with the result: message from server: no work sent.

It seems that the old turnaround data is still in effect, so I have to detach and reattach when LHC 1.0 has new work.

Sorry for my English, I normally speak German^^
greetz littleBouncer

BTW: sorry for the 'bad link', I wanted to show my computers data
ID: 23622
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23623 - Posted: 1 Nov 2011, 17:31:23 UTC - in response to Message 23621.  

The current turnaround limit is 129600 seconds (1.5 days). This is what the scientists want, which would not be a problem if the mechanism functioned correctly. And considering over 5,000 tasks disappeared in less than a day, there are ample computers to handle resends.


Over 5,000 in less than a day is impressive. If anybody wants to get in on the resends all they need to do is decrease their cache and get their turnaround time below 1.5 days, easily done.
ID: 23623
Profile littleBouncer
Joined: 23 Oct 04
Posts: 358
Credit: 1,439,205
RAC: 0
Message 23624 - Posted: 1 Nov 2011, 17:41:31 UTC - in response to Message 23623.  

The current turnaround limit is 129600 seconds (1.5 days). This is what the scientists want, which would not be a problem if the mechanism functioned correctly. And considering over 5,000 tasks disappeared in less than a day, there are ample computers to handle resends.


Over 5,000 in less than a day is impressive. If anybody wants to get in on the resends all they need to do is decrease their cache and get their turnaround time below 1.5 days, easily done.


Decreasing the cache alone doesn't help. You have to detach and then reattach after decreasing it, because the former turnaround time is the parameter in use!!
ID: 23624
Profile Krunchin-Keith [USA]
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Joined: 2 Sep 04
Posts: 209
Credit: 1,482,496
RAC: 0
Message 23625 - Posted: 1 Nov 2011, 18:11:35 UTC - in response to Message 23623.  

The current turnaround limit is 129600 seconds (1.5 days). This is what the scientists want, which would not be a problem if the mechanism functioned correctly. And considering over 5,000 tasks disappeared in less than a day, there are ample computers to handle resends.


Over 5,000 in less than a day is impressive. If anybody wants to get in on the resends all they need to do is decrease their cache and get their turnaround time below 1.5 days, easily done.

It was more like, less than 14 hours. Considering that and that we now have over 7000 users and 11,000 hosts (with recent credit) I think there is enough.

What we need to do is first get the entire mechanism functioning properly before trying to adjust the time requirements for the "reliable hosts" any further. Consider too the 1/CPU limit and the short compute time, 8 hours for average work, if it is started promptly that is. The resends are going to come from several sources (abort, detach, inconclusive, timeout), but the longest is the timeout because a host didn't start the task within the 7 days. Most likely the 2nd result is already done, so waiting any longer for the third to complete is what the scientists want to avoid. We have allowed another 3.5 days deadline, but are hoping that most of the "reliable hosts" will return it faster. So under normal circumstances most work should be completed within 8 days, reducing the time batches go on. As it is now, some are older than that because of the 10 day delay between the timeout and the resend, which is not good; that makes some of the tasks over 18 days old from when they started. I believe once this delay problem is solved, the "no more work" issue will decrease too. There will not have to be a hold-up to submit new work. The old work will still clear out first, with resends mixed in along the way.
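
The timeline described above can be checked with a little arithmetic. The sketch below is only illustrative: the constant and function names are made up, and the figures (7 day primary deadline, 3.5 day resend deadline, 8 hours of typical compute, 10 day resend delay) are the ones quoted in this post.

```python
from datetime import timedelta

PRIMARY_DEADLINE = timedelta(days=7)        # primary task times out after this
RESEND_DEADLINE = timedelta(days=3.5)       # extra deadline allowed for a resend
TYPICAL_COMPUTE = timedelta(hours=8)        # average work, if started promptly
OBSERVED_RESEND_DELAY = timedelta(days=10)  # current gap between timeout and resend


def age_at_completion(resend_delay: timedelta, resend_turnaround: timedelta) -> timedelta:
    """Age of a work unit when a resend issued after a primary timeout comes back."""
    return PRIMARY_DEADLINE + resend_delay + resend_turnaround


no_delay = timedelta(0)
# Intended behaviour, prompt reliable host: 7 days 8 hours, i.e. within 8 days.
print(age_at_completion(no_delay, TYPICAL_COMPUTE))
# Intended behaviour, resend uses its whole deadline: 10.5 days.
print(age_at_completion(no_delay, RESEND_DEADLINE))
# Current behaviour with the 10 day delay: already past 17 days.
print(age_at_completion(OBSERVED_RESEND_DELAY, TYPICAL_COMPUTE))
```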
ID: 23625
candido

Joined: 6 Dec 10
Posts: 9
Credit: 452,259
RAC: 0
Message 23626 - Posted: 1 Nov 2011, 21:02:19 UTC - in response to Message 23625.  

When do you think we will have more work?
thanks
ID: 23626
Profile Gary Roberts

Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 23627 - Posted: 2 Nov 2011, 1:40:13 UTC - in response to Message 23621.  

... This is what the scientists want, which would not be a problem if the mechanism functioned correctly....

When the Devs are looking at whether or not things are functioning correctly, perhaps they might like to consider examples like this quorum from which a number of interesting observations can be made. At first glance it looks quite straightforward - two deadline misses, one of which was completed 3 hours after the deadline so one of the two resends that were created wasn't actually needed.

But take a closer look. The two deadline misses were created around 6:12 UTC on 27 October. At this point, two 'unsent' resend tasks would have been added to the quorum. Three hours later at 9:11 UTC, one of the deadline misses was then returned. At that point one of the 'unsent' resends should have been cancelled with a status of 'didn't need'. I've actually seen this happen at Einstein so I know BOINC can do this.

Those two 'unsents' sat around for 4 days and were finally sent out on 31 October. It was nearly a day later on 1 November before the extra task was actually cancelled with a status of 'redundant result'. So why did it take so long for the system to realise that only one resend would actually be needed? It should have been able to make that decision back on 27 October at 9:11 UTC. A BOINC bug, I guess.

On the more general question of how to achieve the more rapid finalisation of batches of work, I think it would be beneficial to 'educate' the volunteers about how best to contribute if they really want to. Just take a look at the quorum I linked to above. Drill down to the details page for the two hosts who received the primary tasks. Notice that those hosts have turnaround times of 4.37 and 6.65 days respectively. So both those hosts are rather unsuited to the needs of the scientists. If you can put a 1.5 day turnaround limit on resends, why not put say a 4 day limit on primary tasks? If people insist on having 7 day work caches then the project should decline their offer to do work under such unsuitable conditions. There could be a big, obvious notice about the limitation on the home page and the message in BOINC Manager could say. "No work sent - your turnaround time is longer than 4 days. To get work, please reduce your work cache size and detach/reattach." People could then decide if they really wanted to support the project or not.

The actual primary deadline puzzles me. It's often quoted as 7 days but the deadline miss task in the above quorum gives a precise figure of 6 days 15 hours 32 minutes and 14 seconds (or 574334 seconds). Why such odd-ball numbers? If the scientists really need to finalise a run quickly, why not reduce the deadline a little (to, say, 5 days) to prevent the last primary tasks issued from potentially hanging around for up to 7 days before any 'deadline miss' resends for them could be sent out? Sure, some people will complain, but probably will get used to it when they understand the needs of the project.
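
For anyone wanting to check that figure, here is a small Python sketch (the helper name `split_seconds` is just for illustration) that breaks 574334 seconds into days, hours, minutes and seconds and shows how far short of a full 7 days it falls.

```python
def split_seconds(total_seconds: int) -> tuple[int, int, int, int]:
    """Break a duration in seconds into (days, hours, minutes, seconds)."""
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds


print(split_seconds(574334))              # (6, 15, 32, 14)
print(split_seconds(7 * 86400 - 574334))  # (0, 8, 27, 46): the shortfall from 7 days
```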

Cheers,
Gary.
ID: 23627
Amauri

Joined: 2 Sep 08
Posts: 5
Credit: 116,142
RAC: 91
Message 23628 - Posted: 2 Nov 2011, 2:25:47 UTC - in response to Message 23627.  

If the scientists really need to finalise a run quickly, why not reduce the deadline a little (to, say, 5 days) to prevent the last primary tasks issued from potentially hanging around for up to 7 days before any 'deadline miss' resends for them could be sent out? Sure, some people will complain, but probably will get used to it when they understand the needs of the project.


Why not send each WU to 3 computers instead of 2, with a minimum quorum of 2, keeping the deadline at 7 days? Yes, there will be some wasting of computational resources, but with fewer resends and faster finalisation.
ID: 23628
Profile jujube

Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 23629 - Posted: 2 Nov 2011, 3:17:54 UTC - in response to Message 23628.  
Last modified: 2 Nov 2011, 3:18:32 UTC

Why not send each WU to 3 computers instead of 2, with a minimum quorum of 2, keeping the deadline at 7 days? Yes, there will be some wasting of computational resources, but with fewer resends and faster finalisation.


There would be more than "some" waste of computational resources, there would be a huge waste of resources. Also, there is absolutely no way you can get a big job done faster by unnecessarily repeating parts of the job. Your plan would not only waste a huge amount of CPU time, it would actually make the job take longer.

The best way to get the job done quicker is to wait and see if a WU needs a resend and then issue the resend to a fast reliable host and/or reduce the deadline.

Reducing the deadline to 5 days would not eliminate hosts with a 6.6 day turnaround time as Gary R. seems to think. Those hosts would still receive tasks if the scheduler thinks they are able to complete the tasks before the deadline. I doubt that any of the hosts attached to this project needs more than 4 days to crunch a Sixtrack task so there is no reason a 5 day deadline wouldn't work for everybody.

Until the admins/scientists decide whether or not the last batch of WUs completed fast enough there isn't much point in worrying about shorter deadlines.
ID: 23629
Amauri

Joined: 2 Sep 08
Posts: 5
Credit: 116,142
RAC: 91
Message 23630 - Posted: 2 Nov 2011, 5:24:15 UTC - in response to Message 23629.  
Last modified: 2 Nov 2011, 5:32:08 UTC

Reducing the deadline to 5 days would not eliminate hosts with a 6.6 day turnaround time as Gary R. seems to think. Those hosts would still receive tasks if the scheduler thinks they are able to complete the tasks before the deadline. I doubt that any of the hosts attached to this project needs more than 4 days to crunch a Sixtrack task so there is no reason a 5 day deadline wouldn't work for everybody.


You'd be correct if all computers ran 24/7... With a deadline of 5 days and a turnaround of 6.6 days, you'll have a lot more tasks to resend, and the scheduler won't send any new tasks to computers that run, e.g., 12 hours a day, 5 days a week. If someone downloads a task on a Friday and doesn't run it on Saturday and Sunday, they will have only 3 days until the deadline. A bigger chance of not completing it in time... And one more task to resend. And another user excluded.
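
The Friday example can be counted out in a few lines of Python. This is only a sketch under simple assumptions: the function name is hypothetical, the download day counts as a full run day, the deadline window is measured in whole calendar days, and a weekday-only host never runs on Saturday or Sunday.

```python
from datetime import date, timedelta


def run_days_before_deadline(download: date, deadline_days: int) -> int:
    """Count weekdays (Mon-Fri) in the window starting on the download day."""
    count = 0
    for offset in range(deadline_days):
        day = download + timedelta(days=offset)
        if day.weekday() < 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
            count += 1
    return count


friday = date(2011, 10, 28)  # a Friday
print(run_days_before_deadline(friday, 5))  # 3 run days with a 5 day deadline
print(run_days_before_deadline(friday, 7))  # 5 run days with a 7 day deadline
```

At 12 hours a day, that is 36 hours of possible run time under a 5 day deadline versus 60 hours under a 7 day one.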
ID: 23630


©2019 CERN