21) Message boards : Number crunching : Long delays in jobs (Message 23552)
Posted 19 Oct 2011 by Profile Gary Roberts
Post:
@Gary Roberts, from Igor's message 9th Oct

Thanks for taking the trouble but I'm not sure why you felt you needed to quote that message at me. I was fully aware of it and had digested it when I first read it some time ago. I'd stated earlier in this thread that I'd seen plenty of examples of 'shortened deadline' resends.

Looks like from that quorum of yours ....

Well, no, that quorum I linked to has nothing to do with the fast reliable hosts experiment. I highlighted it to draw attention to what appears to be a BOINC server bug. That quorum has been completed but one of its members has been left in an inconsistent state. The task deemed to be invalid has been left as 'pending' and I imagine this might prevent the quorum from being deleted at the appropriate time. In the past there were plenty of examples of pending tasks left cluttering up the database long after the main body of tasks had been deleted. I don't want to see that happening again.

... and a couple of tasks sent my way that the reliable/trusted host update kicked in (or finally got around to the tasks) on the 15th.

I'm sure you are correct and those quorums you linked to had short deadline resends which were sent to you. However, since they are now completed and the green deadline information has been replaced with the actual reporting times in black, it is no longer possible for us to see what the deadline was at the time the resends were first issued. That's not a problem, of course; it just means the shortened deadline can only be observed at the time and not later.

The criteria for a fast reliable host might need a bit of tweaking if this and this occur with any frequency. In these two examples, a short deadline resend was issued when a primary task failed. In both those cases, the first resend timed out and that triggered a second short deadline resend. It's quite possible that the first resend might get returned late and so complete the quorum. If that happens, I trust the second resends will still be awarded credit if they get completed before their own shortened deadlines. I was interested to note that the computer used for the first resend in the first of the examples given had a turnaround time of close to 4 days - hardly what you would call 'fast and reliable' :-).
22) Message boards : Number crunching : no more work? (Message 23551)
Posted 19 Oct 2011 by Profile Gary Roberts
Post:
... it has been 2+ days ....

If it gets to 2+ months, it might be reason to complain :-).

This project provides work in batches. The scientists need the work for a batch to be finished, returned and analysed so that the parameters for the next batch can be determined. It is quite normal for some time to be needed for this before a new batch can be launched. The best way to minimise this time is to minimise the time it takes to clean up the dregs of the previous batch.
23) Message boards : Number crunching : Long delays in jobs (Message 23546)
Posted 18 Oct 2011 by Profile Gary Roberts
Post:
This quorum represents an interesting example of what appears to be a bug in the BOINC server software. You can see that 4 copies were sent out before 2 were selected as agreeing sufficiently to allow validation. By tracing the issue and subsequent return times, we can see that, from the initial pair, there was no agreement and both would have been marked as "inconclusive" when first returned. The first resend was issued more than 4 days later and it was fairly quickly aborted. About 12 hours after the aborted task was returned, it was reissued, this time to one of my hosts which promptly completed and returned it.

What should have happened at this point is that the first two "inconclusives", together with the freshly completed 4th task, should have been re-assessed and the non-agreeing one marked as invalid, rather than being left as a "pending inconclusive". The project doesn't seem to be using fully up-to-date server software so, if this is indeed a bug, it may well have been corrected in a later version. If not, it should be reported to the BOINC Devs.
24) Message boards : Number crunching : Long delays in jobs (Message 23545)
Posted 18 Oct 2011 by Profile Gary Roberts
Post:
Earlier in this thread, I linked to two examples where aborted tasks didn't immediately result in resends being issued. They were prepared but remained unsent until all primary tasks had been issued. In both cases there was a 4 day delay in issuing the resends. Initially I thought this would hamper the cleanup of the dregs of the run.

When there were no primary tasks left, there was quite a spike of resends becoming available, presumably to those hosts regarded as "fast and reliable", and this seemed to go rather well. My own hosts got quite a few with the reduced deadline and they were completed quite quickly. The two examples I linked to were also done quite quickly, even though one was by a 'slow' host (by modern standards) :-).

So the policy seems to have worked well but it would be good to get some admin feedback on how much of an improvement there actually was. With the additional small run of new tasks injected several hours ago but now (apparently) exhausted, it's a bit hard to use the status page as an indication.
25) Message boards : Number crunching : Long delays in jobs (Message 23543)
Posted 18 Oct 2011 by Profile Gary Roberts
Post:
I am going to stick my nose in here just one more time....

I wish you wouldn't :-). You started spruiking the project in the Cafe and that's all fine and laudable there but rather out of place here. It seems you even realise you are continuing to hijack this thread and yet that doesn't seem to bother you.

This thread was specifically started by a mod to air thoughts and ideas about how to deal with the slow returning 'tail' of tasks that always seems to slow down the completion of a run. By users making observations, we now know that 'resend' tasks are only sent out after the main body of work is exhausted. Initially this seemed to be a curious policy but the fact that resends are sent out with shortened deadlines to supposedly trusted and fast hosts seems to make up at least partially for the delay in issuing them. Do you have any thoughts about that? It would certainly be appropriate to post them here.

As part of your evangelism, you have made a couple of curious statements. In your first post in this thread, you stated:
....SixTrack needs to find a way of eliminating the crunching farms.

You implied that crunching farms were responsible for the lack of work for your recently added sixth machine. And yet, with six machines, you're really a farmer yourself :-)! You mightn't actually have a cattle ranch (yet) but you're at least on the way :-). Let me let you in on a little secret! All projects love farms for all the obvious reasons. You are never going to convince any project that farms should be eliminated.

You also stated:
The strength of any project is not measured by numbers of machines, it is measured by numbers of users. If I am one user with 100 machines, any time I like I can simply up and quit any project, and poof, 100 machines are gone.

There are two things wrong with that. Firstly, the strength of a project is entirely dependent on machines and not users. A project doesn't really care how many people actually own the attached computers. If a project had the option of accepting 1 new user with 1000 computers OR 500 new users each with a single machine, which offer do you think they would take? :-)

Secondly, you vastly overrate the effect of 1 user (100 computers) leaving. It would actually be like a drop in the ocean even for this project. Take a look at the server status page. At this point it shows that 240 new computers came on line in the last 24 hours. So 100 computers leaving wouldn't even be noticed. And the other thing you are ignoring is that the owners of big farms don't move their machines for capricious reasons. They have very clear reasons for choosing a project in the first place and they only tend to leave for a limited number of reasons, mostly to do with work supply. Of course there are other reasons like large teams having races using a particular project for a limited period. Do you really think that projects would be upset about this simply because the large influx is going to vanish at the end of the race period? Of course not! They view the work done during the race as an added bonus.

In a followup post you stated:
For me, the signal question is why out of 92,000 total users, the project was down prior to this latest burst of energy to about 2500 users
....

The question is why so many left and how we keep going in our current direction.

That's really two separate questions and the answer to both seems rather obvious. People left because there was essentially no work (or even signs of life) for many, many months. They became disenchanted with the lack of even basic snippets of information that might have kept up their interest. They took the view that they were being ignored. Secondly, we keep going in the current direction simply by maintaining ongoing work if possible and advising users in advance if there are likely to be periods of no work. The current rapid progress shows one thing clearly - people are interested in the project and are willing to forgive past poor behaviour. We don't (yet) need a PR campaign to get users - people are rapidly returning of their own accord. In fact, a PR campaign could easily be counterproductive if the project is unable to keep supplying work as it is currently doing. Far better to think about a campaign when (if) the project gets to the stage where the ready-to-send tasks obviously can't be handled by the attached hosts. We are quite a way from that point at the moment.

Hopefully we will get there at some stage and then you can start your own PR thread in the cafe. I suspect that the interest in the LHC will generate the extra users anyway without too much PR being needed.
26) Message boards : Number crunching : Tasks v530.09 crashing (Message 23488)
Posted 13 Oct 2011 by Profile Gary Roberts
Post:
EDIT: This is the tasks list if you want to follow its progress.

That host has been crunching for close to a day now and has already returned several tasks, a couple of which have already validated. The two longest running tasks took just over 7 hours (~26Ksecs) and there are two more partly completed that, at the current rate of progress (e.g. 50% complete in 3.5 hrs and 15% complete in 1 hr), will also take about the same. It's way too early to reach final conclusions but it does seem likely that 26Ksecs might turn out to be the time taken for 'full running' tasks on this host. If so, I'm sure glad that host is not running version 530.10 :-).

There's one annoying downside to this which I should mention before people complain about it. You can understand the problem if you take a look at this particular quorum, which contains the validated 26Ksecs task from my host and the task of the wingman, which just happens to be an i7-2600K (and perhaps overclocked as well, although it does have HT enabled). Notice that the wingman's task took around 30 mins longer to crunch and the OS was Windows 7. I wouldn't think that my host should be able to compare so favourably with an i7-2600K (even allowing for the HT impediment) unless the 530.10 app is not using the full CPU capabilities. Because the two finishing times are close enough, the claims for credit aren't too disparate. Imagine what would have happened if the wingman host had been running Linux and had taken twice as long. Its claim would have been doubled (close to 300) and, when averaged with my 122 claim, would have resulted in an award of around 200 or so. I'd be laughing but my wingman would be crying :-).

So the problem is that all Linux hosts running the 530.10 app will be severely penalised credit-wise if they happen to be matched up with either a Windows wingman or one running the Linux 530.9 app (like my host). I'm sorry that Linux hosts will be affected by this but I think shorter running times are much more important so I'll continue (unless the Admins think otherwise) to use the 530.9 app for as long as it takes to fix the problem. Those participants with Linux hosts with the correct CPU capabilities (SSE2 and above, it would seem) could consider running the 530.9 app under AP (anonymous platform), particularly if the Admins approve and perhaps publish the proper instructions. To assist with this, I've already sent some details to Keith.
27) Message boards : Number crunching : Tasks v530.09 crashing (Message 23472)
Posted 12 Oct 2011 by Profile Gary Roberts
Post:
Out of interest, I've just grabbed the 530.9 linux app from the download directory and created an app_info.xml file to run it under AP on a linux laptop that is running E@H tasks at the moment. It's a 2.6GHz Core 2 Duo so it should have a reasonable turnaround time. I've just launched it and it has downloaded two tasks. With 50/50 resource shares, one core should be doing LHC tasks all the time. I'll see how it goes over the next couple of days.
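
For anyone wanting to try the same thing, here's a rough sketch of the sort of app_info.xml I mean. Treat it as a guide only - the app name ("sixtrack"), the version number encoding and especially the executable filename are my assumptions and need to be checked against what's actually sitting in the download directory before you rely on it:

    <app_info>
      <app>
        <name>sixtrack</name>
      </app>
      <file_info>
        <!-- example filename only - use the real name of the 530.9 executable you downloaded -->
        <name>sixtrack_530.09_i686-pc-linux-gnu</name>
        <executable/>
      </file_info>
      <app_version>
        <app_name>sixtrack</app_name>
        <version_num>53009</version_num>
        <file_ref>
          <file_name>sixtrack_530.09_i686-pc-linux-gnu</file_name>
          <main_program/>
        </file_ref>
      </app_version>
    </app_info>

The file goes in the project directory alongside the executable (named to match the file_name entries above) and BOINC needs to be restarted before it will notice it. If I've got any of that wrong, hopefully someone will correct me.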

At 2.6GHz and the older architecture, there's no way it should be able to keep up with your Sandy Bridge. So, if my max crunch time is significantly less than the 55K secs you are getting, the benefits of running 530.9 for the appropriate CPU capabilities will be proved.

EDIT: This is the tasks list if you want to follow its progress.
28) Message boards : Number crunching : Tasks v530.09 crashing (Message 23467)
Posted 12 Oct 2011 by Profile Gary Roberts
Post:
IMHO, 530.08 was definitely slower on Linux by a factor of 2 or more. With 530.09 Linux and Windows were very close. Now 530.10 is showing a big difference again.

Your tasks list for your Linux host seems to show that quite clearly, seeing as you have a 2X disparity in maximum run time and a large enough sample of tasks for both versions to justify the conclusion. The 530.9 version is readily available in the download directory and you could easily run it under AP (anonymous platform) if you wanted to regain your higher throughput. I've just started running classic on a couple of Windows hosts and I'm considering loading it onto some Linux hosts as well. If I do, I'll certainly use AP until the problem is solved.

@Eric, the Devs at Einstein@Home support different apps for different CPU capabilities quite transparently. A host joining up will be sent a version of the app that is matched to its CPU capabilities. If you ask them how they did it, it may help you solve your problem.
29) Message boards : Number crunching : SixTrack and LHC@home status (Message 23466)
Posted 12 Oct 2011 by Profile Gary Roberts
Post:
Are you sure about this?
Tasks in high priority can occupy the processor longer than a normal task would but when that task is finished your other projects will receive above normal CPU time to balance the extra CPU time the high priority task received.

Yes, as long as you take a long-term view, say around a month, rather than just a few days or so. BOINC keeps track of just how much time each project has had and will try (if not interfered with, and when the reason for high priority (HP) has passed) to pay back the CPU time needed to honour the resource shares you have chosen. It may not always be successful at this. You should not underestimate the ability of computer owners to choose an "impossible" combination of things like mix of projects, the variability of project deadlines, resource shares, work cache size, time on fraction, work availability, etc. :-). If you see BOINC using HP mode frequently, it's a pretty good indication that you are using 'difficult' settings. Often simply reducing the work cache size, particularly if you support quite a few projects, will lower or remove the use of HP mode.

The ability to use HP mode isn't really a 'problem' - it's a very good feature that allows BOINC to better manage things. People often expect every project in their 'mix' to always have tasks on board and will sometimes increase the work cache size to try to 'force' this. They will also increase the cache size in an attempt to outlast project outages. That's fine if you run a single project or a very small number of them, but a recipe for problems if you want to run several. If you run quite a few projects you should minimise the work cache size and allow BOINC to download new tasks just when they are really needed. That way you lessen the risk of having lots of 'stale' tasks that end up either missing deadlines or having to be processed in panic mode. BOINC will be much better able to honour your desired resource shares that way.
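
For anyone who would rather set the work cache by hand than hunt through the preferences pages, a minimal global_prefs_override.xml along these lines keeps it small (the values shown are purely an illustration, not a recommendation):

    <global_preferences>
      <!-- keep the cache small: roughly 0.1 days minimum plus 0.25 days extra -->
      <work_buf_min_days>0.1</work_buf_min_days>
      <work_buf_additional_days>0.25</work_buf_additional_days>
    </global_preferences>

The file lives in the BOINC data directory and overrides the website preferences for that host. If I remember correctly, BOINC Manager's "Read local prefs file" option (or simply restarting BOINC) will pick it up, and the same two settings can also be changed through the Manager's preferences dialog if you'd rather not edit files at all.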
30) Message boards : Number crunching : Long delays in jobs (Message 23464)
Posted 12 Oct 2011 by Profile Gary Roberts
Post:
You'll just have to run some batches, and see if in the end they come back quicker.


Another way to test to see if it's working is to get a new Sixtrack task, abort it (hopefully before you start it), then check to see whether the deadline on the resend is shorter than normal or just normal. Of course that doesn't tell if the resend is issued with a high priority flag on it, just if the deadline is shorter.

Darn, they're out of work atm so can't test it now.

Here is an example of a workunit where two copies of the task were issued at approximately the same time and one of those copies was aborted very soon afterwards. It's now about 18 hours after the event and a resend task still remains unsent. Irrespective of the priority level or shortened deadline of a resend, I would have thought it would be good to issue resends rather promptly.

This is another example showing exactly the same (no resend after many hours) behaviour.
31) Message boards : Number crunching : pending credit from 2005 (Message 19513)
Posted 22 Apr 2008 by Profile Gary Roberts
Post:
any way to clean these up?


And while we're into keeping things clean and tidy, I wonder if there's a way to eliminate the clutter caused by people who insist on creating new threads rather than adding their complaint to one of the large number of already existing threads on this very topic?


32) Message boards : Number crunching : The new look bugs (Message 18642)
Posted 27 Nov 2007 by Profile Gary Roberts
Post:
Any hope of getting jump to first unread post fixed?

What she said :)

What they said. :-)

LOL. I'm glad my request is so popular. :-)

Who said what?


Kathryn wrote:
Any hope of getting jump to first unread post fixed?


Another request for this please.

33) Message boards : Number crunching : Server Status (Message 17762)
Posted 10 Aug 2007 by Profile Gary Roberts
Post:
... why is your font always like that?!


You too can have the font of your dreams :).

Just try a few BBCode tags to reflect your personal style :).

34) Message boards : Number crunching : Server Status (Message 17756)
Posted 9 Aug 2007 by Profile Gary Roberts
Post:

Server Status


Up, Out of work
150 workunits in progress
255 concurrent connections


More concurrent connections than WU's........


Do you realise that "out of work" means exactly what it says - there are zero workunits available to send?

The concurrent connections figure has nothing to do with available work. It's just the number of machines "banging on the door" at that instant in time :).

The "workunits in progress" has nothing to do with available work. It's simply a reminder that there are slow (or more likely overloaded) machines out there that haven't yet got around to crunching and returning what they were previously sent - perhaps many days ago. Of course it's also a measure of "resends" that were sent out more recently for work that has missed the deadline or errored out in some way. However, none of that is "available" now, unless there happens to be a need for an odd further resend or two.


It appears that the WU's went to the wrong computers this time.


Why do you think that?


(I should have got 100 instead of just 2 that I finished in 3.5hrs)


You obviously need to stick your hand up higher next time :).



35) Message boards : Number crunching : wrong url ? (Message 17591)
Posted 29 Jul 2007 by Profile Gary Roberts
Post:

So, here is the question again: Do I really need to detach / reattach all my boxes ?


As others pointed out, detaching/reattaching plus subsequent merging is one way of solving the problem. You end up with the newest host ID and you lose track of when that host originally joined the project. Also, there is probably some risk of not being able to merge and therefore a risk of losing the individual stats for that host. I'm not really sure how significant that risk is as I haven't taken that path and tried it out.

Here is what I did which worked for me:-


  • Stop BOINC
  • Correct the string for the master_url tag in three files: account_lhc*.xml, client_state.xml and statistics_lhc*.xml (a sketch of the change is shown just after this list)
  • In the projects folder, change the name of the lhc subdir to lhcathome.cern.ch_lhcathome
  • In the main BOINC folder make a similar name change to 5 files, account_lhc*.xml, master_lhc*.xml, sched_reply_lhc*.xml, sched_request_lhc*.xml and statistics_lhc*.xml
  • Restart BOINC
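
As a rough illustration of the master_url step, the relevant line inside the account file (and the matching entry in client_state.xml) ends up looking something like this - please check the exact URL against what the project website itself reports rather than trusting my memory:

    <account>
      <master_url>http://lhcathome.cern.ch/lhcathome/</master_url>
      <!-- the rest of the file (authenticator, project name, prefs) stays exactly as it was -->
    </account>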



Some of the filename changes are cosmetic, i.e. they avoid having new files created while the old ones are still hanging around. The name change for the project subdir also prevents an unnecessary new copy of all the executables from being downloaded.

All in all, it just takes a couple of minutes and I get to keep my old host ID, and my individual host stats on the statistics tab of BOINC Manager.

36) Message boards : Number crunching : Initial Replication (Message 17590)
Posted 29 Jul 2007 by Profile Gary Roberts
Post:
... swap back to other projects that have WUs on a regular basis - no electricity saved there ......


You're absolutely right! More science gets done but no electricity is saved. It's just like designing more fuel efficient engines. People will probably just travel more miles and no fuel will likely be saved so we don't need those better engines either :).

As to wasting the time - that's a personal opinion, and one I don't hold. I'm happy to have LHC WUs, myself.


I'm very happy to have meaningful work from LHC. It's just that I'd like to see some other science project have the benefit of the time spent on 40,000 redundant LHC results. Just a personal opinion ... :).

37) Message boards : Number crunching : Initial Replication (Message 17589)
Posted 29 Jul 2007 by Profile Gary Roberts
Post:
.... cows, and they don't take kindly to corks!


Ahhh yes! We have a lot of talk about clean coal technology so maybe we need a clean and green cow technology ;).

But what do we do about flatulent humans ... :).


38) Message boards : Number crunching : Initial Replication (Message 17575)
Posted 27 Jul 2007 by Profile Gary Roberts
Post:
and now people are complaining that there are too MANY work units. there is just no pleasing BOINC users.....



Did you actually understand the message you were responding to? The OP didn't say anything about work units. His comment was about initial replication (IR).

I think I saw Neasan mention somewhere that there were about 20,000 work units in the last series. Whether the IR is 1, 3, 5, or 101, there are still 20,000 WUs. So the OP wasn't complaining about too many WUs as you assert.

The OP has actually raised an issue (admittedly debated many times in the past) that is probably worthy of further debate due to changing circumstances. As I understand it, a couple of years ago the scientists wanted quick answers so they could get on with the design, with each stage depending on the answers to the previous simulations. I would imagine that things are rather different now as the time for commissioning should be fast approaching.

To get quick answers back then, the scientists opted to be rather profligate with crunching resources. Do they really need to be that way now? With the current push to limit greenhouse emissions, I would have thought that now is the time to cut back on unnecessary crunching. With an IR of 5, there are 100,000 results in the last series. With an IR of 3 there would be only 60,000 and a lot of electricity would be saved.

Sure, more will be sent out to cover any that fail to return by the deadline or fail to validate, but that happens anyway. To cut down on the total time to get all results back, just reduce the deadline from 7 days to 4 days so that a 4th result gets sent out more quickly when needed. Recent versions of BOINC seem to be good at not sending work where there is a risk of it not completing in time, so a 4 day deadline would restrict those people who were deliberately trying to fill huge caches. Neasan mentioned at one stage that he was giving some thought to ways to stop people being greedy by setting large caches. A nice short deadline would certainly add some encouragement to limiting the number of results per CPU that people attempt to get.

39) Message boards : LHC@home Science : WU's (Message 17367)
Posted 16 Jul 2007 by Profile Gary Roberts
Post:

15/07/2007 11:02:04 PM|lhcathome|Requesting 864000 seconds of new work


Yes, you were lucky to be in the right place at just the right time :).

This project rarely has work and when there is work, the deadline is usually 7 days. It's not really a good idea to set your cache to 10 days (864000 seconds) because of the possibility that you could get work that you can't even start in time, let alone return. Even if you could do it in time, much of it would be "wasted effort" because the quorum would have already been completed on many work units well before you would be able to start.

It's much more "project friendly" to set a reasonable cache size.

40) Message boards : Number crunching : Can't Access Work Units (Message 17343)
Posted 13 Jul 2007 by Profile Gary Roberts
Post:
Jaysus, we barely get it fixed and the work disappears.



What did you expect with such a small number released? :).


