1) Message boards : Number crunching : Why, oh why, oh why?? (Message 24777)
Posted 4 Sep 2012 by Profile Gary Roberts
Post:
... This is an arrogant behaviour, more akin to bullying ....

If you are going to make such inflammatory statements, it really pays to get your facts straight. Since your main project seems to be Seti, you should try to understand the consequences of your decision to choose LHC as a secondary project.

I'm guessing you have a fairly large cache setting in order to cope with periods of no work from Seti. I'm also guessing that you have given Seti the bulk of the resource share so that when Seti work is available, your host will prefer it. If both of those are true, you will make it very difficult for BOINC to cope if your secondary project(s) have very short deadlines. When Seti can't supply, your client will probably try to load up with extra work from a project that can. With a low resource share for the secondary project, BOINC will then need to run that extra work in high priority mode to get it done by deadline if plenty of Seti work turns up later.
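
To make the interplay concrete, here is a rough sketch of the idea (this is only my illustration of the principle - it is not BOINC's actual scheduler code, and all the numbers are made up):

    # Rough sketch of why a big cache plus a short-deadline secondary project
    # forces high-priority (panic) mode. Illustrative only - not BOINC's code.
    def needs_high_priority(tasks, now=0):
        """tasks: list of (remaining_cpu_secs, deadline_secs) pairs, one core."""
        busy_until = now
        for remaining, deadline in sorted(tasks, key=lambda t: t[1]):
            busy_until += remaining
            if busy_until > deadline:
                return True      # a deadline miss is predicted -> panic mode
        return False

    day = 86400
    seti_cache = [(2 * 3600, 10 * day)] * 40   # ~3.3 days of hoarded Seti work
    lhc_tasks = [(4 * 3600, 3 * day)] * 20     # short-deadline secondary tasks
    print(needs_high_priority(seti_cache + lhc_tasks))   # -> True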

If you really need to 'blame' somebody, try Seti. If they always supplied work when requested, you wouldn't need an excessive cache and BOINC would rarely need to panic.
2) Message boards : Number crunching : Credits (Message 24672)
Posted 21 Aug 2012 by Profile Gary Roberts
Post:
... Please, look at your credits, if you care, and if you find problems, discrepancies or have comments, write to this thread or to my private inbox. I will be looking at the system on wednesday (22nd of august) again.

I'm not concerned about credit but I am interested in pointing at examples where there seems to be a discrepancy with what is supposed to have happened.

If I understand you correctly, your intention was that the credit for long running tasks (perhaps ONLY those that still remain in the online database) should have been reviewed and adjusted upwards to approximately 10 times what would normally be expected for a standard 1M turn task.

For many of my hosts, 1M turn tasks take around 12-18 Ksecs and are usually awarded credit in the range of 90-180 or thereabouts. On that basis, I would expect 10M turn tasks that run the full distance to take around 120-180 Ksecs (i.e. around 40 hours) and then be awarded in excess of 1000 credits on average.

I've had a look at the tasks list for one of my hosts. I found the three oldest entries that must have been 10M tasks because the run times are in excess of 100 Ksecs. The credits for these three are 234.78, 1190.23 and 234.82. The run times are 151,176.70, 182,475.70 and 100,447.60. The WUIDs are 2503410, 2531308 and 2515332. I've recorded all this in case they get deleted shortly.

My intention is simply to report the fact that out of three long running tasks that are still in the database, only one seems to have been identified and adjusted. I'm not expecting any particular action. I just thought you might be interested to know.
3) Message boards : Number crunching : Long WU's (Message 24660)
Posted 20 Aug 2012 by Profile Gary Roberts
Post:
So it is definitely something with AP, and probably also something with v6.2.x, but probably only when in combination with AP.

I've had other commitments since I wrote my original message last Friday so I'm sorry I'm only just now able to find time to catch up with the various replies.

Particular thanks to Richard for pointing out the implications of the client reporting zero run time. I know I'm using an old client but I have particular reasons for doing that. This old client version does report both run time and CPU time for tasks running normally. It's only when AP is invoked that the run time is set to zero.

AP has now been removed from all my hosts. It took a while for the caches to drain on the last few of them as they had 100+ tasks on board at the time NNT (No New Tasks) was set last week. This is apparently a further artifact of the old client and AP combination. When running normally, the limit of 4 tasks per core is enforced. When running under AP the limit disappears and the client keeps receiving tasks whenever it makes a request.

I would like to apologise to all those who received zero credit for tasks when paired with one of my AP hosts. Now that none of my hosts is being sent the generic app any more, there is no reason for me to use AP here again. I'm sorry it took a while for me to realise the extent of the problem and that it could be worked around by getting rid of AP.

There was one aspect that puzzled me until just now. When some tasks were receiving zero credit, why didn't all of them, since all had zero run times? I just noticed that the tasks receiving zero credit all seem to be the _0 task, which in turn seems to become the canonical result in a two-task quorum. Seemingly, if my task didn't become the canonical result then all was fine.
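
If that reading is right, the server's behaviour would amount to something like the following (purely my guess at the logic from the results pages - the names and the rule are assumptions, not the project's actual validator code):

    # Hypothetical model of the observed zero-credit pattern: the quorum's
    # granted credit follows the canonical result's claim, so a canonical
    # task with zero run time drags everyone to zero. A guess, not real code.
    def grant_credit(quorum):
        """quorum: list of dicts with 'name', 'run_time' and 'claimed'."""
        canonical = quorum[0]            # the _0 task seemed to win here
        grant = 0.0 if canonical['run_time'] == 0 else canonical['claimed']
        return {task['name']: grant for task in quorum}

    quorum = [{'name': 'wu_1_0', 'run_time': 0, 'claimed': 0.0},      # AP host
              {'name': 'wu_1_1', 'run_time': 15000, 'claimed': 120.0}]
    print(grant_credit(quorum))          # both get 0.0 - the observed outcome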
4) Message boards : Number crunching : Long WU's (Message 24626)
Posted 17 Aug 2012 by Profile Gary Roberts
Post:
I also had such a monster:
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=2515279

Spending 110 hours (or more) on it and only getting 190.57 credits.
That's not fair ;(

Everybody who had one of those 10M turn tasks suffered the same fate and Eric has already accepted blame for the oversight - both for the limit on credit and for the inadequate deadline. Not much use complaining further.

But what about this particular example? It's a completed and validated quorum where one host took 180 Ksecs, the other took 303 Ksecs, and the credit award was 0.00 for both.

It's not an isolated event. Here is a small list of completed and validated quorums where zero credit was given. There are lots more like these.

2566284
2566253
2566221
2566220
2566215
2566214
2561956
2560652
2560651
2560648

The common factor is that one of the hosts participating in all those quorums is running the sse3 version of the application under the anonymous platform mechanism (AP). When the various versions were first released, there were problems with the detection of CPU capabilities and all my hosts (even though sse3 capable) were being sent the much slower generic app. I solved that problem by forcing the use of the sse3 version with AP.

At that time I didn't see any problem with credit awards. It's possible I wasn't paying close enough attention but I do believe all validated tasks were receiving normal credit. When the CPU detection was improved, I started removing AP from my hosts as caches drained. I wasn't in any particular hurry - I was making the transition when convenient. I still had quite a few machines to go when I started noticing the zero credit awards. Not every result gets zero credit. At least half or slightly more get normal credit. It seems to be a pretty random thing.

I reported it to Igor and Eric over a week ago but the behaviour continues. The caches for the last couple of AP hosts should drain today so when AP is removed on those hosts, that will be the end of the problem for me at present. The problem should be investigated so that future use of AP is not compromised.
5) Message boards : Number crunching : Long WU's (Message 24623)
Posted 17 Aug 2012 by Profile Gary Roberts
Post:
.... If I'm better off aborting the task and starting on something else let me know. I don't like wasting CPU time, even if it's a slow CPU.

Because there are already two validated tasks for that workunit, the quorum is complete and your copy is no longer needed.

That said, if you could still finish it before the deadline, it would receive credit anyway. Once the deadline passes you will receive nothing, so (from what you say about your CPU) you should abort it immediately and stop wasting time.
6) Message boards : Number crunching : Private Messages (Message 24534)
Posted 8 Aug 2012 by Profile Gary Roberts
Post:
I just tried to send a PM to Eric and Igor with a copy to myself. I like to send a copy to myself so that my inbox will always have a copy of what I've sent. If you don't include yourself, you won't have a permanent record if the recipient doesn't include your text in their reply (if any).

The system seems to have problems sending 3 copies of one message. The response I got was (the typo belongs to the system - not me :-):

You are not allowed to send privates messages so often. Please wait some time before sending more messages.

The 'problem' must be a limit on the number of recipients since this is the only PM I've tried to send recently. Maybe this restriction could be eased a bit.

EDIT: I just tried resending the message with only two recipients, Igor and myself. This time it worked.
7) Message boards : Number crunching : No Tasks ??? (Message 24459)
Posted 30 Jul 2012 by Profile Gary Roberts
Post:
If batches are taking too long to run then you might have someone take a look at any limitations that have been set for the maximum number of Tasks a client can run in any one day.

Is there such a limit? I don't think there is but I might be wrong. There is certainly a limit on tasks in progress (currently 4 per CPU core) but as soon as a finished task is returned, it seems it can be replaced without limit during the day, as long as there are tasks 'ready to send' on the server. I haven't noticed any host being told that it has reached a 'daily quota'.
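
For what it's worth, the two limits come from different server-side settings. The option names below (daily_result_quota, max_wus_in_progress) are standard BOINC scheduler options, but the surrounding logic is just my simplified sketch:

    # Simplified sketch of the two separate checks a BOINC scheduler makes.
    # Option names are real BOINC server options; the logic is illustrative.
    DAILY_RESULT_QUOTA = None     # per-core daily cap; None = not configured
    MAX_WUS_IN_PROGRESS = 4       # per-core cap on unreturned tasks

    def may_send_task(cores, sent_today, in_progress):
        if DAILY_RESULT_QUOTA and sent_today >= cores * DAILY_RESULT_QUOTA:
            return "daily quota reached"
        if in_progress >= cores * MAX_WUS_IN_PROGRESS:
            return "This computer has reached a limit on tasks in progress"
        return "OK - task sent"

    # A quad with 16 unreturned tasks is at the in-progress limit, but every
    # returned task frees a slot immediately - there is no daily counter here.
    print(may_send_task(cores=4, sent_today=50, in_progress=16))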

At one time, there were so many number crunching farms out there that as soon as a batch of tasks appeared on the server they got sucked up by these farms and people who wanted to participate but were running only one or two systems never got anything.

I think this is a misunderstanding of what happened in those days. At the time that 'farm bashing' started happening, batches of work were small and the gaps between batches were weeks and months. Nobody got much work to do. The way BOINC is designed, if requests for work are unsuccessful, the gap between requests will lengthen to the point that the client may not initiate a work request for as long as 24 hours. If a small batch of work suddenly appears, the lucky hosts will be those that just happen to ask for work at the right time. It has nothing to do with whether any particular host is part of a farm or not.

I am sure that those individuals (whether owning a single host or a farm) who wanted work desperately enough, could arrange for any particular host to bypass the BOINC backoff and so have a much higher chance of requesting work at the fortuitous moment when new tasks first appeared. I suspect enough people were doing this to deplete the batch so quickly that hosts in extended backoff didn't stand a chance.
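
The backoff mentioned above works roughly like this (a sketch of the general scheme only - the real client randomises the intervals and has other wrinkles):

    # Rough sketch of BOINC's work-fetch backoff: each fruitless request
    # roughly doubles the wait, capped at 24 hours. Illustrative only.
    MAX_BACKOFF = 86400                  # the 24 hour cap mentioned above

    def next_backoff(previous):
        return 60 if previous == 0 else min(previous * 2, MAX_BACKOFF)

    wait = 0
    for _ in range(12):                  # a dozen fruitless requests
        wait = next_backoff(wait)
    print(wait)                          # -> 86400: one request a day at worst

    # A host that bypasses the backoff asks far more often, so it is far more
    # likely to hit the lucky moment when a small batch appears.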

I have noticed in the past few days that if I try to "pump" the Project I will eventually get a message that I have gotten the maximum number of Tasks for the day.

The message actually says

This computer has reached a limit on tasks in progress.


which is fine, since it stops hosts from caching so many tasks that others would be prevented from getting their share.
8) Message boards : Number crunching : How is the SSE3 thing coming along? (Message 24393)
Posted 18 Jul 2012 by Profile Gary Roberts
Post:
... I've attached 15 hosts to the project, all of which are sse3 capable. They all received the generic executable and all were crunching very slowly. As soon as the versions appeared back in the download area - about 24 hrs ago - I grabbed copies of the sse3 versions...

....

The speedup using the sse3 version is quite impressive!

Hopefully, you will be able to come up with a reliable mechanism for detecting the CPU capabilities and the manual effort of managing AP can be dispensed with.

I wrote the above earlier in this thread. Yesterday, I attached a new host - mainly to see if there had been any change in the ability of the project to detect host capabilities. I have done this a few times but it was a couple of days since I last tried it. When I added the new host, I didn't bother using the 'attach to project' mechanism in BOINC Manager - although that would have been by far the easiest method :-).

I've been keen to support this project for quite a few years and in earlier days before the 'big drought', I had quite a few hosts (with low HostIDs) attached. When the project had very long periods of no work, these hosts were allocated solely to E@H and removed from LHC. The entries for them dropped out of sight in the LHC hosts list but are all still there.

In the intervening period, a lot of these hosts were decommissioned and/or upgraded - Coppermine and Tualatin PIIIs became Core 2 quads and Athlon XPs became Phenom II quads. Their principal task is to crunch for E@H but I'd also like to allocate some time to LHC.

It is relatively easy, if you understand the state file (client_state.xml) and you know how to edit it safely, to attach a new host to the project and have it adopt the identity of a much loved but long departed former host, rather than being allocated a new ID at the top of an already overly bloated range.

So yesterday I picked one of my old HostIDs, last active back in 2007, and a current Q8400 quad that was attached only to E@H. I created an LHC project directory on that machine, populated with the various versions of the current app, placed an LHC account file in the BOINC directory, then stopped BOINC and made the appropriate adjustments to the state file so that BOINC could use the old HostID. On restarting, I was expecting BOINC to select the generic version of the app (and that I would then have to switch to AP) but I was delighted to see it go for the SSE3 version and promptly download a number of new tasks.
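
For anyone curious, the key edit is the <hostid> element inside the relevant <project> section of client_state.xml. A minimal sketch of the idea (only ever do this with BOINC stopped and a backup made; the file path is an example and the exact elements should be checked against your own state file):

    # Minimal sketch: adopt an old HostID by editing client_state.xml.
    # Run ONLY with BOINC stopped, and keep a backup. Path is an example.
    import xml.etree.ElementTree as ET

    STATE = '/var/lib/boinc/client_state.xml'
    OLD_HOSTID = '45149'                      # the resurrected HostID

    tree = ET.parse(STATE)
    for project in tree.getroot().iter('project'):
        if 'lhcathomeclassic' in project.findtext('master_url', ''):
            hostid = project.find('hostid')
            if hostid is not None:
                hostid.text = OLD_HOSTID      # adopt the old identity
    tree.write(STATE)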

I am grateful to the Admins for fixing the CPU capability detection mechanism. Thanks for the good job done! I can now transition all my other hosts away from AP. For anyone interested, 45149 is the resurrected HostID of my latest addition to this project.

For any other people running hosts under AP because of previous improper CPU capability detection, it now seems that you can drop AP. Be careful how you do this. It's very easy to trash all tasks currently in your cache. If you don't want to do extensive editing of client_state.xml, the safest method is:

1. Set NNT and allow all tasks to be completed and returned.
2. Delete app_info.xml, then stop and restart BOINC.
3. Reset the project, which will remove the unwanted stuff from the state file.
4. Allow new tasks. The correct app and new tasks (if available) should then be downloaded.
9) Message boards : Number crunching : Possible explanation for "No Tasks sent ..." (Message 24391)
Posted 17 Jul 2012 by Profile Gary Roberts
Post:
... I can confirm we submit in batches. We have been trying to reduce the famous "tail" so that we complete a study.

Here is a suggestion for you to consider. Let's say you have a batch of 100K WUs to submit. How easy would it be to submit them in sub-batches of, say, 5K WUs at a time, with 30 minutes (or even longer) between each sub-batch? Perhaps a script could automate this - see the sketch after the next paragraph.

There always seem to be a number of hosts that trash tasks fairly quickly. By leaving the time gap, the retries for any such trashed tasks will be inserted into the queue before the second sub-batch is added. This process should continue with each successive sub-batch. By the time the 100K WUs are all added, the retries for a large fraction of the download errors and early compute errors should already be distributed through the queue and not clumped together at the end. The retries for these could have the normal deadline, since they will be going out concurrently with normal tasks. Also, they could be sent to all hosts rather than just reliable ones.
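
Something along these lines is what I have in mind, assuming the standard server-side create_work tool (the app name, file names and timings here are only placeholders):

    # Illustrative sub-batch submitter: 100K WUs in 5K chunks, 30 min apart.
    # Assumes the standard BOINC bin/create_work tool; names are placeholders.
    import subprocess, time

    SUB_BATCH = 5000
    GAP_SECS = 30 * 60

    with open('wu_input_files.txt') as f:        # one input file per line
        inputs = [line.strip() for line in f if line.strip()]

    for i in range(0, len(inputs), SUB_BATCH):
        for wu_input in inputs[i:i + SUB_BATCH]:
            subprocess.run(['bin/create_work', '--appname', 'sixtrack',
                            wu_input], check=True)
        time.sleep(GAP_SECS)   # lets retries for trashed tasks join the queue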

Much later, the 'deadline miss' retries will be needed. Maybe by using the <reliable_priority_on_over_except_error> flag (described at the link Richard provided), you could assign the higher priority (much shorter deadline??) to just these retries, rather than to all of them as is the case now? Hopefully there will be significantly fewer of these, and they will arrive spread over a period, so no one host should get a large number of them in one hit.
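
For reference, a tiny script like this could confirm which of those flags (if any) the project currently has set - the flag names are from the page Richard linked, everything else is my assumption:

    # Check which retry-priority flags are set in a project's config.xml.
    # Flag names are documented BOINC options; the script is just a sketch.
    import xml.etree.ElementTree as ET

    root = ET.parse('config.xml').getroot()      # run in the project directory
    for flag in ('reliable_priority_on_over',
                 'reliable_priority_on_over_except_error',
                 'reliable_reduced_delay_bound'):
        value = root.findtext('.//' + flag)
        print(flag, '=', value if value is not None else '(not set)')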
10) Message boards : Number crunching : Possible explanation for "No Tasks sent ..." (Message 24390)
Posted 17 Jul 2012 by Profile Gary Roberts
Post:
Although Gary has explained the general principle very clearly, I don't agree with the conclusion that the project is deliberately slowing down the distribution of resent tasks.

Richard, thanks very much for your response. As usual, your insights are very helpful and provide food for thought. I'm sorry if you thought I was drawing conclusions - I was actually asking questions. I wasn't accusing the project of deliberately slowing down resends. I tried to point out that I was making assumptions and I asked if there was some configuration flag available to promote resends higher up the queue. I understand the FIFO queue so I shouldn't have referred to it as a 'pool'.

I freely admit I have almost zero knowledge of project configuration flags but at one point Igor referred to 'matchmaker' scheduling and so I googled it and did some reading. I think I understand the basic principle of creating a cache of jobs in shared memory and that matchmaker scheduling is a mechanism for selecting tasks preferentially from within that job cache. What I was really asking was if there was any mechanism for getting tasks from the general queue preferentially into the job cache so that they would have a chance of being distributed to hosts quickly.
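
My mental model of that arrangement, for what it's worth (a toy sketch only - the real feeder is far more involved):

    # Toy model of the feeder/shared-memory scheme described above: the
    # scheduler can only hand out what the feeder has loaded into a small
    # job cache, so the order of the underlying queue decides who gets
    # work quickly. Illustrative only.
    from collections import deque

    CACHE_SLOTS = 100

    def refill(job_cache, queue):
        while len(job_cache) < CACHE_SLOTS and queue:
            job_cache.append(queue.popleft())

    queue = deque('new_task_%d' % i for i in range(1000))
    queue.append('resend_task')          # joins at the back, so it waits
    cache = []
    refill(cache, queue)                 # the resend won't be sent for a while
    print(len(queue), 'tasks still queued behind the cache')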

... this project tends to create new jobs in relatively large batches: these will be processed first.

Yes, that's my observation too, but the insertion of the new jobs does take some time - the RTS on the server status page takes a while to reach the maximum. I can't say I've ever actually timed it though.

... It looks as if the project has the Accelerating retries configuration in effect - whether deliberately, or as a side effect of the other server configuration issues, I'll leave to them.

Yes, but the opening sentence mentions "... send timeout-generated retries ..." which perhaps is an oversight but seems to be suggesting it doesn't apply to error-generated retries. Reading on further then suggests that it can apply to all retries depending on which of the two flags <reliable_priority_on_over> or <reliable_priority_on_over_except_error> (if any) has been set non-zero. I can understand the urgency for a job that has timed out. Perhaps there's less need for urgency for a retry resulting from a download error??

1) resent tasks are only allocated to 'reliable' hosts. Quite a few machines may have lost that status because of the project's server problems, and they may not get any work until a new batch of first-run tasks is available.

The great majority of my hosts had many download errors and should have been 'unreliable'. Several that I've looked at have retries - a couple have large numbers of them. I guess it only takes a relatively few successful tasks to restore the 'reliable' status.

2) resent tasks are given a much shorter deadline. I'm seeing deadlines of half the already-tight 6 days 15.5 hours. If users are participating in other BOINC projects, and have set even a modest cache size, the server may judge that tasks won't be turned round in the 3-day 'resend' deadline interval, and decline to send tasks to that host.

I'm seeing the 3 days 7+ hours deadline on all resends but some of my hosts don't seem to have much trouble getting them. They are not being excluded by the server even though they have a 3 day cache.

One such host has been given around 250 tasks, another was given 450 tasks and a third was given 200. The host in most trouble is the first mentioned because there were around 130-150 resends with the very short deadline in the mix. The other two have a lot less and so have some time to attempt recovery.

There is no way the host in trouble can finish its cache in the allowed time so I've had to abort around 60 so far with probably more to follow. I think this host was over-supplied because the time estimate of a bunch of tasks was way too short (35 mins instead of close to 4 hrs). As soon as the first of these were finished, the estimate was adjusted and the host went into panic mode - where it still is despite the aborts. I can now let it run for a bit using 4 cores (E@H has been put on hold) so it might be able to work through the remaining tasks. There is a bit of time before I'll need to assess this again.

I've aborted more than 20 tasks on the host that had 200. I've not aborted any yet on the host with 450 as it has already worked off the resends and the next deadline is a few days away. I'll wait another 24 hours to see if it happens to get a bunch of shorties but some sort of cull seems inevitable. On this host, I actually saw a whole bunch of tasks with a 30-40 min estimate change to around 4 hours when I first noticed the huge cache and started to investigate. I assume it was much the same deal on the other two. There are other hosts with around 100+ tasks in their caches but they seem to be coping at the moment.

I agree completely that it's quite unfair to give hosts that support multiple projects, resends with such a short deadline. I don't know why the server isn't excluding my hosts which already have a 3 day E@H cache on board. I don't seem to be suffering the fate that I should be :-).
11) Message boards : Number crunching : Possible explanation for "No Tasks sent ..." (Message 24368)
Posted 15 Jul 2012 by Profile Gary Roberts
Post:
From time to time the server status page indicates plenty of tasks (often thousands) but repeated scheduler requests fail to get any. I see this quite a lot and others have commented about it. This particular message is a classic recent example.

I think I understand what is happening and I'm posting this in a separate thread because there are (if I'm correct) consequences that should be addressed.

When a bunch of new tasks are inserted into the DB, the 'ready to send' (RTS) number will start increasing and those hosts 'knocking on the door' will start receiving 'primary' tasks - those with the _0 or _1 extension on their name. The RTS will be fluctuating depending on the rate of creation versus the rate of distribution.

At one end of the spectrum, there will always be a proportion of hosts that fairly promptly trash new tasks and at the other end, there will be other hosts that completely fail to return at any stage. So there will be a steady ongoing flow of 'resend' tasks - those with an extension of _2 or greater - being added to the RTS pool. I'm assuming that a resend task is added to the pool as soon as the server becomes aware of the primary task failure. If you 'drill down' into WUs, you will find examples of such tasks labelled as 'unsent'.

I'm guessing that they show on the server status page as RTS. However they remain as 'unsent' for what seems like an unnecessarily long period.

I've just taken a look at the tasks list for some of my hosts. Using the 'show names' link I can find many very recent 'resend' tasks sent to a number of hosts. In every case I looked at, there was a considerable gap between the time of failure of the primary task and the time when the resend task was actually sent out. The shortest gap I saw was around 10 hours but most were in the range 12-20 hours. The server seems to want to 'sit on them' for a while.

I'm guessing that this is why you can see RTS tasks but you can't get them immediately.

Resend tasks have a shorter deadline - the project wants the results back ASAP and people understand this. So why can't they be distributed immediately? Is there some project config flag that controls this?

If you need a shorter deadline for resends, delaying their distribution seems counter-productive.

12) Message boards : Number crunching : Damn wingman (Message 24341)
Posted 13 Jul 2012 by Profile Gary Roberts
Post:
.... Just another bug that the project needs to work out, I guess.

No, it was a missing project config flag as reported by Richard a couple of days ago. He didn't get a response (he tried to get Eric's attention a couple of times) but it seems to have been attended to as I haven't seen any more recent examples. Yours are dated back then as well.

I guess the Admins were too embarrassed to admit they goofed :-).

13) Message boards : News : Server/Executable problems (Message 24320)
Posted 12 Jul 2012 by Profile Gary Roberts
Post:
Am I really the only one with this problem?

I don't know but in the last 24 hrs I haven't noticed any uploading problems on any of my hosts. Also, there don't seem to be others with the same problem. I would think this makes it rather likely that the problem is somewhere at your end.

In another thread you mentioned that your firewall was complaining about the download of the executable. Is it possible that it's also messing with uploads? Have you tried completely disabling it temporarily to see if it makes any difference?

Another thing to try (if your computer is reasonably portable) is a friend or colleague's internet connection to see if that gets things moving.

Just some random thoughts ...
14) Message boards : Number crunching : How is the SSE3 thing coming along? (Message 24299)
Posted 12 Jul 2012 by Profile Gary Roberts
Post:
... You shouldn't have to (do) that of course ...

Sure enough, but if the project doesn't send anything but the generic executable, there is little other option :-).

In my case I've attached 15 hosts to the project, all of which are sse3 capable. They all received the generic executable and all were crunching very slowly. As soon as the versions appeared back in the download area - about 24 hrs ago - I grabbed copies of the sse3 versions. I have mostly linux hosts with a couple of Windows XP ones. Most of the hosts are quads - mainly Intel, with a couple of Phenom IIs.

To initiate AP processing, you need to construct a suitable app_info.xml file and place it with the new executable in the project directory. BOINC needs to be restarted to force it to 'see' app_info.xml and if you haven't constructed that file correctly, you will very likely trash your entire cache of tasks at that point. For that reason, people often set NNT and complete the existing cache before switching to AP mode.
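
A skeleton of the sort of file needed looks like this (the app name, file name and version number are placeholders from memory - they must match the executable you actually downloaded):

    # Skeleton app_info.xml for anonymous platform (AP) processing. The
    # names and version number are placeholders - match your own download.
    APP_INFO = """<app_info>
        <app>
            <name>sixtrack</name>
        </app>
        <file_info>
            <name>sixtrack_lin64_sse3</name>
            <executable/>
        </file_info>
        <app_version>
            <app_name>sixtrack</app_name>
            <version_num>530</version_num>
            <file_ref>
                <file_name>sixtrack_lin64_sse3</file_name>
                <main_program/>
            </file_ref>
        </app_version>
    </app_info>
    """
    with open('app_info.xml', 'w') as f:   # place in the LHC project directory
        f.write(APP_INFO)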

I was in a hurry and I made the assumption that any version of the app would be able to read the last checkpoint saved by a different version and I also assumed (since you had said that all versions give numerically identical answers) that a part answer from one version, then completed with a different version, would hopefully give a final result that would still validate.

So all my sse3 capable hosts were transitioned from the generic version as quickly as possible. All had caches with many 'previous version' tasks in them. The _sse3 version has been running for at least the last 12 hours and all the 'two version' results I've looked at so far seem to be validating quite OK. The speedup using the sse3 version is quite impressive!

Hopefully, you will be able to come up with a reliable mechanism for detecting the CPU capabilities and the manual effort of managing AP can be dispensed with.

15) Message boards : Number crunching : Failed to download (Message 24249)
Posted 11 Jul 2012 by Profile Gary Roberts
Post:
There's not much point getting angry and blaming the servers. 'Fixing' a server will not allow it to transfer a non-existent file.

When your client requests work, it is handed the URL(s) for any data file(s) associated with any task(s) allocated to your host. Having received such a URL, your client attempts to initiate a download of the file at the specified location. The problem is that the file simply doesn't exist at the specified location. It's not a problem of an overloaded or unavailable server. If it were, it would be classified as a temporary failure and BOINC would keep retrying until the transfer succeeded.

In the current situation, the non-existent file(s) leads to an immediate permanent failure. BOINC is 'smart enough' (?) to give up completely if a file doesn't exist at all.

This is having an unfortunate consequence. There is a limit of 10 tasks per quorum and in a lot of cases, some days ago, hosts were able to download the data before these files became 'lost'. So a lot of people did invest time and effort in crunching such tasks before the problem arose. In quite a few cases, extra tasks were needed (to resolve incompatible answers or replace compute failures, etc).

Now that the data files no longer exist, the permanent download failures will eventually push each quorum to the limit of 10, at which point any completed 'pending' tasks will no longer be usable and their pending status will change to 'can't validate'. If you have any such tasks in your list, drill down through the WUID link on the website and you will see all the download failures making up a full list of 10 instances of that task. The whole thing fails when the system tries to generate the 11th copy of the task and then discovers that the limit is exceeded.
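
As a toy illustration of that arithmetic (the limit of 10 is as described above; the code is just my sketch of the effect, not the server's actual logic):

    # Toy illustration: permanent download errors exhaust a quorum's limit
    # of 10 instances, stranding the completed 'pending' tasks.
    MAX_INSTANCES = 10

    def workunit_state(instances):
        """instances: list of 'pending', 'valid' or 'download_error'."""
        if (len(instances) >= MAX_INSTANCES and 'valid' not in instances
                and 'download_error' in instances):
            return "can't validate"      # an 11th copy can't be created
        return 'in progress'

    # Two results crunched before the files vanished, then permanent
    # download failures fill the remaining 8 slots:
    wu = ['pending', 'pending'] + ['download_error'] * 8
    print(workunit_state(wu))            # -> can't validate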

Eric is certainly trying to rectify things - see his post in the News forum late last night. He said there that he will reassess the situation first thing this morning so we just need to be patient until he has time to do that.
16) Message boards : Number crunching : Wot no jobs? (Message 23895)
Posted 21 Feb 2012 by Profile Gary Roberts
Post:
Odd thing is, that the server status page shows over 1,000 tasks available, but the client is getting 'no tasks available' errors.

It's not odd at all. There are two entries about this on the status page - 'Tasks ready to send' and 'Tasks in progress'. So, when you looked, the status page would have said there were over 1,000 tasks in progress but none ready to send. The 'in progress' figure simply accounts for those tasks which have been sent out to hosts but have not yet been returned.

I just detached and re-attached on the new URL (or what I think is the new URL). It's not giving the 'update the project URL' error anymore though, just 'no tasks available'.

Which is what you would expect if there are zero tasks 'ready to send'. Detaching and re-attaching, or resetting a project in order to get non-existent tasks is rather a futile exercise.

When there are tasks in progress, but no tasks ready to send, there is always a small chance you may get lucky and be sent a task or two in response to a work request. It won't be a primary task (there are none) but it could be a 'resend' task. If the server has just been notified that 'in progress' task(s) have failed, or have exceeded the deadline, it will create resend tasks to replace them. These don't show up in the 'ready to send' figure. They are just immediately allocated to the first lucky host(s) that happen to come along asking for work. This is why you will see people saying they just got work when others can't get any.

Also, with this particular project, there seem to be random occasions when very small numbers of new tasks are added to the pool. Because the status page is not updated continuously, it's possible that the new tasks could all be issued before the next status page update takes place and so they wouldn't have ever been recorded as 'ready to send' on the status page.

I'm guessing there's just none for Windows 7 64-bit? Or maybe it doesn't want to do the update with a LHC 2 task in progress. Not sure, hmm...

None of these guesses are correct. If your host is eligible for work, BOINC will request it. If there is work available, the server will send it. If you are not sure if your host is eligible for work, issue a manual 'update' and see what happens.
17) Message boards : Number crunching : Faulty Computers or Modified BOINC ?? Huge Credits (Message 23739)
Posted 23 Nov 2011 by Profile Gary Roberts
Post:
What's just as interesting is that the two tasks were both awarded the full credit value rather than being awarded the average. Surely that's a bug??

It's a bit concerning that there doesn't seem to be any interest or reaction from HQ to the more massive problem caused by the Bartonn rogue results. The site stats are now so screwed that it may have an adverse effect on participant satisfaction, let alone provide 'encouragement' to those with malicious intent. I would have thought that some action to reverse the rogue credits could well pay off in discouraging 'copycat' offenders.
18) Message boards : Number crunching : Faulty Computers or Modified BOINC ?? Huge Credits (Message 23731)
Posted 21 Nov 2011 by Profile Gary Roberts
Post:
Bartonn has several hosts attached to the project and these are not hidden. If he was a master cheater, his hosts would be hidden, he would have chosen a rather less conspicuous amount to cheat by and he would have spread this over all the hosts he has. The host that had the glitch is a quad core and exactly 4 tasks showed the glitch. It's possible that those 4 tasks were all running and were affected when the glitch occurred.

The host has received more tasks since the original 4 were returned and now the CPU time seems to be normal. Looks pretty much like it was just one of those unfortunate and unexplained events. It may well be that Bartonn doesn't even know the problem has occurred.

We can now identify the 4 lucky recipients of this unintended largesse - MrOctane, wdsmia, Inargus and Galeon 7. Just look at the top 5 names based on RAC. Two of them are on top based on total credit as well. Looks like there needs to be a sanity check on credit claims to prevent this sort of thing happening again.
19) Message boards : Number crunching : Faulty Computers or Modified BOINC ?? Huge Credits (Message 23721)
Posted 20 Nov 2011 by Profile Gary Roberts
Post:
Two users computers have come in with amazing results in the last few days.

MrOctane and Bartonn

Always better to give a link so others can easily follow. This is Bartonn's task list and something is screwing up the CPU time whilst the run time looks OK. As a result of the crazy claims made by this host, there would now be three wingmen who have gotten a rather unexpected and exceedingly large bonus, with more to come as the host completes more tasks.

I'm guessing you saw the two names you mentioned by looking at those two right up the top of the list for the top performers based on RAC. There are only those two at the moment but there will be two more lucky lottery winners shortly when the newly completed mega-credits are taken into account.

Perhaps there's no malice involved. I wonder if it could be something to do with a date change while tasks were in progress??
20) Message boards : Number crunching : no more work? (Message 23627)
Posted 2 Nov 2011 by Profile Gary Roberts
Post:
... This is what the scientists want, which would not be a problem if the mechanism functioned correctly....

When the Devs are looking at whether or not things are functioning correctly, perhaps they might like to consider examples like this quorum, from which a number of interesting observations can be made. At first glance it looks quite straightforward - two deadline misses, one of which was completed 3 hours after the deadline, so one of the two resends that were created wasn't actually needed.

But take a closer look. The two deadline misses were created around 6:12 UTC on 27 October. At this point, two 'unsent' resend tasks would have been added to the quorum. Three hours later at 9:11 UTC, one of the deadline misses was then returned. At that point one of the 'unsent' resends should have been cancelled with a status of 'didn't need'. I've actually seen this happen at Einstein so I know BOINC can do this.

Those two 'unsents' sat around for 4 days and were finally sent out on 31 October. It was nearly a day later on 1 November before the extra task was actually cancelled with a status of 'redundant result'. So why did it take so long for the system to realise that only one resend would actually be needed? It should have been able to make that decision back on 27 October at 9:11 UTC. A BOINC bug, I guess.

On the more general question of how to achieve the more rapid finalisation of batches of work, I think it would be beneficial to 'educate' the volunteers about how best to contribute if they really want to. Just take a look at the quorum I linked to above and drill down to the details page for the two hosts that received the primary tasks. Those hosts have turnaround times of 4.37 and 6.65 days respectively, so both are rather unsuited to the needs of the scientists.

If you can put a 1.5 day turnaround limit on resends, why not put, say, a 4 day limit on primary tasks? If people insist on having 7 day work caches then the project should decline their offer to do work under such unsuitable conditions. There could be a big, obvious notice about the limitation on the home page, and the message in BOINC Manager could say: "No work sent - your turnaround time is longer than 4 days. To get work, please reduce your work cache size and detach/reattach." People could then decide if they really wanted to support the project or not.
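
Spelled out, the gate I have in mind would behave like this (my proposal only - as far as I know there is no stock server option that does exactly this for primary tasks):

    # Sketch of the proposed policy: refuse primary tasks to hosts whose
    # average turnaround exceeds a cutoff. A proposal, not an existing
    # BOINC option; the message text is from the suggestion above.
    PRIMARY_TURNAROUND_LIMIT = 4.0       # days

    def may_send_primary(avg_turnaround_days):
        if avg_turnaround_days > PRIMARY_TURNAROUND_LIMIT:
            return ("No work sent - your turnaround time is longer than "
                    "4 days. To get work, please reduce your work cache size.")
        return "OK"

    for turnaround in (4.37, 6.65, 1.5):     # the two hosts above, plus one
        print(may_send_primary(turnaround))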

The actual primary deadline puzzles me. It's often quoted as 7 days but the deadline miss task in the above quorum gives a precise figure of 6 days 15 hours 32 minutes and 14 seconds (or 574334 seconds). Why such odd-ball numbers? If the scientists really need to finalise a run quickly, why not reduce the deadline a little (to, say, 5 days) to prevent the last primary tasks issued from potentially hanging around for up to 7 days before any 'deadline miss' resends for them could be sent out? Sure, some people will complain, but probably will get used to it when they understand the needs of the project.

