Message boards : Number crunching : Damn wingman

angler

Joined: 25 Nov 06
Posts: 25
Credit: 2,821,874
RAC: 34
Message 24297 - Posted: 12 Jul 2012, 10:54:59 UTC

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=1857346

I was checking whether my wingman had completed this result and was surprised to find that the same machine had been assigned the validation task.

at least the results seemed consistent :D
ID: 24297
White Mountain Wes
Joined: 1 Jan 09
Posts: 32
Credit: 891,295
RAC: 456
Message 24335 - Posted: 13 Jul 2012, 20:21:57 UTC - in response to Message 24297.  

I just noticed that I am my own wingman on this pair of tasks: http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=1856175. This doesn't seem scientifically legitimate to me, and I have noticed that it has happened to others too. Are the administrators aware that this is happening, and are they OK with it?
ID: 24335
angler
Joined: 25 Nov 06
Posts: 25
Credit: 2,821,874
RAC: 34
Message 24339 - Posted: 13 Jul 2012, 22:02:28 UTC - in response to Message 24335.  

I guess in your case you can abort the 2nd instance and someone else should get it. Believe a few others have encountered the situation recently as well.
ID: 24339
White Mountain Wes
Joined: 1 Jan 09
Posts: 32
Credit: 891,295
RAC: 456
Message 24340 - Posted: 13 Jul 2012, 22:13:57 UTC - in response to Message 24339.  

I guess in your case you can abort the 2nd instance and someone else should get it. Believe a few others have encountered the situation recently as well.

I am planning to do that if I don't hear anything to the contrary from someone higher up. It just seems to me that there should be something in place to prevent this from happening in the first place. Just another bug that the project needs to work out, I guess.
ID: 24340
Gary Roberts
Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 24341 - Posted: 13 Jul 2012, 23:00:23 UTC - in response to Message 24340.  

.... Just another bug that the project needs work out I guess.

No, it was a missing project config flag as reported by Richard a couple of days ago. He didn't get a response (he tried to get Eric's attention a couple of times) but it seems to have been attended to as I haven't seen any more recent examples. Yours are dated back then as well.

I guess the Admins were too embarrassed to admit they goofed :-).


Cheers,
Gary.
ID: 24341
White Mountain Wes
Joined: 1 Jan 09
Posts: 32
Credit: 891,295
RAC: 456
Message 24342 - Posted: 13 Jul 2012, 23:33:04 UTC - in response to Message 24341.  

Ahhh, that explains it. Thanks for the feedback. I'll be aborting that 2nd task ASAP.
ID: 24342
Igor Zacharov
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 16 May 11
Posts: 79
Credit: 111,419
RAC: 0
Message 24343 - Posted: 14 Jul 2012, 0:16:16 UTC - in response to Message 24342.  

The mechanism for redundancy is there. We always had the flag



in the config.xml file.

It seems that we get duplicates sent to the same user/machine when
we also enable the matchmaker scheduler. It shouldn't happen in my opinion,
but this is the fact.

We switched on the matchmaker scheduler because, with cache-job only,
it is picky about the hosts it sends work to. We don't need
homogeneous redundancy; in fact, we want to compare all possible combinations
of computers to study the reproducibility of the sixtrack program. But they
indeed should all be different computers.

Anyway, in short, there is more experimentation to do with the job distribution
algorithm we use for LHC@HOME. I will switch the matchmaker off now. Let's see if we get any duplicates sent to the same user during Saturday/Sunday.

Igor.
skype id: igor-zacharov
ID: 24343
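
For readers following along: the rule Igor describes (every replica of a workunit should land on a different computer and a different user) can be illustrated with a small sketch. This is not the BOINC scheduler's code and does not show the project's actual config flag. `Workunit`, `can_assign`, `assign`, and the host and user IDs are hypothetical names made up for illustration only.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Workunit:
    """Hypothetical record of which hosts/users already hold a replica."""
    wuid: int
    hosts: Set[int] = field(default_factory=set)
    users: Set[int] = field(default_factory=set)


def can_assign(wu: Workunit, host_id: int, user_id: int) -> bool:
    """A replica may only go to a host and a user that hold no other replica."""
    return host_id not in wu.hosts and user_id not in wu.users


def assign(wu: Workunit, host_id: int, user_id: int) -> bool:
    """Record the assignment if allowed; return whether it was made."""
    if not can_assign(wu, host_id, user_id):
        return False
    wu.hosts.add(host_id)
    wu.users.add(user_id)
    return True


if __name__ == "__main__":
    wu = Workunit(wuid=1857346)      # workunit ID taken from the thread
    print(assign(wu, host_id=10, user_id=1))   # first replica: accepted
    print(assign(wu, host_id=10, user_id=1))   # same machine again: refused
    print(assign(wu, host_id=11, user_id=2))   # a real wingman: accepted
```

With a check of this kind in the assignment path, a host can never end up validating its own result, whichever scheduling mode picks the candidate hosts.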
rbpeake
Joined: 17 Sep 04
Posts: 75
Credit: 23,851,753
RAC: 7,244
Message 24348 - Posted: 14 Jul 2012, 0:49:54 UTC - in response to Message 24343.  

The mechanism for redundancy is there. ...We don't need
homogeneous redundancy; in fact, we want to compare all possible combinations
of computers to study the reproducibility of the sixtrack program....

Igor.

Is the goal to eliminate the need for redundancy in the future, to allow faster throughput of results (no repeats)?
Regards,
Bob P.
ID: 24348
Igor Zacharov
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 16 May 11
Posts: 79
Credit: 111,419
RAC: 0
Message 24351 - Posted: 14 Jul 2012, 1:06:19 UTC - in response to Message 24348.  


Is the goal to eliminate the need for redundancy in the future, to allow faster throughput of results (no repeats)?


The redundancy will be needed in the future also, since we cannot exclude faulty devices (not talking about cheating). In particular, one of the side effects of a large scale accelerator study is singling out hosts that produce wrong results.

We have seen an indication of that in the past, when getting results from overclockers, but did not do a systematic study of these effects. With the latest executable this is within reach.

Just one explanation. The sixtrack program is used for accelerator studies. It needs bit-accurate reproducibility to frame the aperture accurately. If the results have artificial scatter, we cannot zoom in to the desired phase space boundary, even by collecting an order of magnitude more statistics. Eric McIntosh should give more explanation on this and on the impact bit-reproducibility will have on the science.
skype id: igor-zacharov
ID: 24351
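
To make the "bit-accurate" requirement concrete, here is a minimal sketch of the difference between a tolerance check and a bit-for-bit check on floating-point results. It is not the SixTrack validator. The helper names `bits` and `bit_identical` are made up for illustration, and the numbers are a toy example rather than tracking output.

```python
import math
import struct
from typing import Sequence


def bits(x: float) -> str:
    """Hex string of the 64-bit IEEE 754 encoding of x."""
    return struct.pack(">d", x).hex()


def bit_identical(a: Sequence[float], b: Sequence[float]) -> bool:
    """True only if both result lists match in every bit of every value."""
    return len(a) == len(b) and all(bits(x) == bits(y) for x, y in zip(a, b))


if __name__ == "__main__":
    # The same three numbers summed in a different order.
    host_a = [0.1 + 0.2 + 0.3]
    host_b = [0.3 + 0.2 + 0.1]

    # A tolerance check accepts the pair...
    print(math.isclose(host_a[0], host_b[0]))
    # ...while a bit-for-bit check shows the two hosts disagree.
    print(bit_identical(host_a, host_b))
    print(bits(host_a[0]), bits(host_b[0]))
```

A tolerance-based comparison would let this kind of last-bit scatter through. A bit-for-bit comparison flags it, which is the property Igor says is needed to single out hosts that produce wrong results.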
Tex1954

Joined: 24 Apr 11
Posts: 37
Credit: 1,105,291
RAC: 784
Message 24361 - Posted: 14 Jul 2012, 14:39:51 UTC - in response to Message 24351.  

I can vouch for Overclockers having problems. Most of the problems I catch, but for a day or so, two boxes had problems.

One box was easily corrected with a voltage tweak that was neglected on a major BIOS update.

The other box had a bad motherboard and it was replaced.

Both these boxes would run Prime95 all day/night no problem but fail other tests and certain BOINC jobs.

I have since proven the corrections stable as before and all is okay now...

So, yes, overclockers can from time to time generate results that look good but are wrong, and one has to be painfully careful with that.

Even machines set up pure stock with BIOS defaults can fail if the BIOS doesn't set things up properly... and I've experienced that as well.

Also, it would be helpful if the server would send a NOTICE to offending machines to wake folks up, should this prove to be a problem in the future.

However, usually BOINC will start doing random stupid things when there is a problem... like shutting down unexpectedly without error... GPU drivers suddenly working more slowly... all kinds of hints that something isn't perfect.

:)
ID: 24361


©2019 CERN