Damn wingman
Author | Message |
---|---|
Joined: 25 Nov 06 Posts: 25 Credit: 4,686,113 RAC: 0 |
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=1857346 I was looking to see if my wingman had completed this result, and was sort of surprised to find out the same machine had been assigned the validation. At least the results seemed consistent :D |
Joined: 1 Jan 09 Posts: 32 Credit: 1,106,567 RAC: 1 |
I just noticed that I am my own wingman on this pair of tasks: http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=1856175. This doesn't seem scientifically legitimate to me, and I have noticed that this has happened to others too. Are the administrators aware that this is happening, and are they OK with it? |
Joined: 25 Nov 06 Posts: 25 Credit: 4,686,113 RAC: 0 |
I guess in your case you can abort the 2nd instance and someone else should get it. Believe a few others have encountered the situation recently as well. |
Joined: 1 Jan 09 Posts: 32 Credit: 1,106,567 RAC: 1 |
"I guess in your case you can abort the 2nd instance and someone else should get it. Believe a few others have encountered the situation recently as well." I am planning to do that if I don't hear anything to the contrary from someone higher up. It just seems to me that there should be something in place to prevent this from happening in the first place. Just another bug that the project needs to work out, I guess. |
Joined: 22 Jul 05 Posts: 72 Credit: 3,962,626 RAC: 0 |
".... Just another bug that the project needs work out I guess." No, it was a missing project config flag, as reported by Richard a couple of days ago. He didn't get a response (he tried to get Eric's attention a couple of times), but it seems to have been attended to, as I haven't seen any more recent examples. Yours are dated back then as well. I guess the Admins were too embarrassed to admit they goofed :-). Cheers, Gary. |
Joined: 1 Jan 09 Posts: 32 Credit: 1,106,567 RAC: 1 |
Ahhh, that explains it. Thanks for the feedback. I'll be aborting that 2nd task ASAP. |
Joined: 16 May 11 Posts: 79 Credit: 111,419 RAC: 0 |
The mechanism for redundancy is there. We always had the flag in the config.xml file. It seems that we get the duplicates sent to the same user/machine when we also enable the matchmaker scheduler. It shouldn't happen in my opinion, but this is the fact. We have switched on the matchmaker scheduler because with cache-job only it is picky about the hosts it would send work to. We don't need the homogeneous redundancy; in fact, we want to compare all possible combinations of computers to study the reproducibility of the sixtrack program. But they indeed should all be different computers. Anyway, in short, there is more experimentation to do with the job distribution algorithm we use for LHC@HOME. I will switch the matchmaker off now. Let's see if we get any duplicates sent to the same user during Saturday/Sunday. Igor. skype id: igor-zacharov |
Joined: 17 Sep 04 Posts: 99 Credit: 30,702,399 RAC: 5,928 |
"The mechanism for redundancy is there. ...We don't need the" Is the goal to eliminate the need for redundancy in the future, to allow faster throughput of results (no repeats)? Regards, Bob P. |
Joined: 16 May 11 Posts: 79 Credit: 111,419 RAC: 0 |
The redundancy will be needed in the future also, since we cannot exclude faulty devices (not talking about cheating). In particular, one of the side effects of a large scale accelerator study is singling out hosts that produce wrong results. We have seen an indication of that in the past, when getting results from overclockers, but did not do a systematic study of these effects. With the latest executable this is within reach. Just one explanation. The sixtrack program is for the accelerator study. It needs bit-accurate reproducibility to accurately frame out the aperture. If the results have artificial scatter we cannot zoom to the desired phase space boundary, even collecting an order of magnitude more statistics. Eric McIntosh should give more explanation on this and the impact bit-reproducibility will have on the science. skype id: igor-zacharov |
Joined: 24 Apr 11 Posts: 37 Credit: 1,295,012 RAC: 0 |
I can vouch for overclockers having problems. Most of the problems I catch, but for a day or so, two boxes had problems. One box was easily corrected with a voltage tweak that was neglected on a major BIOS update. The other box had a bad motherboard and it was replaced. Both these boxes would run Prime95 all day/night with no problem but fail other tests and certain BOINC jobs. I have since proven the corrections stable as before and all is okay now... Soo, yes, overclockers can generate good-looking results that are wrong from time to time, and one has to be painfully careful with that. Even machines set up pure stock with BIOS defaults can fail if the BIOS doesn't set things up properly... and I've experienced that as well. Also, it would be helpful if the server would send a NOTICE to offending machines to wake folks up, should this prove to be a problem in the future. However, usually BOINC will start doing random stupid things when there is a problem... like shutting down unexpectedly without error... GPU drivers suddenly working more slowly... all kinds of hints that something isn't perfect. :) |
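
For readers who want to see the two requirements from this thread side by side (every copy of a workunit should come from a different computer, and the copies should agree bit for bit), here is a minimal Python sketch. It is not the project's validator and does not use any BOINC API. The `Result` record, the `check_workunits` function, and the workunit and host IDs are all hypothetical, made up for illustration only.

```python
import struct
from collections import defaultdict
from typing import NamedTuple


class Result(NamedTuple):
    # Hypothetical record; real server-side results carry far more fields.
    workunit_id: int
    host_id: int
    values: tuple  # floating-point numbers returned by the task


def float_bits(x: float) -> bytes:
    """Return the exact 8-byte IEEE 754 double encoding of x."""
    return struct.pack(">d", x)


def bit_identical(a, b) -> bool:
    """True only if both sequences match bit for bit, not just approximately."""
    return len(a) == len(b) and all(
        float_bits(x) == float_bits(y) for x, y in zip(a, b)
    )


def check_workunits(results):
    """Label each workunit 'pending', 'same_host', 'mismatch', or 'valid'."""
    by_wu = defaultdict(list)
    for r in results:
        by_wu[r.workunit_id].append(r)

    report = {}
    for wuid, rs in by_wu.items():
        hosts = [r.host_id for r in rs]
        if len(rs) < 2:
            report[wuid] = "pending"      # nothing to compare against yet
        elif len(set(hosts)) < len(hosts):
            report[wuid] = "same_host"    # a host is acting as its own wingman
        elif all(bit_identical(rs[0].values, r.values) for r in rs[1:]):
            report[wuid] = "valid"
        else:
            report[wuid] = "mismatch"
    return report


# Made-up example data (IDs are illustrative, not real hosts or workunits).
results = [
    Result(1, 101, (0.1 + 0.2, 1.0)),
    Result(1, 101, (0.1 + 0.2, 1.0)),  # same host returned both copies
    Result(2, 101, (0.1 + 0.2,)),
    Result(2, 202, (0.3,)),            # close, but not the same bits
    Result(3, 101, (1.5,)),
    Result(3, 202, (1.5,)),
]
report = check_workunits(results)
```

Note that workunit 1 is flagged even though its two copies agree: agreement between two runs on the same machine says little about reproducibility across machines, which is the point raised above. Workunit 2 shows why an approximate comparison is not enough when bit-accurate results are required, since `0.1 + 0.2` and `0.3` differ in their last bits.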
©2024 CERN