Host messing up tons of results
Author | Message |
---|---|
Send message Joined: 9 Oct 10 Posts: 77 Credit: 3,671,357 RAC: 0 |
I don't know if it's related or not, but I just got a computation error on http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=30223344 ... this is the first time I have seen such a thing on this project/computer. Eric> you said that you suspect that the issue is with modern hardware, but the host that is messing up a lot of results is already getting a bit old (it's a plain i7 2600 with no overclocking capabilities, but it could be overheating, or have faulty memory sticks). I know that's quite rare, but it might also be a faulty CPU ... does anyone know this host on other BOINC projects? Do you plan to run your test with the "sixtracktest" application? If yes, I think I'm already opted in to this application and ready to test :) |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 201,262,385 RAC: 28,231 |
|
Send message Joined: 3 Jul 08 Posts: 20 Credit: 8,281,604 RAC: 0 |
Ah well; I have run this case at CERN. The results are genuine: Thanks for your reply. I have checked your long host list and found a more interesting example on one of my computers. I have only one invalid WU: 29657154. My computer (i7-3930K/HT) was invalid after 22069 seconds. A Phenom 1090T was valid after 17366 seconds together with Dmitry, who needed only 0.22 seconds. Stunning. I am also going to appeal for more volunteers to sign up for testing. Do you mean the checkmark in the preferences next to "Run test applications?" (most of my hosts are ready) or another project like vLHC? Bye, Grubix. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Well, this is pretty strange, but I believe we are somehow getting back empty/zero result files... another issue. This doesn't affect the study as these "valid" results are ignored... I shall look more closely soonest. Eric. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Update 31st March. I have finally accessed the database. (I append some more numbers, a shorter list.) I need to concentrate now on the wrong Valid results. Many of the invalids are "normal", aborted by user, no disk space, etc etc. I need a more sophisticated analysis where I look at the actual reason for Invalidation. I could go to running 3 copies and insist on 3 identical; maybe when the workload is lower. Eric.
Tough problem; study independent, could be software of course, but seems to be OS independent, we use only ifort in production, but maybe only ssse2/sse3/pni having errors, "older" machines seem OK. Still, progress at last.
Results based on database 31st March, 2015 but only looking at hosts with previous invalids in my log.
HostID      Total  Invalid   Valid  Don't know
Totals     153287     2504  134690       16093
9996388     45057      870   44185           2
10334649      291      210      48          33
10353795       75       56       5          14
10338771       55       47       7           1
10151727      170       43      78          49
10352446       43       42       0           1
10352445       40       37       2           1
10348627      507       30     412          65
10332524       65       30      22          13
10325398       37       30       4           3
8424458       234       29     156          49
10332528       59       29      15          15
10282106      176       24     123          29
10340508       42       21      20           1
etc etc etc
10311622       53       20       0          33 |
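Editorial sketch, not part of Eric's analysis: the per-host figures in the post above are easier to compare as a fraction of results with a known outcome than as raw invalid counts. The function names `invalid_fraction` and `rank_hosts` are hypothetical, and ignoring the "Don't know" column in the denominator is an assumption made here, not something the post specifies.

```python
# Figures copied from the 31 March 2015 post:
# (host_id, total, invalid, valid, dont_know)
ROWS = [
    (9996388, 45057, 870, 44185, 2),
    (10334649, 291, 210, 48, 33),
    (10353795, 75, 56, 5, 14),
    (10338771, 55, 47, 7, 1),
    (10151727, 170, 43, 78, 49),
    (10352446, 43, 42, 0, 1),
    (10352445, 40, 37, 2, 1),
    (10348627, 507, 30, 412, 65),
    (10332524, 65, 30, 22, 13),
    (10325398, 37, 30, 4, 3),
    (8424458, 234, 29, 156, 49),
    (10332528, 59, 29, 15, 15),
    (10282106, 176, 24, 123, 29),
    (10340508, 42, 21, 20, 1),
    (10311622, 53, 20, 0, 33),
]


def invalid_fraction(row):
    """Invalid results as a fraction of results with a known outcome.

    Assumption: "Don't know" results are left out of the denominator.
    """
    _, _, invalid, valid, _ = row
    known = invalid + valid
    return invalid / known if known else 0.0


def rank_hosts(rows):
    """Sort hosts from the highest invalid fraction to the lowest."""
    return sorted(rows, key=invalid_fraction, reverse=True)


if __name__ == "__main__":
    for row in rank_hosts(ROWS):
        host_id, total, invalid, valid, dont_know = row
        # Sanity check: the three outcome columns add up to the total.
        assert invalid + valid + dont_know == total
        print(f"{host_id:>9}  invalid={invalid:>4}  "
              f"fraction={invalid_fraction(row):.3f}")
```

Ranked this way, host 9996388 has by far the most invalid results in absolute terms but an invalid fraction of only about 2%, while several low-volume hosts return almost nothing valid. Which of the two measures matters more is a judgment for the project.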
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks a million for your persistence. Eric. Ah well; I have run this case at CERN. The results are genuine: |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Another step forwards thanks to Grubix. First let me say that I am going to use "client" for the volunteers, i.e. you, and "customer" for the CERN users including myself (because you have a BOINC client, even if you are providing me a service). As a result of looking at the second specific case Grubix gave me I have identified the following problem. For one particular customer I have found 14,000 similar cases where an empty fort.10 is returned. This is a valid result, but only if the customer has not properly tested before using BOINC. If the initial conditions are not correct, pre-processing will fail, and tracking never starts. In practice I don't think I have ever seen this, but I have nonetheless found 14,000 empty files returned out of about 3,000,000. This is NOT a problem on our side because these results are rejected and the case is run again. So, somehow WUs are being killed or something BUT return a result (empty). This happens twice and the two empty results are validated. The poor client who has been churning away for a million turns is now treated as invalid and of course gets no credit! I do not yet know to which operating systems or BOINC client versions this applies. The important point is to make a fix. I am proposing to treat empty result files as invalid. (It would take too long to build new Sixtrack versions, and anyway I am not sure how I could fix this.) Note that if we have a genuine empty result it will have been produced in a few seconds and not much time/credit will have been lost. Sadly this does NOT explain my result differences :-( which I am working on with priority even if the error rate is small and the physics overall seems OK. I am hot on the trail of a possible gfortran bug in the hope it might help me with our ifort production versions. It may not be a bug, but a problem common to gfortran and ifort with the underlying hardware. It may give further insight.
This "bug" is reproducible at CERN using different gfortran versions and I have already determined that differences arise during the first 10,000 turns, a big help in testing. So I will request that null results are invalidated, continue working hard on the gfortran question, and don't forget we already found a Sixtrack bug even if it turned out to be irrelevant. (I just wonder how other applications like climate prediction can be sure of their results, when I find wrong validated cases! Perhaps these small differences are not statistically significant.) Apologies and Thanks. Eric. P.S. For light relief, I am posting a pointer to the LEGO LHC to the Cafe. You too can have, or will soon be able to have, your own miniature accelerator. :-) |
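Editorial sketch of the rule proposed in the post above (an empty result file never validates, even against another empty file). This is not the project's validator and does not use any BOINC API: the names `is_empty_result` and `validate_pair`, the three status strings, and the byte-for-byte comparison of non-empty files are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def is_empty_result(path):
    """True if the result file is missing, zero-length, or whitespace only."""
    p = Path(path)
    if not p.is_file():
        return True
    return p.read_bytes().strip() == b""


def validate_pair(path_a, path_b):
    """Return a (status_a, status_b) pair for two replicas of one work unit.

    "valid"        : both files are non-empty and byte-identical
    "invalid"      : the file is empty (the proposed rule)
    "inconclusive" : non-empty but unconfirmed, so another replica is needed
    """
    a_empty = is_empty_result(path_a)
    b_empty = is_empty_result(path_b)
    if not a_empty and not b_empty:
        same = Path(path_a).read_bytes() == Path(path_b).read_bytes()
        status = "valid" if same else "inconclusive"
        return status, status
    return ("invalid" if a_empty else "inconclusive",
            "invalid" if b_empty else "inconclusive")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        full = Path(tmp) / "fort.10.a"
        empty_1 = Path(tmp) / "fort.10.b"
        empty_2 = Path(tmp) / "fort.10.c"
        full.write_text("1.0 2.0 3.0\n")  # made-up stand-in for tracking output
        empty_1.write_text("")
        empty_2.write_text("")
        print(validate_pair(empty_1, empty_2))  # two empty files no longer agree
        print(validate_pair(full, empty_1))     # the long-running result stays open
```

Under this rule the host that tracked for a million turns is no longer outvoted by two empty files. Its result stays open until another non-empty replica arrives, at the cost of one extra run in the rare case of a genuinely empty result.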
Send message Joined: 14 Jul 05 Posts: 4 Credit: 441,483 RAC: 0 |
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=29668504 http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=29664080 Coincidence. 43k CPU sec vs 2.1 and 1.0 CPU sec. |
Send message Joined: 3 Jul 08 Posts: 20 Credit: 8,281,604 RAC: 0 |
Hello Eric, thank you very much for the detailed explanation. Coincidence. 43k CPU sec vs 2.1 and 1.0 CPU sec. A great example I think. The "dream team" ;-) Dmitry and aqvario working on a task with a regular wingman. The worst case for a task. In the first example, the wingman is even a computer from the LHC itself. :-) Bye, Grubix. |
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
As far as I remember, in CPDN the quorum is one. Results are then post-processed by Oxford University computers, which check their validity. Tullio |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Yet another case of the same problem. Sorry. Eric (I am not investigating further, just waiting for the fix.) |
Send message Joined: 11 Sep 08 Posts: 25 Credit: 384,225 RAC: 0 |
Eric, the project is important. I have had a new system for the last 3 weeks, and for the first time ever I had invalid results. If it is your fault, OK. If it is here, OK. It happens, but we are in the same boat, with the same goals. I feel you sometimes take the view that if a user gives effort (time on his machine) and the results are lost, it is a big problem. I don't think that way, and I hope most users do not even care, unless it gets out of hand. Most of my failures, about 2 weeks ago, were due to my newly rebuilt computer locking up or crashing. It turned out that with the heat sink Intel furnished, the i7 4790K CPU, which is supposed to run at 4.0 GHz, would not run at 3 GHz without overheating. So I put a water cooler on it, and with the 182 watts it generates when run full out at 4.72 GHz, it is kept to 71 C now. It is not your fault that so many of my work units bombed out with emergency shutdowns. Now it is stable. Next, you and I are going to talk about cloud storage. I have 4 TB waiting to be used, and with an ASPERA network connection, from next door or one third of the way around the world, you transfer at top speed (does not use IP). 500 mb from Japan to here transfers in 62 seconds over a NORMAL 50 mb network. We need to talk. Bill in Brazil |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Eric, |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Eric, |
Send message Joined: 9 Jan 08 Posts: 66 Credit: 727,923 RAC: 0 |
By the way. This host is on the loose again making invalids (currently 6317 invalids): http://lhcathomeclassic.cern.ch/sixtrack/show_host_detail.php?hostid=9996388 Maybe you need to ban him for another 2 weeks :) |
Send message Joined: 9 Jan 08 Posts: 66 Credit: 727,923 RAC: 0 |
Double post... |
Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388 Correct, this host spoiled 5 of 5 tasks for me yesterday and needs to be resolved. Overclocking, I bet. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
I have already checked and he is still suspended... I could not renew the suspension, and I tried. I have to wait until the current suspension, which does not appear to be effective, expires. Eric. |
Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
Eric, is there any way to ban a particular host rather than the user? In that case, would all tasks assigned to it be discarded indefinitely, whether or not they were calculated on the user's side? The other hosts would then carry on as if unaware of the broken host, which would save us a fortune in time and energy :) |
Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Eric Ah! Eric, you're using the wrong tool! If you look at host 9996388, the owner is shown as "(banished: ID 147506)". That's designed to block spammers and other nuisances from these message boards - it doesn't affect his computer processing. Instead, you should be using Blacklisting hosts to stop the workflow to that host - and then lift the banishment, so that he can come here and talk to us about it! |
©2024 CERN