Thread 'Host messing up tons of results'

Author	Message
[AF>FAH-Addict.net]toTOW Send message Joined: 9 Oct 10 Posts: 77 Credit: 3,727,865 RAC: 0	Message 27245 - Posted: 30 Mar 2015, 22:23:48 UTC I don't know if it's related or not, but I just got a computation error on http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=30223344 ... this is the first time I have seen such a thing on this project/computer. Eric> you said that you suspect the issue is with modern hardware, but the host that is messing up a lot of results is already getting a bit old (it's a plain i7 2600 with no overclocking capabilities, but it could be overheating or using faulty memory sticks). I know that's quite rare, but it might also be a faulty CPU ... does anyone know this host on other BOINC projects? Do you plan to run your test with the "sixtracktest" application? If yes, I think I'm already opted in to this application and ready to test :) ID: 27245 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 3,282	Message 27246 - Posted: 31 Mar 2015, 7:09:21 UTC We are also blocked from testing on boinctest, especially the version which returns all results. I am also going to appeal for more volunteers to sign up for testing. I would offer my boxes to help you test. What would I have to do ? Supporting BOINC, a great concept ! ID: 27246 · Reply Quote

Grubix Send message Joined: 3 Jul 08 Posts: 20 Credit: 8,281,604 RAC: 0	Message 27247 - Posted: 31 Mar 2015, 7:44:27 UTC - in response to Message 27231. Ah well; I have run this case at CERN. The results are genuine: Thanks for your reply. I have checked your long host list and found a more interesting example on one of my computers. I have only one invalid WU: 29657154 My computer (i7-3930K/HT) was invalid after 22069 seconds. A Phenom 1090T was valid after 17366 seconds with Dmitry, while he needed only 0.22 seconds. Stunning. I am also going to appeal for more volunteers to sign up for testing. Do you mean the checkmark in the preferences next to "Run test applications?" (most of my hosts are ready) or another project like vLHC? Bye, Grubix. ID: 27247 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27248 - Posted: 31 Mar 2015, 21:37:55 UTC - in response to Message 27247. Well, this is pretty strange, but I believe we are somehow getting back empty/zero results files...another issue. This doesn't affect the study as these "valid" results are ignored....I shall look more closely soonest. Eric. ID: 27248 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27249 - Posted: 31 Mar 2015, 21:48:22 UTC Update 31st March. I have finally accessed the database. (I append some more numbers, a shorter list.) I need to concentrate now on the wrong Valid results. Many of the invalids are "normal", aborted by user, no disk space, etc etc. I need a more sophisticated analysis where I look at the actual reason for Invalidation. I could go to running 3 copies and insist on 3 identical; maybe when the workload is lower. Eric. Tough problem; study independent, could be software of course, but seems to be OS independent, we use only ifort in production, but maybe only ssse2/sse3/pni having errors, "older" machines seem OK. Still, progress at last. Results based on database 31st March, 2015 but only looking at hosts with previous invalids in my log.
HostID      Total    Invalid  Valid    Don't know
Totals      153287   2504     134690   16093
9996388     45057    870      44185    2
10334649    291      210      48       33
10353795    75       56       5        14
10338771    55       47       7        1
10151727    170      43       78       49
10352446    43       42       0        1
10352445    40       37       2        1
10348627    507      30       412      65
10332524    65       30       22       13
10325398    37       30       4        3
8424458     234      29       156      49
10332528    59       29       15       15
10282106    176      24       123      29
10340508    42       21       20       1
etc etc etc
10311622    53       20       0        33
ID: 27249 · Reply Quote
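A minimal sketch of how the per-host counts above could be turned into invalid rates, using only the first six rows of Eric's table. The function names are hypothetical, and "Invalid" here still includes the "normal" invalids (user aborts, no disk space) that Eric mentions, so a high rate is a hint rather than a verdict.

```python
# Rows copied from the table in the post: (host_id, total, invalid, valid, unknown).
ROWS = [
    (9996388, 45057, 870, 44185, 2),
    (10334649, 291, 210, 48, 33),
    (10353795, 75, 56, 5, 14),
    (10338771, 55, 47, 7, 1),
    (10151727, 170, 43, 78, 49),
    (10352446, 43, 42, 0, 1),
]


def invalid_rate(total, invalid):
    """Fraction of a host's results that were marked invalid."""
    return invalid / total if total else 0.0


def rank_by_rate(rows):
    """Return (host_id, invalid, rate) tuples, highest invalid rate first."""
    ranked = []
    for host_id, total, invalid, valid, unknown in rows:
        # Sanity check: the three categories should add up to the total.
        assert total == invalid + valid + unknown
        ranked.append((host_id, invalid, invalid_rate(total, invalid)))
    return sorted(ranked, key=lambda r: r[2], reverse=True)


ranked = rank_by_rate(ROWS)
for host_id, invalid, rate in ranked:
    print(f"{host_id:>9}  invalid={invalid:>4}  rate={rate:.1%}")
```

Ranked this way, host 9996388 has the most invalid results in absolute terms but the lowest rate of the six, simply because it returns so many results; the small hosts with rates above 70% stand out instead.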

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27251 - Posted: 1 Apr 2015, 16:46:03 UTC - in response to Message 27247. Ah well; I have run this case at CERN. The results are genuine: Thanks for your reply. I have checked your long host list and found a more interesting example on one of my computers. I have only one invalid WU: 29657154 My computer (i7-3930K/HT) was invalid after 22069 seconds. A Phenom 1090T was valid after 17366 seconds with Dmitry, while he needed only 0.22 seconds. Stunning. Indeed; this is a breakthrough. I have investigated thoroughly and this looks like the empty/zero result problem. I am making a new post to this thread explaining what is going on. I am also going to appeal for more volunteers to sign up for testing. Do you mean the checkmark in the preferences next to "Run test applications?" (most of my hosts are ready) or another project like vLHC? Yes I did mean that and you are all set. Bye, Grubix. Thanks a million for your persistence. Eric. ID: 27251 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27252 - Posted: 1 Apr 2015, 17:20:35 UTC Another step forwards thanks to Grubix. First let me say that I am going to use "client" for the volunteers, i.e. you, and "customer" for the CERN users including myself (because you have a BOINC client, even if you are providing me a service). As a result of looking at the second specific case Grubix gave me I have identified the following problem. For one particular customer I have found 14,000 similar cases where an empty fort.10 is returned. This is a valid result, but only if the customer has not properly tested before using BOINC. If the initial conditions are not correct, pre-processing will fail, and tracking never starts. In practice I don't think I have ever seen this, but I have nonetheless found 14,000 empty files returned out of about 3,000,000. This is NOT a problem on our side because these results are rejected and the case is run again. So, somehow WUs are being killed or something BUT return a result (empty). This happens twice and the two empty results are validated. The poor client who has been churning away for a million turns is now treated as invalid and of course gets no credit! I do not yet know to which operating systems or BOINC client versions this applies. The important point is to make a fix. I am proposing to treat empty result files as invalid. (It would take too long to build new Sixtrack versions, and anyway I am not sure how I could fix this.) Note that if we have a genuine empty result it will have been produced in a few seconds and not much time/credit will have been lost. Sadly this does NOT explain my result differences :-( which I am working on with priority even if the error rate is small and the physics overall seems OK. 
I am hot on the trail of a possible gfortran bug in the hope it might help me with our ifort production versions. It may not be a bug, but a problem common to gfortran and ifort with the underlying hardware. It may give further insight. This "bug" is reproducible at CERN using different gfortran versions and I have already determined that differences arise during the first 10,000 turns, a big help in testing. So I will request that null results are invalidated, continue working hard on the gfortran question, and don't forget we already found a Sixtrack bug even if it turned out to be irrelevant. (I just wonder how other applications like climate prediction can be sure of their results, when I find wrong validated cases! Perhaps these small differences are not statistically significant.) Apologies and Thanks. Eric. P.S. For light relief, I am posting a pointer to the LEGO LHC to the Cafe. You too can have, or will soon be able to have, your own miniature accelerator. :-) ID: 27252 · Reply Quote
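The proposed fix (treat empty result files as invalid so that two empty files can no longer validate each other) can be sketched as below. This is an illustration only, not the project's validator: the function names, the byte-for-byte comparison, and the quorum of 2 are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def is_empty_result(data: bytes) -> bool:
    """A result file counts as empty if it holds nothing but whitespace."""
    return len(data.strip()) == 0


def validate(
    results: Dict[str, bytes], quorum: int = 2
) -> Tuple[Optional[bytes], List[str], List[str]]:
    """Return (canonical_result, valid_ids, invalid_ids).

    Empty files are marked invalid up front, so they can never form
    a quorum among themselves. Non-empty files must match exactly.
    """
    invalid = [rid for rid, data in results.items() if is_empty_result(data)]

    # Group the remaining results by their exact content.
    groups: Dict[bytes, List[str]] = {}
    for rid, data in results.items():
        if rid not in invalid:
            groups.setdefault(data, []).append(rid)

    for data, ids in groups.items():
        if len(ids) >= quorum:
            # Non-empty results that disagree with the quorum are invalid too.
            losers = [r for r in results if r not in ids and r not in invalid]
            return data, ids, invalid + losers

    # No consensus yet: only the empty files are invalid, more results are needed.
    return None, [], invalid


# Two empty files plus one real result: the real one is no longer outvoted.
print(validate({"a": b"", "b": b"  \n", "c": b"1.0 2.0\n"}))
```

With this rule the host that tracked for a million turns stays pending until a matching wingman arrives, instead of being marked invalid by two empty files.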

Armagedets Send message Joined: 14 Jul 05 Posts: 4 Credit: 441,483 RAC: 0	Message 27255 - Posted: 2 Apr 2015, 7:53:40 UTC Last modified: 2 Apr 2015, 7:54:15 UTC http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=29668504 http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=29664080 Coincidence. 43k CPU sec vs 2.1 and 1.0 CPU sec. ID: 27255 · Reply Quote

Grubix Send message Joined: 3 Jul 08 Posts: 20 Credit: 8,281,604 RAC: 0	Message 27256 - Posted: 2 Apr 2015, 10:10:40 UTC - in response to Message 27255. Hello Eric, thank you very much for the detailed explanation. Coincidence. 43k CPU sec vs 2.1 and 1.0 CPU sec. A great example I think. The "dream team" ;-) Dmitry and aqvario working on a task with a regular wingman. The worst case for a task. In the first example, the wingman is even a computer from the LHC itself. :-) Bye, Grubix. ID: 27256 · Reply Quote

tullio Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 27258 - Posted: 2 Apr 2015, 10:13:57 UTC - in response to Message 27252. As far as I remember in CPDN quorum is one. Results are then post processed by Oxford U computers which check their validity. Tullio ID: 27258 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27260 - Posted: 2 Apr 2015, 18:15:52 UTC - in response to Message 27255. Yet another case of the same problem. Sorry. Eric (I am not investigating further, just waiting for the fix.) ID: 27260 · Reply Quote

William C Wilson Send message Joined: 11 Sep 08 Posts: 25 Credit: 384,225 RAC: 0	Message 27280 - Posted: 4 Apr 2015, 21:10:32 UTC - in response to Message 27252. Eric, The Project is important. I have a new system, last 3 weeks, and for the first time ever had invalid results. If it is your fault, OK. If here, OK. It happens, but we are in the same boat, with the same goals. I feel sometimes you take it to heart: if a user gives effort (time of his machine) and results are lost, it is a big problem. I don't think that way, and hope most users do not even care, unless it gets out of hand. Most of my failures on projects about 2 weeks ago were due to my newly rebuilt computer locking up or crashing. It turned out that the heat sink Intel furnished with the i7 4790k CPU, which is supposed to run at 4.0 GHz, would not run at 3 GHz without overheating. So I put a water cooler on it, and the 182 watts it generates when run flat out at 4.72 GHz is now kept to 71 C. Not your fault that so many of my work units bombed out with emergency shutdowns. Now it is stable. Next you and I are going to talk about cloud storage. I have 4 TB waiting to be used, and with an ASPERA network connection, from next door, or one third the way around the world, you transfer at top speed (does not use IP). 500 mb from Japan to here, transfer in 62 seconds over a NORMAL 50 mb network. We need to talk. Bill in Brazil ID: 27280 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27283 - Posted: 5 Apr 2015, 7:35:30 UTC - in response to Message 27280. Eric, The Project is importante. I have a new system, last 3 weeks, and first time ever had invalid results. If it is your fault, OK. If here, OK. It happens but we are in the same boat, with same goals. Bill, First thanks for your understanding and support, I wish everyone felt the same way. There are four problems I know of that worry me. I am running a huge number of tests over Easter in the hope of getting information about killer problem number 4. 1. I messed up by overlooking the need to set the max number of tries to say 10, when I want 3 identical results with only 3 tries as pointed out by Metallus in "New bizarre Work Units". No excuses, old age, more haste less speed. I can't fix this right now because of some permissions problem on the file uploadWorkunit_many. 2. We are getting too many empty result files. Don't know why yet, but we have a fix. This is probably the reason why so many 3/3 WUs are failing. 3. I have just this weekend tracked a result difference down to a problem with a recent gfortran/gcc release. This does not affect our ifort production but puts our plan to move to gfortran on hold. 4. I am still getting validated results which are wrong. To find these cases a customer case has to be run at least twice on BOINC. The number of known cases is small, but of course the number of unknown cases is unknown. I have never been able to reproduce any of these differences at CERN; hence the current campaign over Easter. Luckily the differences are also "small" and do not invalidate the complete study. I do not feel able to publish my work on "Getting identical results, 0 ULP (Unit in the last Place) IEEE 754 as intended" until 4. is resolved. 
Sadly almost everyone is more interested in speed and is not affected by the kind of very small differences that I am avoiding. (What I am trying to do, and I have almost got there, is reported as "being perhaps impossible" in the literature.) Intel were offered the elementary function library crlibm, which I use, but turned it down. I feel this 0 ULP functionality is important for Computer Science at least, and as a minimum this 0 ULP version of Sixtrack will be a great test of hardware and software (as you have found out!). I feel sometimes you take the feelings, if a users gives effort (time of his machine) and results are lost, it is a big problem. I dont think that way, and hope most users do not even care, unless out of hand. I do feel upset, especially when I have screwed up. Most of my projects about 2 weeks ago, was due to my new rebuilt computer locking up or crashing. Turned out that heat sink Intel furnished with the i7 4790k CPU that is suppose to be 4.0 ghz, would not run at 3 ghz without overheating. .....as I said SixTrack is a very demanding test. So put a water cooler on it, and the 182 watts it generates when run it full out of 4.72 ghz is kept to 71 C now. Not your fault that so many of my work units bombed out with emergency shut downs. Now it is stable....Good. Next you and I are going to talk about cloud storage. I have 4 TB waiting to be used, and with a ASPERA network connection, from next door, or one third the way around the world, you transfer at top speed (does not use IP). 500 mb from Japan to here, transfer in 62 seconds over a NORMAL 50 mb network. We need to talk. Give me your phone number and I shall call you (over Internet, Nonoh, as I have to pay for my own calls). Thanks again. Eric. Bill in Brazil ID: 27283 · Reply Quote
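Problem 1 in the post above (wanting 3 identical results but allowing only 3 tries) can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that each returned result is independently "good" with probability p and that all good results are identical; the 0.9 figure and the function name are made up for the example and are not project measurements.

```python
from math import comb


def p_quorum(max_tries: int, quorum: int, p_good: float) -> float:
    """Probability of collecting at least `quorum` good results
    within `max_tries` independent attempts (binomial tail)."""
    p_too_few = sum(
        comb(max_tries, k) * p_good**k * (1.0 - p_good) ** (max_tries - k)
        for k in range(quorum)
    )
    return 1.0 - p_too_few


# Hypothetical 90% chance that any single returned result is good.
for tries in (3, 5, 10):
    print(f"max tries = {tries:2d}: P(3 identical) = {p_quorum(tries, 3, 0.9):.6f}")
```

Under these assumptions a limit of 3 tries leaves no slack at all: every one of the three results has to be good, which happens only about 73% of the time, whereas a limit of 10 makes failure to reach the quorum very unlikely.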

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27284 - Posted: 5 Apr 2015, 7:51:22 UTC - in response to Message 27280. Eric, The Project is importante. I have a new system, last 3 weeks, and first time ever had invalid results. If it is your fault, OK. If here, OK. It happens but we are in the same boat, with same goals. Bill, First thanks for your understanding and support, I wish everyone felt the same way. There are four problems I know of that worry me. I am running a huge number of tests over Easter in the hope of getting information about killer problem number 4. 1. I messed up by overlooking the need to set the max number of tries to say 10, when I want 3 identical results with only 3 tries as pointed out by Metallus in "New bizarre Work Units". No excuses, old age, more haste less speed. I can't fix this right now because of some permissions problem on the file uploadWorkunit_many. Moral, and I should know better, DON'T change anything on a Friday, especially not a holiday weekend. 2. We are getting too many empty result files. Don't know why yet, but we have a fix. This is probably the reason why so many 3/3 WUs are failing. 3. I have just this weekend tracked a result difference down to a problem with a recent gfortran/gcc release. This does not affect our ifort production but puts our plan to move to gfortran on hold. 4. I am still getting validated results which are wrong. To find these cases a customer case has to be run at least twice on BOINC. The number of known cases is small, but of course the number of unknown cases is unknown. I have never been able to reproduce any of these differences at CERN; hence the current campaign over Easter. Luckily the differences are also "small" and do not invalidate the complete study. 
I do not feel able to publish my work on "Getting identical results, 0 ULP (Unit in the last Place) IEEE 754 as intended" until 4. is resolved. Sadly almost everyone is more interested in speed and is not affected by the kind of very small differences that I am avoiding. (What I am trying to do, and I have almost got there, is reported as "being perhaps impossible" in the literature.) Intel were offered the elementary function library crlibm, which I use, but turned it down. I feel this 0 ULP functionality is important for Computer Science at least, and as a minimum this 0 ULP version of Sixtrack will be a great test of hardware and software (as you have found out!). Actually most interest comes from Games programmers who want the same kind of portability as us, and for their games to operate identically on a wide range of hardware and software. I feel sometimes you take the feelings, if a users gives effort (time of his machine) and results are lost, it is a big problem. I dont think that way, and hope most users do not even care, unless out of hand. I do feel upset, especially when I have screwed up. Most of my projects about 2 weeks ago, was due to my new rebuilt computer locking up or crashing. Turned out that heat sink Intel furnished with the i7 4790k CPU that is suppose to be 4.0 ghz, would not run at 3 ghz without overheating. .....as I said SixTrack is a very demanding test. So put a water cooler on it, and the 182 watts it generates when run it full out of 4.72 ghz is kept to 71 C now. Not your fault that so many of my work units bombed out with emergency shut downs. Now it is stable....Good. Next you and I are going to talk about cloud storage. I have 4 TB waiting to be used, and with a ASPERA network connection, from next door, or one third the way around the world, you transfer at top speed (does not use IP). 500 mb from Japan to here, transfer in 62 seconds over a NORMAL 50 mb network. We need to talk. 
Give me your phone number and I shall call you (over Internet, Nonoh as I have to pay for my own calls and these are pretty much free). Waiting for the news that the LHC has restarted, in spite of an initial problem with a faulty magnet connection. Thanks again. Eric. (eric.mcintosh@cern.ch) Bill in Brazil ID: 27284 · Reply Quote
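For readers unfamiliar with the "0 ULP" goal mentioned above: two double-precision results are 0 ULP apart only when they are bit-for-bit identical. A minimal sketch of measuring the distance between two IEEE 754 doubles in units in the last place follows; the helper names are hypothetical, NaN is not handled, and this is not SixTrack's or BOINC's comparison code.

```python
import struct


def ordered_bits(x: float) -> int:
    """Map a double to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats have the sign bit set; mirror them below zero so that
    # +0.0 and -0.0 both map to 0 and adjacent floats map to adjacent integers.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable doubles between a and b (0 means identical,
    with +0.0 and -0.0 treated as the same value)."""
    return abs(ordered_bits(a) - ordered_bits(b))


print(ulp_distance(1.0, 1.0))              # identical values
print(ulp_distance(1.0, 1.0 + 2.0**-52))   # neighbouring doubles
print(ulp_distance(0.1 + 0.2, 0.3))        # a classic rounding difference
```

A 1 ULP difference such as the last case is invisible in most printed output, which is why results that merely "look the same" are not enough when the requirement is exact agreement across hosts.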

Uffe F Send message Joined: 9 Jan 08 Posts: 66 Credit: 727,923 RAC: 0	Message 27296 - Posted: 7 Apr 2015, 0:48:53 UTC - in response to Message 27284. By the way. This host is on the loose again making invalids (currently 6317 invalids): http://lhcathomeclassic.cern.ch/sixtrack/show_host_detail.php?hostid=9996388 Maybe you need to ban him for another 2 weeks :) ID: 27296 · Reply Quote

Uffe F Send message Joined: 9 Jan 08 Posts: 66 Credit: 727,923 RAC: 0	Message 27297 - Posted: 7 Apr 2015, 0:49:01 UTC - in response to Message 27284. Last modified: 7 Apr 2015, 0:51:36 UTC Double post... ID: 27297 · Reply Quote

alvin Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0	Message 27298 - Posted: 7 Apr 2015, 1:09:11 UTC - in response to Message 27296. http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388 correct, this host spoiled 5 tasks of 5 for me yesterday and needs to be resolved overclocking I bet ID: 27298 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27301 - Posted: 7 Apr 2015, 4:05:06 UTC - in response to Message 27296. I have already checked and he is still suspended....... I could not renew the suspension, and I tried. I have to wait until current suspension, which does not appear to be effective, expires. Eric. ID: 27301 · Reply Quote

alvin Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0	Message 27305 - Posted: 7 Apr 2015, 6:58:22 UTC - in response to Message 27301. Eric, is there any way to ban a particular host, not the user? In that case, would all tasks assigned to it be discarded indefinitely, whether or not they were calculated on the user's side? Other hosts would then just carry on, unaware of the broken host, and that would save us a fortune in time and energy :) ID: 27305 · Reply Quote

Richard Haselgrove Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0	Message 27306 - Posted: 7 Apr 2015, 8:24:20 UTC - in response to Message 27305. Eric, is there any way to ban a particular host, not the user? In that case, would all tasks assigned to it be discarded indefinitely, whether or not they were calculated on the user's side? Other hosts would then just carry on, unaware of the broken host, and that would save us a fortune in time and energy :) Ah! Eric, you're using the wrong tool! If you look at host 9996388, the owner is shown as "(banished: ID 147506)". That's designed to block spammers and other nuisances from these message boards - it doesn't affect his computer processing. Instead, you should be using "Blacklisting hosts" to stop the workflow to that host - and then lift the banishment, so that he can come here and talk to us about it! ID: 27306 · Reply Quote