Message boards : Number crunching : Host messing up tons of results
[AF>FAH-Addict.net]toTOW

Joined: 9 Oct 10
Posts: 77
Credit: 3,671,357
RAC: 0
Message 27245 - Posted: 30 Mar 2015, 22:23:48 UTC

I don't know if it's related or not, but I just got a computation error on http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=30223344 ... this is the first time I see such a thing on this project/computer.

Eric> you said that you suspect that the issue is with modern hardware, but the host that is messing up a lot of results is already getting a bit old (it's a plain i7 2600 with no overclocking capabilities, but it could be overheating, or using faulty memory sticks). I know that's quite rare, but it might also be a faulty CPU ... does anyone know this host on other BOINC projects ?

Do you plan to run your test with the "sixtracktest" application ? If yes, I think I'm already opted in to this application and ready to test :)
ID: 27245 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator

Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 27246 - Posted: 31 Mar 2015, 7:09:21 UTC

We are also blocked from testing on boinctest, especially the version
which returns all results. I am also going to appeal for more
volunteers to sign up for testing.

I would offer my boxes to help you test.

What would I have to do ?


Supporting BOINC, a great concept !
ID: 27246 · Report as offensive     Reply Quote
Profile Grubix

Joined: 3 Jul 08
Posts: 20
Credit: 8,281,604
RAC: 0
Message 27247 - Posted: 31 Mar 2015, 7:44:27 UTC - in response to Message 27231.  

Ah well; I have run this case at CERN. The results are genuine:

Thanks for your reply. I have checked your long host list and found a more interesting example on one of my computers. I have only one invalid WU: 29657154

My computer (i7-3930K/HT) was invalid after 22069 seconds. A Phenom 1090T was valid after 17366 seconds, together with Dmitry, who needed only 0.22 seconds. Stunning.


I am also going to appeal for more volunteers to sign up for testing.

Do you mean the checkmark in the preferences next to "Run test applications?" (most of my hosts are ready) or another project like vLHC?

Bye, Grubix.
ID: 27247 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27248 - Posted: 31 Mar 2015, 21:37:55 UTC - in response to Message 27247.  

Well, this is pretty strange, but I believe we are somehow getting
back empty/zero results files...another issue. This doesn't affect
the study as these "valid" results are ignored....I shall look more
closely soonest. Eric.
ID: 27248 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27249 - Posted: 31 Mar 2015, 21:48:22 UTC

Update 31st March. I have finally accessed the database.
(I append some more numbers, a shorter list.)
I need to concentrate now on the wrong Valid results.
Many of the invalids are "normal", aborted by user, no
disk space, etc etc. I need a more sophisticated analysis
where I look at the actual reason for Invalidation.

I could go to running 3 copies and insist on 3 identical;
maybe when the workload is lower. Eric.

Tough problem; study independent, could be software of course,
but seems to be OS independent, we use only ifort in production,
but maybe only ssse2/sse3/pni having errors, "older" machines
seem OK. Still, progress at last.

Results based on database 31st March, 2015
but only looking at hosts with previous invalids in my log.

HostID      Total  Invalid   Valid  Don't know
Totals     153287     2504  134690       16093
9996388     45057      870   44185           2
10334649      291      210      48          33
10353795       75       56       5          14
10338771       55       47       7           1
10151727      170       43      78          49
10352446       43       42       0           1
10352445       40       37       2           1
10348627      507       30     412          65
10332524       65       30      22          13
10325398       37       30       4           3
8424458       234       29     156          49
10332528       59       29      15          15
10282106      176       24     123          29
10340508       42       21      20           1
etc etc etc

10311622       53       20       0          33
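
To compare hosts in the table above, the invalid fraction per host can be computed directly from the listed numbers. The following is only a minimal sketch, not the analysis script used at CERN: the `HOSTS` dictionary and both function names are assumptions made for illustration, and only a handful of rows from the table are included.

```python
# A few rows copied from the table: host ID -> (total, invalid, valid, unknown).
HOSTS = {
    9996388: (45057, 870, 44185, 2),
    10334649: (291, 210, 48, 33),
    10353795: (75, 56, 5, 14),
    10352446: (43, 42, 0, 1),
    10348627: (507, 30, 412, 65),
}


def invalid_fraction(total, invalid, valid, unknown):
    """Fraction of a host's results that were marked invalid."""
    # Sanity check: the three outcome columns should add up to the total.
    assert invalid + valid + unknown == total
    return invalid / total


def rank_by_invalid_fraction(hosts):
    """Return (host_id, fraction) pairs, worst host first."""
    rates = [(host_id, invalid_fraction(*row)) for host_id, row in hosts.items()]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for host_id, fraction in rank_by_invalid_fraction(HOSTS):
        print(f"{host_id:>9}  {fraction:6.1%}")
```

On these rows, host 9996388 has by far the most invalid results in absolute terms (870) but an invalid fraction of only about 2%, while host 10352446 fails 42 of its 43 tasks. Ranking by fraction rather than by count may therefore point at different machines.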
ID: 27249 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27251 - Posted: 1 Apr 2015, 16:46:03 UTC - in response to Message 27247.  

Ah well; I have run this case at CERN. The results are genuine:

Thanks for your reply. I have checked your long host list and found a more interesting example on one of my computers. I have only one invalid WU: 29657154

My computer (i7-3930K/HT) was invalid after 22069 seconds. A Phenom 1090T was valid after 17366 seconds, together with Dmitry, who needed only 0.22 seconds. Stunning.

Indeed; this is a breakthrough. I have investigated thoroughly and this looks like the empty/zero result problem. I am making a new post to this
thread explaining what is going on.

I am also going to appeal for more volunteers to sign up for testing.

Do you mean the checkmark in the preferences next to "Run test applications?" (most of my hosts are ready) or another project like vLHC?

Yes I did mean that and you are all set.

Bye, Grubix.
Thanks a million for your persistence. Eric.
ID: 27251 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27252 - Posted: 1 Apr 2015, 17:20:35 UTC

Another step forwards thanks to Grubix.

First let me say that I am going to use client for the volunteers i.e. you
and customer for the CERN users including myself (because you have a
boinc client, even if you are providing me a service).

As a result of looking at the second specific case Grubix gave me I
have identified the following problem.

For one particular customer I have found 14,000 similar cases where
an empty fort.10 is returned. This is a valid result, but only if the
customer has not properly tested before using BOINC. If the initial
conditions are not correct, pre-processing will fail, and tracking never
starts. In practice I don't think I have ever seen this, but I have
nonetheless found 14,000 empty files returned out of about 3,000,000.
This is NOT a problem on our side because these results are rejected
and the case is run again.

So, somehow WUs are being killed or something BUT return a result (empty).
This happens twice and the two empty results are validated. The poor client
who has been churning away for a million turns is now treated as invalid
and of course gets no credit!

I do not yet know to which operating systems or BOINC client versions this
applies. The important point is to make a fix. I am proposing to treat empty
result files as invalid. (It would take too long to build new Sixtrack
versions, and anyway I am not sure how I could fix this.) Note that if
we have a genuine empty result it will have been produced in a few
seconds and not much time/credit will have been lost.
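
The proposed fix can be sketched roughly as follows. This is an illustration only and does not use the real BOINC validator interface: the function `validate`, its `reject_empty` switch and the host names are all hypothetical, and results are assumed to agree only when they are byte-for-byte identical.

```python
from collections import defaultdict


def validate(results, quorum=2, reject_empty=True):
    """Pick the clients whose results are accepted for one work unit.

    `results` maps a (hypothetical) client name to the raw bytes of the
    result file it returned. Results agree only if byte-for-byte identical.
    """
    groups = defaultdict(set)
    for client, data in results.items():
        if reject_empty and len(data) == 0:
            # Proposed fix: an empty file is never allowed to validate.
            continue
        groups[data].add(client)
    for clients in groups.values():
        if len(clients) >= quorum:
            return clients
    # No agreeing group yet: the work unit would need another copy sent out.
    return set()


# Illustrative data: two near-instant empty returns and one full run.
returned = {
    "fast_host_1": b"",
    "fast_host_2": b"",
    "slow_host": b"tracking output after a million turns",
}

print(validate(returned, reject_empty=False))  # the two empty results win
print(validate(returned))                      # nobody validates yet
```

Without the check, the two empty files form a quorum and the host that did the real work is marked invalid, which is the situation described above. With the check, no result is accepted until a second non-empty result arrives that matches the first.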

Sadly this does NOT explain my result differences :-( which I am working
on with priority even if the error rate is small and the physics overall
seems OK.

I am hot on the trail of a possible gfortran bug in the hope it might help me with our ifort production versions. It may not be a bug, but a problem
common to gfortran and ifort with the underlying hardware. It may give
further insight. This "bug" is reproducible at CERN using different
gfortran versions and I have already determined that differences
arise during the first 10,000 turns, a big help in testing.

So I will request that null results are invalidated, continue working
hard on the gfortran question, and don't forget we already found a
Sixtrack bug even if it turned out to be irrelevant.

(I just wonder how other applications like climate prediction can be
sure of their results, when I find wrong validated cases! Perhaps these
small differences are not statistically significant.)
Apologies and Thanks. Eric.

P.S. For light relief, I am posting a pointer to the LEGO LHC to the Cafe. You too can have, or will soon be able to have, your own miniature accelerator. :-)
ID: 27252 · Report as offensive     Reply Quote
Armagedets

Joined: 14 Jul 05
Posts: 4
Credit: 441,483
RAC: 0
Message 27255 - Posted: 2 Apr 2015, 7:53:40 UTC
Last modified: 2 Apr 2015, 7:54:15 UTC

ID: 27255 · Report as offensive     Reply Quote
Profile Grubix

Joined: 3 Jul 08
Posts: 20
Credit: 8,281,604
RAC: 0
Message 27256 - Posted: 2 Apr 2015, 10:10:40 UTC - in response to Message 27255.  

Hello Eric, thank you very much for the detailed explanation.


Coincidence. 43k CPU sec vs 2.1 and 1.0 CPU sec.

A great example I think. The "dream team" ;-) Dmitry and aqvario working on a task with a regular wingman. The worst case for a task. In the first example, the wingman is even a computer from the LHC itself. :-)

Bye, Grubix.
ID: 27256 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 27258 - Posted: 2 Apr 2015, 10:13:57 UTC - in response to Message 27252.  

As far as I remember in CPDN quorum is one. Results are then post processed by Oxford U computers which check their validity.
Tullio
ID: 27258 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27260 - Posted: 2 Apr 2015, 18:15:52 UTC - in response to Message 27255.  

Yet another case of the same problem. Sorry. Eric
(I am not investigating further, just waiting for the fix.)
ID: 27260 · Report as offensive     Reply Quote
William C Wilson

Joined: 11 Sep 08
Posts: 25
Credit: 384,225
RAC: 0
Message 27280 - Posted: 4 Apr 2015, 21:10:32 UTC - in response to Message 27252.  

Eric,
The Project is important. I have a new system, last 3 weeks, and for the first time ever had invalid results. If it is your fault, OK. If it is here, OK. It happens, but we are in the same boat, with the same goals.


I sometimes feel you take the view that if a user gives effort (time on his machine) and the results are lost, it is a big problem. I don't think that way, and hope most users do not even care, unless it gets out of hand. The trouble on most of my projects about 2 weeks ago was due to my newly rebuilt computer locking up or crashing. It turned out that the heat sink Intel furnished with the i7 4790K CPU, which is supposed to run at 4.0 GHz, would not run at 3 GHz without overheating.

So I put a water cooler on it, and the 182 watts it generates when run flat out at 4.72 GHz are now kept to 71 C. It is not your fault that so many of my work units bombed out with emergency shutdowns. Now it is stable.

Next you and I are going to talk about cloud storage. I have 4 TB waiting to be used, and with an ASPERA network connection, from next door or one third of the way around the world, you transfer at top speed (does not use IP). 500 mb from Japan to here transfers in 62 seconds over a NORMAL 50 mb network.

We need to talk.

Bill in Brazil
ID: 27280 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27283 - Posted: 5 Apr 2015, 7:35:30 UTC - in response to Message 27280.  

Eric,
The Project is important. I have a new system, last 3 weeks, and for the first time ever had invalid results. If it is your fault, OK. If it is here, OK. It happens, but we are in the same boat, with the same goals.

Bill, First thanks for your understanding and support, I wish everyone
felt the same way. There are four problems I know of that worry me.
I am running a huge number of tests over Easter in the hope of
getting information about killer problem number 4.

1. I messed up by overlooking the need to set the max number of tries
to say 10, when I want 3 identical results with only 3 tries as pointed out
by Metallus in "New bizarre Work Units". No excuses, old age, more haste
less speed. I can't fix this right now because of some permissions
problem on the file uploadWorkunit_many.

2. We are getting too many empty result files. Don't know why yet, but we
have a fix. This is probably the reason why so many 3/3 WUs are failing.

3. I have just this weekend tracked a result difference down to a problem with a recent gfortran/gcc release. This does not affect our ifort production but puts our plan to move to gfortran on hold.

4. I am still getting validated results which are wrong. To find these cases a customer case has to be run at least twice on BOINC. The number of known cases is small, but of course the number of unknown cases is unknown. I have
never been able to reproduce any of these differences at CERN; hence the
current campaign over Easter. Luckily the differences are also "small" and
do not invalidate the complete study.

I do not feel able to publish my work on "Getting identical results, 0 ULP
(Unit in the Last Place) IEEE 754 as intended" until 4. is resolved. Sadly
almost everyone is more interested in speed and is not affected by the kind
of very small differences that I am avoiding. (What I am trying to do, and
I have almost got there, is reported as "being perhaps impossible" in the
literature.) Intel were offered the elementary function library
crlibm, which I use, but turned it down. I feel this 0 ULP functionality
is important for Computer Science at least, and as a minimum this 0 ULP
version of Sixtrack will be a great test of hardware and software (as you have found out!).

I sometimes feel you take the view that if a user gives effort (time on his machine) and the results are lost, it is a big problem. I don't think that way, and hope most users do not even care, unless it gets out of hand.

I do feel upset, especially when I have screwed up.

The trouble on most of my projects about 2 weeks ago was due to my newly rebuilt computer locking up or crashing. It turned out that the heat sink Intel furnished with the i7 4790K CPU, which is supposed to run at 4.0 GHz, would not run at 3 GHz without overheating. .....as I said SixTrack is a very demanding test.

So I put a water cooler on it, and the 182 watts it generates when run flat out at 4.72 GHz are now kept to 71 C. It is not your fault that so many of my work units bombed out with emergency shutdowns. Now it is stable....Good.

Next you and I are going to talk about cloud storage. I have 4 TB waiting to be used, and with an ASPERA network connection, from next door or one third of the way around the world, you transfer at top speed (does not use IP). 500 mb from Japan to here transfers in 62 seconds over a NORMAL 50 mb network.

We need to talk.

Give me your phone number and I shall call you (over the Internet, using Nonoh, as I have to pay for my own calls).

Thanks again. Eric.


Bill in Brazil

ID: 27283 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27284 - Posted: 5 Apr 2015, 7:51:22 UTC - in response to Message 27280.  

Eric,
The Project is important. I have a new system, last 3 weeks, and for the first time ever had invalid results. If it is your fault, OK. If it is here, OK. It happens, but we are in the same boat, with the same goals.

Bill, First thanks for your understanding and support, I wish everyone
felt the same way. There are four problems I know of that worry me.
I am running a huge number of tests over Easter in the hope of
getting information about killer problem number 4.

1. I messed up by overlooking the need to set the max number of tries
to say 10, when I want 3 identical results with only 3 tries as pointed out
by Metallus in "New bizarre Work Units". No excuses, old age, more haste
less speed. I can't fix this right now because of some permissions
problem on the file uploadWorkunit_many. Moral, and I should know
better, DON'T change anything on a Friday, especially not a holiday
weekend.

2. We are getting too many empty result files. Don't know why yet, but we
have a fix. This is probably the reason why so many 3/3 WUs are failing.

3. I have just this weekend tracked a result difference down to a problem with a recent gfortran/gcc release.
This does not affect our ifort production but puts our plan to move to gfortran on hold.

4. I am still getting validated results which are wrong.
To find these cases a customer case has to be run twice at least on BOINC.

The number of known cases is small, but of course the number of unknown cases is unknown. I have
never been able to reproduce any of these differences at CERN; hence the
current campaign over Easter. Luckily the differences are also "small" and
do not invalidate the complete study.
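
As a rough illustration of why the maximum number of tries in item 1 matters, the chance of collecting 3 identical results can be estimated with a simple binomial model. This is a sketch under strong assumptions that are not from the thread: every returned result is independently "bad" with the same probability, all good results are identical, and a bad result never matches anything. The 10% failure rate is purely illustrative, and `p_quorum` is a hypothetical helper.

```python
from math import comb


def p_quorum(p_bad, needed, max_tries):
    """Probability of collecting `needed` good results within `max_tries`.

    Equivalent to "at least `needed` successes in `max_tries` independent
    trials", each succeeding with probability 1 - p_bad.
    """
    p_good = 1.0 - p_bad
    return sum(
        comb(max_tries, k) * p_good**k * p_bad ** (max_tries - k)
        for k in range(needed, max_tries + 1)
    )


if __name__ == "__main__":
    p_bad = 0.10  # illustrative failure rate, not a measured value
    for tries in (3, 5, 10):
        print(f"3 identical within {tries:2d} tries: {p_quorum(p_bad, 3, tries):.6f}")
```

Under these assumptions, requiring 3 identical results with only 3 tries succeeds with probability 0.9^3 = 0.729, because a single bad or empty result dooms the work unit. Allowing up to 10 tries brings the success probability very close to 1.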

I do not feel able to publish my work on "Getting identical results, 0 ULP
(Unit in the last Place) IEEE 754 as intended" until 4. is resolved. Sadly
almost everyone is more interested in speed and is not affected by the kind
of very small differences that I am avoiding. (What I am trying to do and
I have almost got there is reported as "being perhaps impossible" in the
literature.) Intel were offered the elementary function library
crlibm, which I use, but turned it down. I feel this 0 ULP functionality
is important for Computer Science at least, and as a minimum this 0 ULP
version of Sixtrack will be a great test of hardware and software (as you have found out!).
Actually most interest comes from Games programmers who
want the same kind of portability as us, and for their games to operate
identically on a wide range of hardware and software.
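
To show what a difference of 0 ULP versus 1 ULP means in practice, here is a small Python sketch. It is not SixTrack code and says nothing about where the differences in the study come from; it only demonstrates that mathematically equal expressions, evaluated in a different order, can give IEEE 754 doubles that differ in the last bit. The helper names `bits` and `ulp_distance` are made up for this example.

```python
import struct


def bits(x):
    """The 64-bit IEEE 754 pattern of a Python float, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def ulp_distance(a, b):
    """Distance in units in the last place between two doubles.

    Only meaningful here for finite values of the same sign, where the
    integer bit patterns are ordered the same way as the floats.
    """
    return abs(bits(a) - bits(b))


# The same three numbers summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)              # False
print(ulp_distance(left, right))  # 1
```

A validator that insists on bit-for-bit identical results would treat these two sums as different, which is why a 0 ULP code has to pin down evaluation order and the elementary functions it calls.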

I sometimes feel you take the view that if a user gives effort (time on his machine) and the results are lost, it is a big problem. I don't think that way, and hope most users do not even care, unless it gets out of hand.

I do feel upset, especially when I have screwed up.

The trouble on most of my projects about 2 weeks ago was due to my newly rebuilt computer locking up or crashing. It turned out that the heat sink Intel furnished with the i7 4790K CPU, which is supposed to run at 4.0 GHz, would not run at 3 GHz without overheating. .....as I said SixTrack is a very demanding test.

So I put a water cooler on it, and the 182 watts it generates when run flat out at 4.72 GHz are now kept to 71 C. It is not your fault that so many of my work units bombed out with emergency shutdowns. Now it is stable....Good.

Next you and I are going to talk about cloud storage. I have 4 TB waiting to be used, and with an ASPERA network connection, from next door or one third of the way around the world, you transfer at top speed (does not use IP). 500 mb from Japan to here transfers in 62 seconds over a NORMAL 50 mb network.

We need to talk.

Give me your phone number and I shall call you (over the Internet, using Nonoh, as I have to pay for my own calls and these are pretty much free).

Waiting for the news that the LHC has restarted, in spite of an initial
problem with a faulty magnet connection.

Thanks again. Eric. (eric.mcintosh@cern.ch)


Bill in Brazil

ID: 27284 · Report as offensive     Reply Quote
Uffe F

Joined: 9 Jan 08
Posts: 66
Credit: 727,923
RAC: 0
Message 27296 - Posted: 7 Apr 2015, 0:48:53 UTC - in response to Message 27284.  

By the way. This host is on the loose again making invalids (currently 6317 invalids):

http://lhcathomeclassic.cern.ch/sixtrack/show_host_detail.php?hostid=9996388

Maybe you need to ban him for another 2 weeks :)
ID: 27296 · Report as offensive     Reply Quote
Uffe F

Joined: 9 Jan 08
Posts: 66
Credit: 727,923
RAC: 0
Message 27297 - Posted: 7 Apr 2015, 0:49:01 UTC - in response to Message 27284.  
Last modified: 7 Apr 2015, 0:51:36 UTC

Double post...
ID: 27297 · Report as offensive     Reply Quote
alvin

Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27298 - Posted: 7 Apr 2015, 1:09:11 UTC - in response to Message 27296.  

http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388
Correct, this host spoiled 5 of my 5 tasks yesterday and this needs to be resolved.
Overclocking, I bet.
ID: 27298 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27301 - Posted: 7 Apr 2015, 4:05:06 UTC - in response to Message 27296.  

I have already checked and he is still suspended.......
I could not renew the suspension, and I tried. I
have to wait until current suspension, which does not
appear to be effective, expires. Eric.
ID: 27301 · Report as offensive     Reply Quote
alvin

Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27305 - Posted: 7 Apr 2015, 6:58:22 UTC - in response to Message 27301.  

Eric
Is there any way to ban a particular host, not the user? In that case would all tasks assigned to it be discarded indefinitely, whether or not they were calculated on the user's side?
Then the other hosts would simply carry on, unaware of the broken host, and it would save us a fortune in time and energy.
ID: 27305 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27306 - Posted: 7 Apr 2015, 8:24:20 UTC - in response to Message 27305.  

Eric
Is there any way to ban a particular host, not the user? In that case would all tasks assigned to it be discarded indefinitely, whether or not they were calculated on the user's side?
Then the other hosts would simply carry on, unaware of the broken host, and it would save us a fortune in time and energy.

Ah!

Eric, you're using the wrong tool!

If you look at host 9996388, the owner is shown as "(banished: ID 147506)". That's designed to block spammers and other nuisances from these message boards - it doesn't affect his computer processing.

Instead, you should be using Blacklisting hosts to stop the workflow to that host - and then lift the banishment, so that he can come here and talk to us about it!
ID: 27306 · Report as offensive     Reply Quote