Message boards : News : Three Problems, 22nd May.

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 843
Credit: 1,510,342
RAC: 1,475
Message 26516 - Posted: 22 May 2014, 16:15:39 UTC

Settling down a bit; I am seeing around 2% WU failures.

Problem 1: EXIT_TIME_LIMIT_EXCEEDED. Tried to minimise this
and will hopefully implement "outliers" to avoid it in future.

Problem 2: Can't Create Process and I will look for help on this.
Probably connected with our build but we shall see.

Problem 3: Found 545 invalid results involving 124 hosts.
One invalid result was duplicated! But I am not going to run
everything 3 times. Can live with this. The top 12 culprits gave
77 45 26 25 22 21 19 16 14 11 10 9 invalid results each.
(I thought we stopped using hosts with this many errors......)
Seems to be hardware, overclocking, cosmic rays?????

Getting a lot of production done successfully. Eric.
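
A quick tally of the Problem 3 figures shows how concentrated the invalid results are. This is only a sketch using the numbers quoted in the post; the variable names are made up for illustration.

```python
# Figures quoted in the post: 545 invalid results over 124 hosts,
# with the per-host counts of the 12 worst hosts listed explicitly.
top_invalid_counts = [77, 45, 26, 25, 22, 21, 19, 16, 14, 11, 10, 9]
total_invalid = 545
total_hosts = 124

# Invalid results that come from the top 12 hosts alone.
top_total = sum(top_invalid_counts)
top_share_percent = 100.0 * top_total / total_invalid

# What is left for all the other hosts.
remaining_invalid = total_invalid - top_total
remaining_hosts = total_hosts - len(top_invalid_counts)
average_remaining = remaining_invalid / remaining_hosts

print(f"top 12 hosts: {top_total} invalid results ({top_share_percent:.1f}%)")
print(f"other {remaining_hosts} hosts: {remaining_invalid} invalid results, "
      f"about {average_remaining:.2f} each")
```

So roughly a tenth of the affected hosts account for more than half of the invalid results, which fits the idea of contacting only the worst offenders.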

ID: 26516
Richard Haselgrove
Joined: 27 Oct 07
Posts: 182
Credit: 3,115,075
RAC: 4,516
Message 26517 - Posted: 22 May 2014, 16:38:24 UTC - in response to Message 26516.  

1) is certainly a BOINC back-end problem, not the fault of either your application or our clients. Outliers would be good: the other steps you've taken already (provided you keep rsc_fpops_est friendly) should keep it under control for now. Just remember that it goes back to square one with each new version you release. I'm going to write to the dev list with some lessons learned (not least, the lack of server documentation).

2) I don't know either, so I'll be interested to hear what answers you get.

3) I hope you will (or maybe you could get a student or intern to) send a PM to the worst offenders, politely saying "Please clean your machine up, or take your cycles elsewhere" - or words to that effect. And if that fails, there is a facility for blacklisting a rogue host. CPDN use that combination of procedures (carrot and stick) to some good effect.
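
On point 1, a minimal sketch of why a poor speed estimate can trigger EXIT_TIME_LIMIT_EXCEEDED. It assumes the client's elapsed-time limit is the task's rsc_fpops_bound divided by the speed (flops) estimated for the application version; rsc_fpops_bound is not mentioned in the thread, and all numbers and function names below are hypothetical.

```python
def estimated_runtime(rsc_fpops_est: float, flops: float) -> float:
    """Predicted runtime in seconds for a task of the given size."""
    return rsc_fpops_est / flops


def elapsed_time_limit(rsc_fpops_bound: float, flops: float) -> float:
    """Elapsed seconds after which the task would be aborted (assumed rule)."""
    return rsc_fpops_bound / flops


def exceeds_time_limit(elapsed: float, rsc_fpops_bound: float, flops: float) -> bool:
    return elapsed > elapsed_time_limit(rsc_fpops_bound, flops)


# Hypothetical task: the bound is 10x the estimate.
rsc_fpops_est = 1.0e13
rsc_fpops_bound = 1.0e14

realistic_flops = 2.0e9    # what the host really delivers
inflated_flops = 2.0e11    # a speed estimate that is 100x too high

# The task really takes this long on the host.
actual_runtime = estimated_runtime(rsc_fpops_est, realistic_flops)

ok_with_good_estimate = not exceeds_time_limit(
    actual_runtime, rsc_fpops_bound, realistic_flops)
aborted_with_bad_estimate = exceeds_time_limit(
    actual_runtime, rsc_fpops_bound, inflated_flops)

print(actual_runtime, ok_with_good_estimate, aborted_with_bad_estimate)
```

With the inflated estimate the limit shrinks to 500 seconds while the task still needs 5000, so a perfectly healthy task gets killed. That would also be consistent with the problem returning when a new version starts with fresh estimates.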
ID: 26517
Eric Mcintosh
Message 26518 - Posted: 22 May 2014, 17:56:42 UTC - in response to Message 26517.  

As usual many thanks Richard; maybe I am on the right lines. :-) Eric.
P.S. Working for CERN's 60th Anniversary this weekend, perhaps the
last chance to go underground and visit the experiments for some time.
ID: 26518
[TA]Assimilator1
Joined: 29 Nov 13
Posts: 46
Credit: 1,300,104
RAC: 5,380
Message 26519 - Posted: 22 May 2014, 19:10:16 UTC
Last modified: 22 May 2014, 19:27:17 UTC

Re no. 3, not my old overclocked rig :)

If any of them are on Team AnandTech I could try PMing them.
But short of going through every single active user on my team, is there a quicker way I can see if any invalids are coming from TA?

[edit] Never mind, there are only 8 active on our team! I could only check 6 as 2 were 'hidden'. No invalids, but 1 Linux user had a single errored WU.
Could you post the list of people with invalids?
If not, let me know if any are TA members (of the unknown 2, I'm pretty sure 1 of them would have spotted any invalids btw - have now PMed 1).
Team AnandTech - SETI@H, Muon1 DPAD, F@H, MW@H, A@H, LHC@H, POGS, R@H, E@H.

Main rig - i7 4820k @3.9 GHz, 16 GB DDR3 1866, HD 7950 3GB, Win 7 64bit
2nd rig - Q9550 @3.6 GHz, 4GB DDR2 1066, HD 5850, Win 7 64bit
ID: 26519
Eric Mcintosh
Message 26520 - Posted: 23 May 2014, 0:38:53 UTC - in response to Message 26519.  

I don't really want to name (and shame!). I am working on a
script to e-mail the owner of the host[id]s concerned. Eric.
ID: 26520
Ananas
Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26521 - Posted: 23 May 2014, 1:00:44 UTC

I have inconclusive results on 3 hosts, only one of them is mildly OCed.

Actually several of them are invalid, as the two other results have already been validated. Mine are stuck at "inconclusive" - the transitioner (I think that's the one that is supposed to do it) "forgot" to switch the state to "invalid".

All invalids (without exception) have one thing in common: the results ran less than a minute on my boxes (credit range 0.04 - 0.06). For the valid partners, the runtimes vary.
ID: 26521
Eric Mcintosh
Message 26522 - Posted: 23 May 2014, 6:03:05 UTC - in response to Message 26521.  

Hi Ananas; I have a test suite, but it is really set up for
Linux and/or Cygwin. When I can, I shall set up
something for the Windows command line.

I shall have a look at "unsticking". Eric.
ID: 26522
Ananas
Message 26523 - Posted: 23 May 2014, 8:25:18 UTC - in response to Message 26522.  
Last modified: 23 May 2014, 9:01:10 UTC

...
I shall have a look at "unsticking". Eric.

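An attempt to fix the two database problems I know about:

-- 1) Remove orphaned results whose workunit no longer exists.
DELETE FROM result
  WHERE NOT EXISTS (
    SELECT workunit.id
      FROM workunit
      WHERE workunit.id = result.workunitid
  );

-- 2) Flip stuck results to "invalid" once their workunit has a
--    canonical result and no validation is pending.
--    Caution: 2 (invalid) and 4 (inconclusive) are validate_state codes in
--    the BOINC schema; check whether validate_state, not outcome, is meant.
UPDATE result SET outcome = 2
  WHERE outcome = 4
  AND 0 < (
    SELECT workunit.canonical_resultid
      FROM workunit
      WHERE workunit.id = result.workunitid
      AND workunit.need_validate = 0
  );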


I'm not totally sure that the second one does what we want it to do so better "SELECT" (and check a few samples) before "UPDATE".
If MySQL treats "NULL > 0" as true, it will not work properly.
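
The "SELECT before UPDATE" check can be rehearsed on a throwaway database first. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with a table layout trimmed to the columns the query touches and rows invented for illustration; it is not the real BOINC schema or data. It also shows what a comparison against NULL returns in this stand-in.

```python
import sqlite3

# In-memory stand-in for the project database (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workunit (id INTEGER PRIMARY KEY,
                           canonical_resultid INTEGER,
                           need_validate INTEGER);
    CREATE TABLE result (id INTEGER PRIMARY KEY,
                         workunitid INTEGER,
                         outcome INTEGER);
    INSERT INTO workunit VALUES (1, 101, 0);  -- validated, has a canonical result
    INSERT INTO workunit VALUES (2, 0, 1);    -- still waiting for validation
    INSERT INTO result VALUES (101, 1, 1);    -- the canonical result itself
    INSERT INTO result VALUES (102, 1, 4);    -- stuck result, should be picked up
    INSERT INTO result VALUES (201, 2, 4);    -- workunit not finished yet
    INSERT INTO result VALUES (301, 3, 4);    -- orphan: workunit 3 does not exist
""")

# Same WHERE clause as the UPDATE, but only listing the rows it would touch.
PREVIEW = """
    SELECT result.id FROM result
      WHERE outcome = 4
      AND 0 < (
        SELECT workunit.canonical_resultid
          FROM workunit
          WHERE workunit.id = result.workunitid
          AND workunit.need_validate = 0
      )
      ORDER BY result.id
"""
would_update = [row[0] for row in conn.execute(PREVIEW)]

# A comparison with NULL yields NULL (None in Python), not true.
null_comparison = conn.execute("SELECT 0 < NULL").fetchone()[0]

print(would_update, null_comparison)
```

In this stand-in, rows whose subquery finds no matching workunit compare against NULL and are simply skipped, so only the genuinely stuck result is listed. MySQL should be checked the same way on a copy of the real tables before running the UPDATE.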

It might also be necessary to manipulate result.server_state too, unfortunately there is not a single status attribute for the result.

If it becomes too complex ... all those results will sooner or later become orphaned so the "DELETE" SQL statement will catch them ;-)

#################################

p.s.: There are two types of invalid results. Some run only seconds and return an incomplete stdout (host 10137504 returns tons of those)
ID: 26523
Eric Mcintosh
Message 26525 - Posted: 23 May 2014, 15:34:13 UTC - in response to Message 26523.  

Hi Ananas; I had a look at Host 10137504. Looks OK now
with 26 valid results. I think these inconclusives are part
of a hangover, but indeed there were a lot of them, and
they are still there. Eric.
ID: 26525
Ananas
Message 26528 - Posted: 23 May 2014, 18:24:03 UTC - in response to Message 26525.  
Last modified: 23 May 2014, 18:33:46 UTC

... Looks OK now ...

I don't think so. Check the pending ones: all the results with a CPU time of less than a second seem to be damaged (stderr contains nothing but the core client version). There are 1500+ damaged ones pending from today (and maybe 10 undamaged ones).

They just don't occur in the "inconclusive" list yet because the wingmen haven't returned their share yet and the validator hasn't touched them yet.

Maybe a heat issue: it is a quad-core laptop with hyperthreading. The other hosts of the same user do not show any similar issues, so it is most likely not a virus scanner that interferes with the application.
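
The "CPU time under a second" symptom lends itself to a simple screen. The following is only a sketch: the record layout, thresholds, and host IDs are hypothetical, and a short CPU time is a hint rather than proof that a result is damaged.

```python
from collections import Counter


def suspicious_hosts(results, max_cpu_seconds=1.0, min_flagged=3):
    """Count near-instant results per host and keep hosts with many of them.

    `results` is an iterable of (host_id, cpu_seconds) pairs.
    """
    flagged = Counter(
        host_id for host_id, cpu_seconds in results
        if cpu_seconds < max_cpu_seconds
    )
    return {host: n for host, n in flagged.items() if n >= min_flagged}


# Made-up sample: host 1001 returns mostly sub-second results.
sample = [
    (1001, 0.4), (1001, 0.2), (1001, 0.9), (1001, 3600.0),
    (1002, 4200.0), (1002, 0.5),
    (1003, 5100.0),
]

hosts = suspicious_hosts(sample)
print(hosts)
```

Requiring several short results per host (min_flagged) avoids flagging a machine for a single legitimately short task.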
ID: 26528
Eric Mcintosh
Message 26530 - Posted: 23 May 2014, 19:19:35 UTC - in response to Message 26528.  

NOT OK then I'll watch. Eric.
ID: 26530
[TA]Assimilator1
Message 26547 - Posted: 25 May 2014, 0:50:49 UTC - in response to Message 26530.  
Last modified: 25 May 2014, 0:51:50 UTC

No probs Eric :), thought that might be the case.

Oh btw, 1 of the 2 users with hidden PCs PMed me back, no invalids :).
ID: 26547
Eric Mcintosh
Message 26553 - Posted: 25 May 2014, 12:21:21 UTC - in response to Message 26547.  

Thanks; you will get an answer. My colleagues are
helping set up a procedure for host/owner
identification and e-mailing appropriate messages.
Eric.
ID: 26553
[TA]Assimilator1
Message 26554 - Posted: 25 May 2014, 15:48:33 UTC - in response to Message 26553.  

Nice work! :)
ID: 26554
Errabee
Joined: 27 Oct 04
Posts: 3
Credit: 99,577
RAC: 0
Message 26690 - Posted: 18 Jul 2014, 23:27:09 UTC - in response to Message 26516.  
Last modified: 18 Jul 2014, 23:29:51 UTC

Hi Ananas; I had a look at Host 10137504. Looks OK now
with 26 valid results. I think these inconclusives are part
of a hangover, but indeed there were a lot of them, and
they are still there. Eric.


Only 46 valid results with over 34,000 inconclusive (and I trust ultimately invalid) results right now. I don't think this is a hangover, this is a DT.
ID: 26690
[TA]Assimilator1
Message 26697 - Posted: 19 Jul 2014, 13:02:38 UTC - in response to Message 26690.  

I hadn't realised this was the same host giving invalids from 2 months ago!

Jesus the owner needs to that host out!
ID: 26697
[TA]Assimilator1
Message 26705 - Posted: 21 Jul 2014, 17:22:16 UTC - in response to Message 26697.  

*sort
ID: 26705
Code11
Joined: 26 Jan 09
Posts: 2
Credit: 225,060
RAC: 99
Message 26782 - Posted: 3 Oct 2014, 16:19:29 UTC

Hello,

I'm posting here because I'm getting "Computation Errors" in most LHC tasks.

I do have an overclocked computer, and I have done a lot of tests lately to find out whether the overclock is the cause.

I have run tests with LINPACK, Prime95, Memtest x86, Memtest x64, and the Windows memory test.

None of them gave me an error with the current overclock.

I also compute for World Community Grid and other projects without errors.


Can anyone advise?

Best regards
ID: 26782
USTL-FIL (Lille Fr)
Joined: 11 Dec 09
Posts: 21
Credit: 82,177,343
RAC: 313,238
Message 26783 - Posted: 3 Oct 2014, 16:25:36 UTC - in response to Message 26782.  
Last modified: 3 Oct 2014, 16:26:01 UTC

ID: 26783
Ken Beishir
Joined: 11 Dec 05
Posts: 4
Credit: 945,618
RAC: 180
Message 26784 - Posted: 3 Oct 2014, 16:26:56 UTC - in response to Message 26782.  

I am also seeing a lot of errors (example):

Name: w-b3_-22000_job.HLLHC_b3_-22000.0732__11__s__62.31_60.32__11_13__5__44.1178_1_sixvf_boinc3935_2
Workunit: 21232197
Created: 3 Oct 2014, 11:42:05 UTC
Sent: 3 Oct 2014, 13:37:22 UTC
Received: 3 Oct 2014, 16:22:06 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED
Computer ID: 10315741
Report deadline: 11 Oct 2014, 5:09:36 UTC
Run time: 4,885.56
CPU time: 4,349.03
Validate state: Invalid
Credit: 0.00
Application version: SixTrack v451.07 (pni)
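
When collecting reports like this one, the exit status line is the part worth pulling out automatically. A small sketch, assuming the line always has the shape "decimal code, hex code in parentheses, symbolic name" seen in this report; the function name and regular expression are illustrative only.

```python
import re

# Matches e.g. "Exit status 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED",
# with or without a colon after the field name.
EXIT_STATUS_RE = re.compile(
    r"^Exit status:?\s+(\d+)\s+\((0x[0-9a-fA-F]+)\)\s+(\w+)$"
)


def parse_exit_status(line):
    """Return (code, name) for an exit status line, or None for other lines."""
    match = EXIT_STATUS_RE.match(line.strip())
    if match is None:
        return None
    code = int(match.group(1))
    hex_code = int(match.group(2), 16)
    if code != hex_code:
        raise ValueError(f"decimal and hex codes disagree: {line!r}")
    return code, match.group(3)


parsed = parse_exit_status("Exit status 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED")
print(parsed)
```

Here the task was stopped for exceeding its disk limit, which is a different failure from the EXIT_TIME_LIMIT_EXCEEDED discussed at the start of the thread.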


ID: 26784


©2018 CERN