Three Problems, 22nd May.
Author | Message |
---|---|
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Settling down a bit; I am seeing around 2% WU failures. Problem 1: EXIT_TIME_LIMIT_EXCEEDED. I have tried to minimise this and will hopefully implement "outliers" to avoid it in future. Problem 2: Can't Create Process; I will look for help on this. Probably connected with our build, but we shall see. Problem 3: Found 545 invalid results involving 124 hosts. One invalid result was duplicated! But I am not going to run everything 3 times. Can live with this. The top 12 culprits gave 77, 45, 26, 25, 22, 21, 19, 16, 14, 11, 10 and 9 invalid results respectively. (I thought we stopped using hosts with this many errors...) Seems to be hardware, overclocking, cosmic rays? Getting a lot of production done successfully. Eric. |
Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
1) is certainly a BOINC back-end problem, not the fault of either your application or our clients. Outliers would be good: the other steps you've taken already (provided you keep rsc_fpops_est friendly) should keep it under control for now. Just remember that it goes back to square one with each new version you release. I'm going to write to the dev list with some lessons learned (not least, the lack of server documentation). 2) I don't know either, so I'll be interested to hear what answers you get. 3) I hope you will (or maybe you could get a student or intern to) send a PM to the worst offenders, politely saying "Please clean your machine up, or take your cycles elsewhere" - or words to that effect. And if that fails, there is a facility for blacklisting a rogue host. CPDN use that combination of procedures (carrot and stick) to some good effect. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
As usual many thanks Richard; maybe I am on the right lines. :-) Eric. P.S. Working for CERN's 60th Anniversary this weekend, perhaps the last chance to go underground and visit the experiments for some time. |
Send message Joined: 29 Nov 13 Posts: 59 Credit: 4,012,100 RAC: 0 |
Re no. 3, not my old overclocked rig :) If any of them are on Team AnandTech I could try PMing them. But short of going through every single active user on my team, is there a quicker way I can see if any invalids are coming from TA? [edit] Never mind, there are only 8 active on our team! I could only check 6, as 2 were 'hidden'. No invalids, but 1 Linux user had a single errored WU. Could you post the list of people with invalids? If not, let me know if any are TA members (of the unknown 2, I'm pretty sure 1 of them would have spotted any invalids, btw; I have now PMed 1). Team AnandTech - WCG, Uni@H, F@H, MW@H, Ast@H, LHC@H, R@H, CPDN, E@H. Main rig - Ryzen 3600, MSI B450 Gm Pro C AC, 32GB DDR4 3200, RTX 3060 Ti 8GB, Win10 64bit 2nd rig - i7 4930k @4.1 GHz, 16 GB DDR3 1866, HD 7870XT 3GB(DS), Win7 64 |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
I don't really want to name (and shame!). I am working on a script to e-mail the owner of the host[id]s concerned. Eric. |
Send message Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0 |
I have inconclusive results on 3 hosts; only one of them is mildly OCed. Actually several of them are invalid, as the two other results have already been validated. Mine stick at "inconclusive": the transitioner (I think that's the one that is supposed to do it) "forgot" to switch the state to "invalid". All invalids (without exception) have one thing in common: the results ran for less than a minute on my boxes (credit range 0.04 - 0.06). For the valid partners, the runtimes vary. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Hi Ananas; I have a test suite, but it is really set up for Linux and/or Cygwin. When I can, I shall set up something for the Windows command line. I shall have a look at "unsticking". Eric. |
Send message Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0 |
... An attempt to fix the two database problems I know about: DELETE FROM result WHERE NOT EXISTS ( SELECT workunit.id FROM workunit WHERE workunit.id = result.workunitid ); UPDATE result SET outcome = 2 WHERE outcome = 4 AND 0 < ( SELECT workunit.canonical_resultid FROM workunit WHERE workunit.id = result.workunitid AND workunit.need_validate = 0 ); I'm not totally sure that the second one does what we want it to do, so better "SELECT" (and check a few samples) before "UPDATE". If MySQL handles NULL > 0, it will not work properly. It might also be necessary to manipulate result.server_state too; unfortunately there is not a single status attribute for the result. If it becomes too complex ... all those results will sooner or later become orphaned, so the "DELETE" SQL statement will catch them ;-) P.S.: There are two types of invalid results. Some run only seconds and return an incomplete stdout (host 10137504 returns tons of those) |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Hi Ananas; I had a look at Host 10137504. Looks OK now with 26 valid results. I think these inconclusive results are part of a hangover, but indeed there were a lot of them, and they are still there. Eric. |
Send message Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0 |
"... Looks OK now ..." I don't think so. Check the pending ones: all the results with a CPU time of less than a second seem to be damaged (stderr contains nothing but the core client version), with 1500+ damaged ones pending from today (and maybe 10 undamaged ones). They just don't appear in the "inconclusive" list yet because the wingmen haven't returned their share and the validator hasn't touched them. Maybe a heat issue: it is a quad-core laptop with hyperthreading. The other hosts of the same user do not show any similar issues, so it is most likely not a virus scanner interfering with the application. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
NOT OK then; I'll watch. Eric. |
Send message Joined: 29 Nov 13 Posts: 59 Credit: 4,012,100 RAC: 0 |
No probs Eric :), thought that might be the case. Oh btw, 1 of the 2 users with hidden PCs PMed me back, no invalids :). |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks; you will get an answer. My colleagues are helping set up a procedure for host/owner identification and e-mailing appropriate messages. Eric. |
Send message Joined: 29 Nov 13 Posts: 59 Credit: 4,012,100 RAC: 0 |
Nice work! :) |
Send message Joined: 27 Oct 04 Posts: 3 Credit: 99,577 RAC: 0 |
"Hi Ananas; I had a look at Host 10137504. Looks OK now" Only 46 valid results, with over 34,000 inconclusive (and I trust ultimately invalid) results right now. I don't think this is a hangover; this is a DT. |
Send message Joined: 29 Nov 13 Posts: 59 Credit: 4,012,100 RAC: 0 |
I hadn't realised this was the same host giving invalids from 2 mths ago! Jesus the owner needs to that host out! |
Send message Joined: 29 Nov 13 Posts: 59 Credit: 4,012,100 RAC: 0 |
*sort |
Send message Joined: 26 Jan 09 Posts: 2 Credit: 243,715 RAC: 0 |
Hello, I'm posting here because I'm getting "Computation Errors" in most LHC tasks. I do have an OC computer, and I have done a lot of tests lately to find out whether it was from the overclock. I have run tests with LINPACK, Prime95, Memtest x86, Memtest x64 and the Windows memory test; none gave me an error with the current OC. I also compute for World Community Grid and others without errors. Can anyone advise? Best regards |
Send message Joined: 11 Dec 05 Posts: 4 Credit: 1,077,763 RAC: 0 |
I am also seeing a lot of errors (example): Name: w-b3_-22000_job.HLLHC_b3_-22000.0732__11__s__62.31_60.32__11_13__5__44.1178_1_sixvf_boinc3935_2; Workunit: 21232197; Created: 3 Oct 2014, 11:42:05 UTC; Sent: 3 Oct 2014, 13:37:22 UTC; Received: 3 Oct 2014, 16:22:06 UTC; Server state: Over; Outcome: Computation error; Client state: Compute error; Exit status: 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED; Computer ID: 10315741; Report deadline: 11 Oct 2014, 5:09:36 UTC; Run time: 4,885.56; CPU time: 4,349.03; Validate state: Invalid; Credit: 0.00; Application version: SixTrack v451.07 (pni) |
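
Editor's sketch, following up on the advice earlier in the thread to "SELECT" and check a few samples before running the "UPDATE" or the "DELETE". The Python below is only an illustration under stated assumptions: it uses an in-memory SQLite database rather than the project's MySQL server, the two tables are cut down to the columns named in the post (the real tables have many more), and the sample rows, the helper names `build_demo_db`, `preview_stuck_results` and `preview_orphans` are all made up for the demo. It turns each statement into a read-only query so the affected rows can be inspected first.

```python
import sqlite3

# Read-only version of the UPDATE from the thread: same WHERE clause,
# but it only lists the rows that would be changed.
PREVIEW_STUCK_SQL = """
SELECT result.id, result.workunitid, result.outcome
FROM result
WHERE result.outcome = 4
  AND 0 < (
      SELECT workunit.canonical_resultid
      FROM workunit
      WHERE workunit.id = result.workunitid
        AND workunit.need_validate = 0
  )
ORDER BY result.id
"""

# Read-only version of the DELETE: results whose workunit no longer exists.
PREVIEW_ORPHANS_SQL = """
SELECT result.id
FROM result
WHERE NOT EXISTS (
    SELECT workunit.id FROM workunit WHERE workunit.id = result.workunitid
)
ORDER BY result.id
"""


def build_demo_db():
    """Create a tiny stand-in database (hypothetical rows, minimal columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE workunit (
            id INTEGER PRIMARY KEY,
            canonical_resultid INTEGER,
            need_validate INTEGER
        );
        CREATE TABLE result (
            id INTEGER PRIMARY KEY,
            workunitid INTEGER,
            outcome INTEGER
        );
        INSERT INTO workunit VALUES (1, 101, 0);   -- canonical result chosen
        INSERT INTO workunit VALUES (2, NULL, 0);  -- canonical_resultid is NULL
        INSERT INTO workunit VALUES (3, 0, 1);     -- still needs validation
        INSERT INTO result VALUES (101, 1, 1);
        INSERT INTO result VALUES (102, 1, 4);     -- candidate for the UPDATE
        INSERT INTO result VALUES (201, 2, 4);
        INSERT INTO result VALUES (301, 3, 4);
        INSERT INTO result VALUES (901, 99, 4);    -- workunit 99 does not exist
    """)
    return conn


def preview_stuck_results(conn):
    """Rows the UPDATE would touch, as (id, workunitid, outcome) tuples."""
    return conn.execute(PREVIEW_STUCK_SQL).fetchall()


def preview_orphans(conn):
    """Ids of results the DELETE would remove."""
    return [row[0] for row in conn.execute(PREVIEW_ORPHANS_SQL)]


conn = build_demo_db()
print("would update:", preview_stuck_results(conn))
print("would delete:", preview_orphans(conn))
```

In this SQLite demo the comparison `0 < NULL` is not true, so a result whose workunit has a NULL `canonical_resultid`, or has no matching workunit row at all, is left out of the first list. Whether the production MySQL server and the real data behave the same way is exactly what a preview run on a few samples would confirm.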