Host messing up tons of results
Author | Message |
---|---|
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
Eric, you are always posting two times. Why? (I am old too). Tullio |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
There is no way I know of; I await my colleagues to fix the null result problem, adjust outliers, and perhaps ban the host. My bad WUs should be out of the way soon and we shall be back to "normal". Eric. |
Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
There is no way I know of; I await my colleagues to fix "To blacklist a host, set its max_results_day field to -1." |
Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
Tullio, the forum page is very slow sometimes. When you press "post reply" it takes a while with no movement, so anyone will think "oh, I missed that button" and click again, I bet. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Because the server is so slow as others have remarked! Eric. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Great; except I don't have the tool nor the permission. I'll get it done soonest though. Thanks a million. Eric. |
Send message Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Great; except i don't have the tool nor the permission. At least it gives you a better idea of the message to pass to Cerberus! |
Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
Same host 9996388 spoiled 3 tasks of 3 today again; all tasks lasted 1-2 seconds based on their statistics. As it is definitely contactable and alive, could we either send a message to the owner or block it from any other tasks and remove all existing ones? I see 17k tasks: http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388 State: All (17205) · In progress (34) · Validation pending (4266) · Validation inconclusive (4662) · Valid (3220) · Invalid (4596) · Error (427) Application: All (17205) · SixTrack (17205) · sixtracktest (0) |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
OK; host has been set with max_results_day -1. |
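For readers wondering what that setting amounts to, here is a minimal sketch, assuming the BOINC `host` table has an `id` column alongside the `max_results_day` field quoted earlier in the thread. An in-memory SQLite table stands in for the project's real database, and the helper name `blacklist_host` is hypothetical, so treat this as an illustration rather than the admins' actual procedure.

```python
import sqlite3

def blacklist_host(conn, host_id):
    """Set max_results_day to -1 for one host; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE host SET max_results_day = -1 WHERE id = ?", (host_id,)
    )
    conn.commit()
    return cur.rowcount

# Stand-in database (assumption: the real server uses its own database, not SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, max_results_day INTEGER)")
conn.executemany(
    "INSERT INTO host VALUES (?, ?)",
    [(9996388, 100), (10298282, 100)],  # quota value 100 is made up for the demo
)

changed = blacklist_host(conn, 9996388)
quota = dict(conn.execute("SELECT id, max_results_day FROM host"))
print(changed, quota)
```

Using a parameterised query keeps the host ID out of the SQL string, which matters if the list of hosts to ban is ever fed in from a script.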
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
I have now asked for another dozen or so hosts to be banned; they all produce more invalids than valids. Eric. (Trying to lock the stable door after the horses have bolted!) |
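The "more invalids than valids" criterion could be checked automatically along these lines. This is only a sketch: the `HostStats` type and the function name are hypothetical, the first host's counts are the ones quoted earlier in the thread, and the second host is made up for contrast.

```python
from dataclasses import dataclass

@dataclass
class HostStats:
    host_id: int
    valid: int
    invalid: int

def more_invalid_than_valid(stats):
    """Return the IDs of hosts whose invalid count exceeds their valid count."""
    return [h.host_id for h in stats if h.invalid > h.valid]

hosts = [
    HostStats(9996388, valid=3220, invalid=4596),  # counts quoted in the thread
    HostStats(1, valid=500, invalid=3),            # made-up healthy host
]

flagged = more_invalid_than_valid(hosts)
print(flagged)
```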
Send message Joined: 9 Jan 08 Posts: 66 Credit: 727,923 RAC: 0 |
That sounds great. That will give a more steady throughput of full results you can use. |
Send message Joined: 29 Nov 13 Posts: 59 Credit: 4,012,100 RAC: 0 |
I have now asked for another dozen or so hosts to be banned; Better late than never ;) Team AnandTech - WCG, Uni@H, F@H, MW@H, Ast@H, LHC@H, R@H, CPDN, E@H. Main rig - Ryzen 3600, MSI B450 Gm Pro C AC, 32GB DDR4 3200, RTX 3060 Ti 8GB, Win10 64bit 2nd rig - i7 4930k @4.1 GHz, 16 GB DDR3 1866, HD 7870XT 3GB(DS), Win7 64 |
Send message Joined: 26 Jul 05 Posts: 63 Credit: 4,083,755 RAC: 0 |
I have already checked and he is still suspended....... I don't know the BOINC server nomenclature, but it seems you have banned him from using the forums rather than from doing more tasks, as shown here. The mechanism that shows on that page allows the server to restrict (right down to 1 job/day) a host until it starts returning useful results. Sorry I don't know the full details, hope someone else can help. [edit]I recall there's a flag called reliable host that might help[/edit] |
Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
Let's see if the host ban works first. For now we have these numbers for that famous 9996388: http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388 I'd expect to see all zeros in the end, I think.
State: All (15847) · In progress (34) · Validation pending (3284) · Validation inconclusive (4968) · Valid (3164) · Invalid (4014) · Error (383)
Application: All (15847) · SixTrack (15847) · sixtracktest (0)
Some 1k tasks went away, but we need to see them all cleared completely to make sure this ban works right. And this is the state for DOWNLOADED tasks:
Task; Work unit; Sent; Time reported or deadline; Status; Run time; CPU time; Credit; Application
64841129; 30742048; 7 Apr 2015, 11:44:07 UTC; 7 Apr 2015, 11:46:43 UTC; Error while downloading; 0.00; 0.00; ---; SixTrack v451.07
64840860; 30741914; 7 Apr 2015, 11:41:41 UTC; 7 Apr 2015, 11:43:32 UTC; Error while downloading; 0.00; 0.00; ---; SixTrack v451.07
64840890; 30741929; 7 Apr 2015, 11:39:36 UTC; 7 Apr 2015, 11:41:41 UTC; Error while downloading; 0.00; 0.00; ---; SixTrack v451.07
64840544; 30741756; 7 Apr 2015, 11:37:25 UTC; 7 Apr 2015, 11:39:01 UTC; Error while downloading; 0.00; 0.00; ---; SixTrack v451.07
64839711; 30741478; 7 Apr 2015, 11:28:40 UTC; 7 Apr 2015, 11:31:45 UTC; Error while downloading; 0.00; 0.00; ---; SixTrack v451.07 (sse2)
64839967; 30741564; 7 Apr 2015, 11:28:08 UTC; 7 Apr 2015, 11:31:45 UTC; Error while downloading; 0.00; 0.00; ---; SixTrack v451.07 (sse2)
64839508; 30741411; 7 Apr 2015, 11:24:32 UTC; 7 Apr 2015, 11:28:08 UTC; Error while downloading; 0.00; 0.00; ---; SixTrack v451.07
64839298; 30741341; 7 Apr 2015, 11:24:19 UTC; 7 Apr 2015, 11:28:08 UTC; Error while downloading; 0.00; 0.00; ---; SixTrack v451.07 |
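To compare the two snapshots of this host's state counts posted in the thread, something like the following sketch could be used. The function name `parse_counts` is hypothetical and the parsing simply assumes the "Name (count)" layout shown on the results page.

```python
import re

def parse_counts(line):
    """Parse 'Name (123) · Other (45)' into {'Name': 123, 'Other': 45}."""
    pairs = re.findall(r"([A-Za-z ]+?)\s*\((\d+)\)", line)
    return {name.strip(): int(count) for name, count in pairs}

# The two "State:" lines posted for host 9996388, earlier and later in the thread.
before = parse_counts(
    "All (17205) · In progress (34) · Validation pending (4266) · "
    "Validation inconclusive (4662) · Valid (3220) · Invalid (4596) · Error (427)"
)
after = parse_counts(
    "All (15847) · In progress (34) · Validation pending (3284) · "
    "Validation inconclusive (4968) · Valid (3164) · Invalid (4014) · Error (383)"
)

# Change per state between the two snapshots (negative means tasks went away).
delta = {state: after[state] - before[state] for state in before}
for state, change in delta.items():
    print(f"{state:<24} {before[state]:>6} -> {after[state]:>6} ({change:+d})")
```

On these numbers the total drops by 1,358, mostly from validation pending (982 fewer) and invalid (582 fewer), while validation inconclusive rises by 306.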
Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
Eric, it would also be worth putting a message about this issue in BOINC Manager Notices, so that every user of the total 120k across the world can check their hosts against your list and take appropriate action. It's unfair to users to just silently block their hosts. We need to explain and make some proposals for how to fix it. Also, a user needs to be able to resume a host in a testing phase to make sure the host is working correctly before returning to full production. I think we need to write a draft of the fix procedure here and share it via the BOINC Manager Notice board (which was developed for that, right?). It might be worth establishing a new sticky thread, or leaving this one ONLY with host IDs and the procedure to fix or resolve, with limited ability for the rest of us to comment, so that only admins fill it up and it is easy to get through the lists of hosts. Also, for all hosts involved, I think we need to establish special empty/limited training/testing tasks, so that a host has to return conclusive results before you assign it back to production tasks. These training tasks in limited numbers could be deployed to willing users, so I'm ready to sacrifice some of my machine time to make correlative results. We are talking here about saving Eric's time, so that he doesn't spend it fixing tons of small issues if that can be automated. I remember Eric mentioned 4 classes of problems to resolve, so I think we may help where we can. |
Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
Here is a draft of procedures and steps to resolve hosts returning inconclusive results for LHC@home.
Possible reasons why a host returns inconclusive results:
1. Not the latest version of BOINC Manager. Please check and install the latest from http://boinc.berkeley.edu/download
2. The host has issues with overheating of the CPU, case or video card. To resolve, please:
- try to set default CPU settings
- check that the CPU fan works correctly
- set an alarm warning for a certain CPU temperature limit like 60C or 70C
- make sure the case fans are blowing enough air and the temperature in the case is within the limits.
3. Memory issues. Please remove some memory, replace modules, change the timing settings. Set it to default settings.
4. Video card issues. Please have your drivers updated, check the onboard fan, make sure it has enough room for air circulation.
5. Power supply issues. The power supply might not be powerful enough to keep the CPU at a stable 100%, and in conjunction with other BOINC tasks/projects the CPU and GPU together might draw too much power, so the power supply fails.
- Please try a different power unit,
- please reduce the approved CPU percentage limits for all projects. This change can be made both on the BOINC Manager side and on each project's configuration page.
6. Project reset. Reinstall BOINC Manager, reset or remove the project and add it again.
7. Disk space. Check that your computer has enough disk storage room for projects to run. Different BOINC projects require plenty of disk space, and any other project could easily fill the whole available space so that LHC@home is left with nothing.
8. Other CPU- and GPU-consuming tasks. Please make sure no other CPU- and GPU-hungry tasks are running simultaneously. Try to avoid heavy load on the CPU and GPU while crunching tasks for the LHC project.
9. Make sure no viruses or malware/adware are active. Please use the latest antivirus and have it fully updated to the latest signatures. Use freeware like AVP, AVAST, Avira, etc. Use Malwarebytes for malware. Use Microsoft Security Essentials for Windows. Check cnet.com and filehippo.com for the latest software.
10. Try to suspend other projects while test crunching for LHC@home. Change CPU settings, change the share for multiprocessor systems: instead of 0 put some specific share like 30% or 50% or 80% and see the output.
11. ==========
Please add any other ideas and feel free to correct any of the proposed steps in this draft. |
Send message Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
As a side thought, let's take a look at this: http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=30042916
Task; Computer; Sent; Time reported or deadline; Status; Run time (sec); CPU time (sec); Credit; Application
63303862; 10342642; 28 Mar 2015 UTC; 28 Mar 2015, 4:23:43 UTC; Error while computing; 0.00; 0.00; ---; SixTrack v451.07
63303863; 10287799; 28 Mar 2015 UTC; 4 Apr 2015, 16:49:49 UTC; Timed out - no response; 0.00; 0.00; ---; SixTrack v451.07 (sse2)
63853277; 10298282; 31 Mar 2015, 4:27:24 UTC; 1 Apr 2015, 8:06:20 UTC; Completed, validation inconclusive; 43,089.33; 42,663.87; pending; SixTrack v451.07 (sse2)
64614634; 9996388; 5 Apr 2015, 8:47:28 UTC; 5 Apr 2015, 8:49:39 UTC; Completed, validation inconclusive; 2.25; 0.30; pending; SixTrack v451.07 (pni)
64872554; 10343493; 7 Apr 2015, 19:24:12 UTC; 15 Apr 2015, 10:56:26 UTC; In progress; ---; ---; ---; SixTrack v451.07 (sse2)
We see zeros, different versions and timeouts: the full set of all possible issues. Could it be an x86 and 64-bit compatibility issue? Could it be an sse2 and pni compatibility issue? |
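One way to spot the odd result in a work unit like this is to compare CPU times among the completed results and flag any that are implausibly short next to the longest one. The sketch below uses the rows from this work unit; the function name and the 1% threshold are assumptions chosen for illustration, not anything the project's validator is known to do.

```python
def suspiciously_short(results, ratio=0.01):
    """Return host IDs whose completed result used under `ratio` of the
    longest completed CPU time in the same work unit."""
    completed = [r for r in results if r["status"].startswith("Completed")]
    if len(completed) < 2:
        return []  # nothing to compare against
    longest = max(r["cpu_time"] for r in completed)
    return [r["host_id"] for r in completed if r["cpu_time"] < ratio * longest]

# Rows from work unit 30042916 as posted above (cpu_time in seconds).
work_unit = [
    {"host_id": 10342642, "status": "Error while computing", "cpu_time": 0.00},
    {"host_id": 10287799, "status": "Timed out - no response", "cpu_time": 0.00},
    {"host_id": 10298282, "status": "Completed, validation inconclusive", "cpu_time": 42663.87},
    {"host_id": 9996388, "status": "Completed, validation inconclusive", "cpu_time": 0.30},
    {"host_id": 10343493, "status": "In progress", "cpu_time": None},
]

flagged = suspiciously_short(work_unit)
print(flagged)
```

Only results reported as completed are compared, so errors, timeouts and tasks still in progress are ignored rather than treated as short runs.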
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Not a side thought; this is a very clear example of the biggest problem AND it is a genuine production task (not one of my dud wzero tasks). This must be a "new" host as it is not in my list of a dozen or so bad hosts. I do not believe there is a compatibility issue (but it is Windows 8) and I never say never. I am adding the host to the list as it has never produced a valid result! Just 192 failures with 0 CPU seconds. Eric. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
I have added 10342642 and 10343493 to the banned list. (10298282 costa seems to be OK :-). This is ad hoc though. For the time being I will just send another post with my first thoughts. Eric. |
Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
First, as usual, many thanks to all who posted. It is great to have this support. Priorities are RFP, Reliability, Functionality, Performance not forgetting Communication. However, let us not forget that we are getting millions of tasks successfully completed. It must be ten years now since we started but the quality of service has deteriorated in the last year or so, even if the work done has really increased. I am a volunteer too and feel responsible but have no authority. I don't have all permissions like root password (and I don't want them as I make too many mistakes!). I have tried to avoid being too involved in the BOINC infrastructure itself as I am pretty much full time on SixTrack portability, debugging compilers, maintaining the CERN customer SixTrack/SixDesk scripts and utilities and researching the portability issues. I also have to fight the limited CERN infrastructure and remember we don't have a budget either. Still the CERN experiments are perhaps jumping on the bandwagon, so we shall see. Don't forget either we are pretty much unique in getting identical, 0 ULP, results on "any" PC on all flavours of Windows, Linux (and soon Mac again), with different compilers, at all levels of standard compliant optimisation. I agree procedures must be automated as no one has the time to spend all day checking logs etc. Communication and information is vital; must let clients know what is going on and use the boinc_dev etc mailing lists perhaps as Berkeley support from Professor Anderson's team seems to be very very good. "BOINC Manager Notices It's unfair to users just silently block their hosts. We need to explain and make some proposals how to fix it." AGREED. "Also User needs to have an ability to resume host in testing phase to make sure host is in correct action before returning to full production." AGREED, SixTrack is a great test.
"Might worth to establish new sticked thread or leave this one ONLY with host IDs and procedure to fix or resolve for users with limited ability to comment for us all, only admins to fill it up and make easy to get through lists of hosts." AGREED, or something similar. "These training tasks in limited numbers could be deployed to willing users so I'm ready to sacrifice some of my machine time to make correlative results." THANKS again. Now I think it is a question of priorities; when there is too much to do, PRIORITISE. Sorry to waffle, but it will take some time and collaboration to get this sorted. I am a troubleshooter, problem solver, at heart and HATE all the management issues, which I thought I could avoid in retirement. They have to be sorted though. Eric. |
©2025 CERN