Host messing up tons of results
Author | Message |
---|---|
Joined: 4 May 07 Posts: 250 Credit: 826,541 RAC: 0 |
Eric, I saw the following in your recent post... "These training tasks in limited numbers could be deployed to willing users so I'm ready to sacrifice some of my machine time to make correlative results." As you pointed out in our previous exchange of messages, my system doesn't appear to be throwing errors so I'm quite willing to point to a different location and get some "broken" tasks to crunch as a point of reference.
INTEL CORE i7-2600 3.4GHz 16GB RAM NVIDIA GPU My network is Wi-Fi using Linksys WMP600N
|
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
"BOINC Manager Notices" I suspect that a directly targeted email to the host's owner would have more success than a Notice. I think we have to assume, almost by definition, that the owners of these machines don't pay much, if any, attention to BOINC - they may not even be running a version of BOINC which is capable of displaying notices, or the rogue results may be happening on a machine they don't regularly visit. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Good point, and it can be done in a script of course. Eric. |
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
"I think we have to assume, almost by definition, that the owners of these machines don't pay much, if any, attention to BOINC - they may not even be running a version of BOINC which is capable of displaying notices, or the rogue results may be happening on a machine they don't regularly visit." I think both ways work together: via BOINC Manager Notices we invite people to pay attention to their tasks and hosts on a regular basis, and propose ways to deal with possible issues. By targeted emails we inform users that their particular hosts are being disabled. We also inform them about ways to deal with it. |
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
I swear I hit that button once! I swear! Anyway, I think we need to check all our inconclusive tasks to find the culprits and build up a final list of unreliable hosts. Once that is finished, we can see whether it stabilizes the system. OK, say today I had 3 inconclusives, and they all happened again, surprise surprise, with our favorite 9996388. I see its number of tasks has decreased again, down approx. 1k, but the number of inconclusives has risen since the last check from 4968 to 5331. State: All (14485) · In progress (34) · Validation pending (2291) · Validation inconclusive (5331) · Valid (2928) · Invalid (3551) · Error (350) How could that be if the host is blacklisted? The increased number means that the host continues to spoil tasks all across the globe and the blacklist doesn't work, as it still has active tasks and a rising number of inconclusives. My understanding of blacklisting is that the main server has to ignore everything that comes from that host, pretending it DOESN'T exist at all; otherwise we will suffer from its activity till Judgement Day. |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Don't worry about 9996388 - it hasn't received any new tasks since 7 Apr 2015, 11:46:43 UTC (yesterday) - previously it was grabbing new ones every minute. If more inconclusive results are showing now than before, they will be previously 'validation pending' tasks which have now been tested against a wingmate - and found wanting. Simply calling up that 14,000 task list took the server ages - let's hope that when the excessive task lists are finally purged from the system, MySQL will be able to run at normal speed again. I agree that email and Notices are not mutually exclusive ways of getting the message out - by all means use both. |
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
My understanding is that since the host is blacklisted, ALL assigned tasks have to be redeployed to other hosts and recrunched. As I see that my 3 of 3 tasks related to 9996388 are now fresh in the inconclusive list, I assume all these hundreds of new tasks in 9996388's inconclusive list still rely on central. |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
"My understanding is that since the host is blacklisted, ALL assigned tasks have to be redeployed to other hosts and recrunched." No, blacklisting simply prevents new tasks being allocated - as per the mechanism "set max_results_day field to -1". There is supposedly an automatic quota system which - eventually - drags results per day down to 1. But the automatic system allows it to float back up again if tasks validate, and enough of 9996388's tasks validate to allow the work to keep flowing. With the special -1 setting, the validation of existing tasks won't allow quota to be increased automatically. |
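For readers who want to see why the -1 setting behaves differently from the automatic quota, here is a minimal sketch in Python. It is a toy model only, not the BOINC scheduler code: the class name `HostQuota`, the ceiling `project_max`, and the exact adjustment rules (double on a valid result, subtract one on a bad one) are assumptions made for illustration. Only the field name `max_results_day`, the special value -1 and the floor of 1 come from the post above.

```python
from dataclasses import dataclass

BLACKLISTED = -1  # special max_results_day value described in the post above


@dataclass
class HostQuota:
    """Toy model of a per-host daily quota; NOT the real BOINC scheduler."""

    max_results_day: int
    project_max: int = 100  # hypothetical project-wide ceiling

    def can_get_work(self, sent_today: int) -> bool:
        """A blacklisted host never gets work; others are capped per day."""
        if self.max_results_day == BLACKLISTED:
            return False
        return sent_today < self.max_results_day

    def record_result(self, valid: bool) -> None:
        """Adjust the quota after one result has been validated."""
        if self.max_results_day == BLACKLISTED:
            return  # manual setting: validation can no longer raise the quota
        if valid:
            # Assumed rule: a valid result doubles the quota, up to the ceiling.
            self.max_results_day = min(self.project_max, self.max_results_day * 2)
        else:
            # Assumed rule: a bad result lowers the quota by one, never below 1.
            self.max_results_day = max(1, self.max_results_day - 1)
```

In this model a host that returns a mix of valid and bad results keeps floating back up, which matches the behaviour described for 9996388, while a host set to -1 stays blocked no matter how many of its outstanding tasks later validate.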
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
Now I'd like to know: are there any keys in host management to force a host to abandon tasks? Say, this is one of my 30 error tasks: max # of error/total/success tasks: 3, 3, 3; errors: Too many total results. Task 64654399 · Computer 10351263 · Sent 5 Apr 2015, 18:31:35 UTC · Time reported or deadline 6 Apr 2015, 4:31:48 UTC · Status Cancelled by server · Run time 0.00 · CPU time 0.00 · Credit --- · Application SixTrack v451.07 (pni). Task 64654400 · Computer 10298292 · Sent 5 Apr 2015, 18:30:53 UTC · Time reported or deadline 5 Apr 2015, 18:38:23 UTC · Status Cancelled by server · Run time 0.00 · CPU time 0.00 · Credit --- · Application SixTrack v451.07 (sse2). Task 64654401 · Computer 10356797 · Sent 5 Apr 2015, 18:34:12 UTC · Time reported or deadline 5 Apr 2015, 18:37:15 UTC · Status Abandoned · Run time 0.00 · CPU time 0.00 · Credit --- · Application SixTrack v451.07 (pni). So central has the power to either cancel or abandon a task with zero effort. If I see a "Cancelled by server" or "Abandoned" status, it definitely came from central, and therefore it could be applied to a whole bunch of tasks at once. With the host blacklisted, by your terms it wouldn't get any new tasks - voila! So what are these keys to force a host to abandon ALL tasks? |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Sorry, have to rush (supposed to be on vacation today). Also I need to print off all this valuable information. I have set max_results_day to -1 for the worst 20 hosts and checked that. Eric. |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
'Cancelled by server' can be an automatic action - perhaps when a quorum partner returns their result, valid but late: if the replacement task hasn't been issued, or has been issued but not yet started by the recipient, it gets cancelled with no manual intervention. Or, as I suspect in this case, the server operators can cancel a whole batch of WUs because they were configured wrongly or otherwise no longer needed. I believe it's generally easier to cancel all WUs in a batch than to cancel all tasks sent to an individual user or host, but there may be some additional tools made available recently which could help. I'll try to look into those before Eric gets back from his vacation. 'Abandoned' is a different matter entirely. It's supposed to happen when a computer is detached from the project, and then re-attached - but there are indications that there is a deeply-buried bug somewhere in the server code which occasionally throws a whole batch of tasks away while the computer still thinks it's attached and is processing them. But that's one where we need help from a server administrator, rather than a project scientist like Eric. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks Richard (again); it is my wife's birthday! and my son is captain on a flight from Dubai to Geneva, but it is late (not his fault) and so I have a few minutes to spare :-) I am working on the e-mails to clients, having already automated the setting of max_results_day to -1. I fully agree with what you are saying, and yes it would be good to cancel WUs. I may be able to do this with existing tools or at least mysql. More whenever. Eric. |
Joined: 29 Nov 13 Posts: 59 Credit: 4,012,100 RAC: 0 |
Eric This host wasn't silently blocked, Eric emailed him, at least twice I believe, & I sent the guy a PM months ago. But I agree about the BOINC notices, I mentioned that a post or 2 back ;) Team AnandTech - WCG, Uni@H, F@H, MW@H, Ast@H, LHC@H, R@H, CPDN, E@H. Main rig - Ryzen 3600, MSI B450 Gm Pro C AC, 32GB DDR4 3200, RTX 3060 Ti 8GB, Win10 64bit 2nd rig - i7 4930k @4.1 GHz, 16 GB DDR3 1866, HD 7870XT 3GB(DS), Win7 64 |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Sorry men; I searched the WWW but I don't know how to send a BOINC Manager Notice. Do I need permissions? I am generating e-mails to those users affected. Eric. |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
The instruction manual for notices is https://boinc.berkeley.edu/trac/wiki/ProjectNotices It refers first of all to "your gui_urls.xml file": this would place (optional) project web page buttons on the left of our BOINC Manager screens, below the 'Command' buttons. You're not showing any optional buttons at the moment, so this project may not even have a gui_urls.xml file yet (or it may be empty). That would be step one: https://boinc.berkeley.edu/trac/wiki/GuiUrls. After that, Notices appear to be linked to the procedure for putting news items and matching comment threads onto the front page - you have that authority already, so it should be relatively straightforward to find the 'export' button they refer to. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks AGAIN Richard; it looks as if I just need to set the URL stuff and my MB NEWS will be displayed. Eric. |
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
This host looks suspicious to me, as it has a huge ratio of errors to regular results: http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9880631 State: All (46) · In progress (7) · Validation pending (0) · Validation inconclusive (0) · Valid (8) · Invalid (0) · Error (31) Or this one: http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=10297384 State: All (231) · In progress (4) · Validation pending (2) · Validation inconclusive (0) · Valid (6) · Invalid (0) · Error (219) This host has 1/3 errors and 2/3 good results: http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9990507 State: In progress (32) · Validation pending (7) · Validation inconclusive (1) · Valid (72) · Invalid (0) · Error (36) We probably need to establish criteria to separate "good hosts" from "misbehaving hosts", but it could still be just bad batches, incompatibility, power jumps or any other reason. The question is: will the advantages of having any kind of host crunching for the project outweigh the disadvantages of all kinds of delays, wasted work, jams etc. for other hosts in the project, or not? |
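As a rough illustration of what such a criterion could look like, the sketch below computes the share of finished results that failed (Invalid plus Error) for the hosts quoted in this thread. The counts are the ones posted above; the class `HostCounts`, the function `is_suspect` and both thresholds are hypothetical choices for this example, not project policy. In-progress, pending and inconclusive tasks are left out because their outcome is not yet known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostCounts:
    host_id: int
    valid: int
    invalid: int
    error: int

    @property
    def finished(self) -> int:
        """Results whose outcome is already known."""
        return self.valid + self.invalid + self.error

    def failure_fraction(self) -> float:
        """Share of finished results that were Invalid or Error."""
        if self.finished == 0:
            return 0.0
        return (self.invalid + self.error) / self.finished


# Hypothetical thresholds, chosen only for this illustration.
MIN_FINISHED = 20           # ignore hosts with too little history
MAX_FAILURE_FRACTION = 0.5  # flag hosts that fail more than half the time


def is_suspect(host: HostCounts) -> bool:
    return (host.finished >= MIN_FINISHED
            and host.failure_fraction() > MAX_FAILURE_FRACTION)


# Counts as quoted in the posts above.
HOSTS = [
    HostCounts(9996388, valid=2928, invalid=3551, error=350),
    HostCounts(9880631, valid=8, invalid=0, error=31),
    HostCounts(10297384, valid=6, invalid=0, error=219),
    HostCounts(9990507, valid=72, invalid=0, error=36),
]

if __name__ == "__main__":
    for host in HOSTS:
        flag = "SUSPECT" if is_suspect(host) else "ok"
        print(f"{host.host_id}: {host.failure_fraction():.1%} failed -> {flag}")
```

With these thresholds the first three hosts are flagged and 9990507 is not. A ratio alone cannot tell a bad host from a bad batch, so a flag here is a prompt to look closer rather than a verdict.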
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Right; I am working precisely on this, and I shall look at these hosts soonest. A good QUESTION now: "The question is: will the advantages of having any kind of host crunching for the project outweigh the disadvantages of all kinds of delays, wasted work, jams etc. for other hosts in the project, or not?" I pride myself on being "all" inclusive (well, Macs should come soon now, and my colleagues are working on GPU). My problem is that I cannot (yet) find the reasons for the Invalids, and even less so for the empty results which are valid but wrong. The empty results should be fixed soon on my side even if we don't know why they occur (BOINC client, Apache?). Investigation can continue on the BOINC server even after the workaround. I have been unable to determine any common factor (yet). So for the moment I have stopped new work for hosts with too high a ratio of Invalid to Valid results. I am working on the script to e-mail those host owners. I have already had some feedback from one such owner asking why his hosts are not getting new work and I am replying immediately, as collaboration here could be very very useful in determining the problem(s). So: e-mail the owner, e-mail the other owners, and get this on the BOINC Manager notices. Overall the situation appears to be improving slowly. Eric. |
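The e-mail script itself is not shown in the thread. As a sketch of how one such notice could be drafted, the example below uses Python's standard `email.message.EmailMessage`. The sender address, the wording and the function name `draft_notice` are placeholders invented for illustration. Only the results URL pattern is taken from the links posted above. Nothing is sent; the function just builds the message.

```python
from email.message import EmailMessage

# URL pattern as it appears in the links posted in this thread.
RESULTS_URL = "http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid={host_id}"

# Placeholder sender address, an assumption for this sketch.
SENDER = "project-admin@example.org"


def draft_notice(owner_email: str, host_id: int,
                 valid: int, invalid: int) -> EmailMessage:
    """Build (but do not send) a notice for the owner of one host."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = owner_email
    msg["Subject"] = f"Your host {host_id} is no longer receiving new work"
    msg.set_content(
        f"Hello,\n\n"
        f"Host {host_id} has returned {invalid} invalid results against "
        f"{valid} valid ones, so new work for it has been stopped for now.\n"
        f"You can review its tasks here:\n"
        f"{RESULTS_URL.format(host_id=host_id)}\n\n"
        f"Any details you can share about the machine would help us "
        f"find the cause.\n"
    )
    return msg


if __name__ == "__main__":
    notice = draft_notice("owner@example.org", 9996388, valid=2928, invalid=3551)
    print(notice)
```

Keeping the drafting step separate from sending makes it easy to review a batch of messages before anything goes out.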
Joined: 29 Nov 13 Posts: 59 Credit: 4,012,100 RAC: 0 |
Ah good to hear 1 of them replied! :) Team AnandTech - WCG, Uni@H, F@H, MW@H, Ast@H, LHC@H, R@H, CPDN, E@H. Main rig - Ryzen 3600, MSI B450 Gm Pro C AC, 32GB DDR4 3200, RTX 3060 Ti 8GB, Win10 64bit 2nd rig - i7 4930k @4.1 GHz, 16 GB DDR3 1866, HD 7870XT 3GB(DS), Win7 64 |
Joined: 12 Mar 12 Posts: 128 Credit: 20,013,377 RAC: 0 |
"Ah good to hear 1 of them replied! :)" One of us :) We're all one team. Seeing 12k active users and 20k active hosts, it is amazing what a vast impact the project has. So I urge everyone to check their error, invalid and inconclusive tasks on a regular basis for any kind of irregularity, so it can be fixed ASAP. |
©2025 CERN