Message boards : Number crunching : Host messing up tons of results

Tom95134

Joined: 4 May 07
Posts: 250
Credit: 826,541
RAC: 0
Message 27331 - Posted: 8 Apr 2015, 4:17:44 UTC
Last modified: 8 Apr 2015, 4:19:18 UTC

Eric,

I saw the following in your recent post...
"These training tasks in limited numbers could be deployed to willing users so I'm ready to sacrifice some of my machine time to make correlative results."


As you pointed out in our previous exchange of messages, my system doesn't appear to be throwing errors so I'm quite willing to point to a different location and get some "broken" tasks to crunch as a point of reference.

    Windows 7 Pro x64 SP1
    INTEL CORE i7-2600 3.4GHz
    16GB RAM
    NVIDIA GPU
    My network is Wi-Fi using Linksys WMP600N



I'll shut up now and let you get on with the work. :)

Thanks for the effort you are putting in.


Tom

ID: 27331 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27332 - Posted: 8 Apr 2015, 7:49:11 UTC - in response to Message 27330.  

"BOINC Manager Notices
It's unfair to users to just silently block their hosts. We need to explain and make some proposals for how to fix it." AGREED.

I suspect that a directly targeted email to the host's owner would have more success than a Notice. I think we have to assume, almost by definition, that the owners of these machines don't pay much, if any, attention to BOINC - they may not even be running a version of BOINC which is capable of displaying notices, or the rogue results may be happening on a machine they don't regularly visit.
ID: 27332 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27333 - Posted: 8 Apr 2015, 8:31:21 UTC - in response to Message 27332.  

Good point, and it can be done in a script of course. Eric.
ID: 27333 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27334 - Posted: 8 Apr 2015, 8:41:35 UTC - in response to Message 27333.  
Last modified: 8 Apr 2015, 8:50:27 UTC

I think we have to assume, almost by definition, that the owners of these machines don't pay much, if any, attention to BOINC - they may not even be running a version of BOINC which is capable of displaying notices, or the rogue results may be happening on a machine they don't regularly visit.



I think both of these approaches work together: via BOINC Manager Notices we invite people to pay attention to their tasks and hosts on a regular basis and propose ways to deal with possible issues.
By targeted emails we inform users that their particular hosts are being disabled, and we also tell them how to deal with it.
ID: 27334 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27335 - Posted: 8 Apr 2015, 8:41:36 UTC - in response to Message 27333.  
Last modified: 8 Apr 2015, 9:09:45 UTC

I swear I hit that button once! I swear!
Anyway, I think we need to check all our inconclusive tasks to find the wrongdoers and build up a final list of unreliable hosts.
Once that is finished, we can see whether it stabilizes the system.

OK, say, today I had 3 inconclusives and, surprise surprise, they all happened again with our favorite host, 9996388.

I see the number of tasks has decreased again, down by approx. 1k, but the number of inconclusives has risen since the last check, from 4968 to 5331:

State: All (14485) · In progress (34) · Validation pending (2291) · Validation inconclusive (5331) · Valid (2928) · Invalid (3551) · Error (350)


How can that be if the host is blacklisted?
The increased number means that the host continues to spoil tasks all across the globe and that the blacklist doesn't work, as the host still has active tasks and a rising number of inconclusives.
My understanding of blacklisting is that the main server has to ignore everything that comes from that host, pretending it DOESN'T exist at all; otherwise we will suffer from its activity till Judgement Day.
ID: 27335 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27336 - Posted: 8 Apr 2015, 9:23:00 UTC - in response to Message 27335.  

Don't worry about 9996388 - it hasn't received any new tasks since 7 Apr 2015, 11:46:43 UTC (yesterday) - previously it was grabbing new ones every minute.

If more inconclusive results are showing now than before, they will be previously 'validation pending' tasks which have now been tested against a wingmate - and found wanting.
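
To illustrate how a task can move from 'validation pending' to 'validation inconclusive' once a wingmate reports, here is a minimal sketch in Python. It is a toy model only, not the project's validator: the function name `classify`, the host labels, the result digests and the quorum of 2 are all assumptions made for the example.

```python
from collections import Counter


def classify(digests, min_quorum=2):
    """Toy quorum check (hypothetical, not the real BOINC validator).

    `digests` maps a host label to a digest of the result it returned.
    Returns a dict mapping each host label to a status string.
    """
    if len(digests) < min_quorum:
        # Not enough wingmates have reported yet.
        return {host: "validation pending" for host in digests}

    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes < min_quorum:
        # No digest has reached the quorum: every result stays
        # inconclusive until a further wingmate reports.
        return {host: "validation inconclusive" for host in digests}

    # A quorum agrees: matching results are valid, the rest invalid.
    return {
        host: "valid" if digest == winner else "invalid"
        for host, digest in digests.items()
    }
```

In this model a lone result is pending, two disagreeing results are both inconclusive, and a third result that matches one of them settles which host was wrong.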

Simply calling up that 14,000 task list took the server ages - let's hope that when the excessive task lists are finally purged from the system, MySQL will be able to run at normal speed again.

I agree that email and Notices are not mutually exclusive ways of getting the message out - by all means use both.
ID: 27336 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27337 - Posted: 8 Apr 2015, 9:29:09 UTC - in response to Message 27336.  
Last modified: 8 Apr 2015, 9:51:55 UTC

My understanding is that, since the host is on the blacklist, ALL its assigned tasks have to be redeployed to other hosts and recrunched.
As I see that all 3 of my tasks related to 9996388 are now fresh in the inconclusive list, I assume all these hundreds of new tasks in 9996388's inconclusive list still rely on the central server.
ID: 27337 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27338 - Posted: 8 Apr 2015, 10:12:02 UTC - in response to Message 27337.  

My understanding is that, since the host is on the blacklist, ALL its assigned tasks have to be redeployed to other hosts and recrunched.
As I see that all 3 of my tasks related to 9996388 are now fresh in the inconclusive list, I assume all these hundreds of new tasks in 9996388's inconclusive list still rely on the central server.

No, blacklisting simply prevents new tasks being allocated - as per the mechanism "set max_results_day field to -1". There is supposedly an automatic quota system which - eventually - drags results per day down to 1. But the automatic system allows it to float back up again if tasks validate, and enough of 9996388's tasks validate to allow the work to keep flowing. With the special -1 setting, the validation of existing tasks won't allow quota to be increased automatically.
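
As a rough illustration of the difference between the automatic quota and the special -1 setting, here is a toy model in Python. The update rule (subtract one on a bad result, double on a valid one, with a ceiling of 100) is an assumption chosen for the example and is not taken from the server code; only the sticky -1 value comes from the description above.

```python
BLOCKED = -1  # the special max_results_day value described above


def update_quota(quota, outcome, ceiling=100):
    """Return the new daily quota after one reported result (toy rule)."""
    if quota == BLOCKED:
        # Sticky: validating old tasks never lifts a manual block.
        return BLOCKED
    if outcome == "valid":
        # Assumed recovery rule: double, up to the ceiling.
        return min(ceiling, quota * 2)
    # Assumed penalty rule: lose one per bad result, never below 1.
    return max(1, quota - 1)


def simulate(quota, outcomes):
    """Apply a sequence of outcomes and return the final quota."""
    for outcome in outcomes:
        quota = update_quota(quota, outcome)
    return quota
```

Under these assumed rules a host that returns mostly bad results but validates now and then keeps bouncing back to a high quota, which is consistent with the behaviour described for 9996388, while a host set to -1 stays blocked whatever it returns.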
ID: 27338 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27339 - Posted: 8 Apr 2015, 10:21:55 UTC - in response to Message 27338.  
Last modified: 8 Apr 2015, 10:26:22 UTC

Now I'd like to know: are there any keys in host management to force a host to abandon tasks?
Say, this is one of my 30 error tasks:

max # of error/total/success tasks: 3, 3, 3
errors: Too many total results

Task     | Computer | Sent                     | Time reported or deadline | Status              | Run time | CPU time | Credit | Application
64654399 | 10351263 | 5 Apr 2015, 18:31:35 UTC | 6 Apr 2015, 4:31:48 UTC   | Cancelled by server | 0.00     | 0.00     | ---    | SixTrack v451.07 (pni)
64654400 | 10298292 | 5 Apr 2015, 18:30:53 UTC | 5 Apr 2015, 18:38:23 UTC  | Cancelled by server | 0.00     | 0.00     | ---    | SixTrack v451.07 (sse2)
64654401 | 10356797 | 5 Apr 2015, 18:34:12 UTC | 5 Apr 2015, 18:37:15 UTC  | Abandoned           | 0.00     | 0.00     | ---    | SixTrack v451.07 (pni)

So central has the power to either cancel or abandon a task with zero effort.

Whenever I see a "Cancelled by server" or "Abandoned" status, it definitely came from central, and therefore it could be applied to a whole bunch of tasks at once.
With the host blacklisted, on your terms it wouldn't get any new tasks - voila!
So what are these keys to force a host to abandon ALL tasks?
ID: 27339 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27340 - Posted: 8 Apr 2015, 11:32:56 UTC

Sorry, have to rush (supposed to be on vacation today). Also I need
to print off all this valuable information.

I have set the worst 20 hosts to max_results_day -1 and checked that.

Eric.
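
For readers wondering what "set the worst N hosts to max_results_day -1 and check" might look like in a script, here is a minimal sketch using Python's built-in sqlite3 module. It runs against a made-up in-memory table and is not the project's actual script: the three-column `host` table, the `invalid_count` ranking column, the demo rows and the function names are all assumptions. Only the `max_results_day` field and the -1 value come from this thread.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for a host table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE host (id INTEGER PRIMARY KEY, "
        "max_results_day INTEGER, invalid_count INTEGER)"
    )
    # Made-up demo rows: (id, max_results_day, invalid_count).
    conn.executemany(
        "INSERT INTO host VALUES (?, ?, ?)",
        [(1, 100, 5), (2, 100, 900), (3, 100, 0), (4, 100, 350)],
    )
    return conn


def block_worst_hosts(conn, limit=20):
    """Set max_results_day to -1 for the `limit` not-yet-blocked hosts
    with the most invalid results; return their ids."""
    rows = conn.execute(
        "SELECT id FROM host WHERE max_results_day <> -1 "
        "ORDER BY invalid_count DESC, id ASC LIMIT ?",
        (limit,),
    ).fetchall()
    ids = [row[0] for row in rows]
    conn.executemany(
        "UPDATE host SET max_results_day = -1 WHERE id = ?",
        [(host_id,) for host_id in ids],
    )
    conn.commit()
    return ids


def blocked_hosts(conn):
    """Re-read the table to confirm which hosts are blocked."""
    rows = conn.execute(
        "SELECT id FROM host WHERE max_results_day = -1 ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `block_worst_hosts(conn, limit=20)` and then `blocked_hosts(conn)` mirrors the "set, then check" step; on a real server the ranking query would have to be built from the actual result data.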
ID: 27340 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27341 - Posted: 8 Apr 2015, 11:48:13 UTC - in response to Message 27339.  

'Cancelled by server' can be either an automatic action - perhaps when a quorum partner returns their result, valid but late: if the replacement task hasn't been issued, or has been issued but not yet started by the recipient, it gets cancelled with no manual intervention. Or, as I suspect in this case, the server operators can cancel a whole batch of WUs because they were configured wrongly or otherwise no longer needed. I believe it's generally easier to cancel all WUs in a batch, than to cancel all tasks sent to an individual user or host, but there may be some additional tools made available recently which could help. I'll try to look into those before Eric gets back from his vacation.

'Abandoned' is a different matter entirely. It's supposed to happen when a computer is detached from the project, and then re-attached - but there are indications that there is a deeply-buried bug somewhere in the server code which occasionally throws a whole batch of tasks away while the computer still thinks it's attached and is processing them. But that's one where we need help from a server administrator, rather than a project scientist like Eric.
ID: 27341 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27342 - Posted: 8 Apr 2015, 12:40:25 UTC - in response to Message 27341.  

Thanks Richard (again); it is my wife's birthday! and my son is
captain on a flight from Dubai to Geneva, but it is late
(not his fault) and so I have a few minutes to spare :-)

I am working on the e-mails to clients, having already automated
the setting of max_results_day to -1.

I fully agree with what you are saying, and yes it would be good
to cancel WUs. I may be able to do this with existing tools or
at least mysql. More whenever. Eric.
ID: 27342 · Report as offensive     Reply Quote
[TA]Assimilator1
Joined: 29 Nov 13
Posts: 58
Credit: 4,010,807
RAC: 28
Message 27343 - Posted: 8 Apr 2015, 16:54:40 UTC - in response to Message 27325.  

Eric
It is also worth putting a message related to this issue in BOINC Manager Notices, so every user of the total 120k across the world will check their hosts against your list and take appropriate action.
It's unfair to users to just silently block their hosts. We need to explain and make some proposals for how to fix it.


This host wasn't silently blocked; Eric emailed the owner at least twice, I believe, & I sent the guy a PM months ago.

But I agree about the BOINC notices, I mentioned that a post or 2 back ;)
Team AnandTech - SETI@H, Muon1 DPAD, F@H, MW@H, A@H, LHC@H, POGS, R@H, DHEP, CPDN, E@H.
Main rig - Ryzen 3600, MSI B450 Gm Pro C AC, 32GB DDR4 3200, RX580 8GB, Win10 64bit
2nd rig - i7 4930k @4.1 GHz, 16 GB DDR3 1866, HD 7870XT 3GB(DS), Win7 64bit
ID: 27343 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27344 - Posted: 8 Apr 2015, 17:12:36 UTC

Sorry men; I searched the WWW but I don't know how to send a
BOINC Manager Notice. Do I need permissions?

I am generating e-mails to those users affected. Eric.
ID: 27344 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27345 - Posted: 8 Apr 2015, 18:02:32 UTC - in response to Message 27344.  

The instruction manual for notices is

https://boinc.berkeley.edu/trac/wiki/ProjectNotices

It refers first of all to "your gui_urls.xml file": this would place (optional) project web page buttons on the left of our BOINC Manager screens, below the 'Command' buttons. You're not showing any optional buttons at the moment, so this project may not even have a gui_urls.xml file yet (or it may be empty). That would be step one: https://boinc.berkeley.edu/trac/wiki/GuiUrls.

After that, Notices appear to be linked to the procedure for putting news items and matching comment threads onto the front page - you have that authority already, so it should be relatively straightforward to find the 'export' button they refer to.
ID: 27345 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27346 - Posted: 8 Apr 2015, 18:30:58 UTC - in response to Message 27345.  

Thanks AGAIN Richard; it looks as if I just need to set
the URL stuff and my MB NEWS will be displayed.
Eric.
ID: 27346 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27347 - Posted: 9 Apr 2015, 7:56:27 UTC - in response to Message 27346.  
Last modified: 9 Apr 2015, 8:09:17 UTC

This host looks suspicious to me, as it has a huge ratio of errors to regular results:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9880631
State: All (46) · In progress (7) · Validation pending (0) · Validation inconclusive (0) · Valid (8) · Invalid (0) · Error (31)

Or this one:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=10297384
State: All (231) · In progress (4) · Validation pending (2) · Validation inconclusive (0) · Valid (6) · Invalid (0) · Error (219)

This host has 1/3 errors and 2/3 good results:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9990507
State: In progress (32) · Validation pending (7) · Validation inconclusive (1) · Valid (72) · Invalid (0) · Error (36)

We probably need to establish criteria to separate "good hosts" from "misbehaving hosts",
but it could still be just bad batches, incompatibility, power fluctuations, or any other reason.

The question is: will the advantages of having any kind of host crunching for the project outweigh the disadvantages of all kinds of delays, wasted work, jams, etc. for the other hosts in the project, or not?
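
One possible way to turn the state summaries above into a criterion is sketched below in Python. It parses a summary line copied from a host's task list and computes the share of finished tasks that ended as Invalid or Error. The function names and the 0.5 threshold are assumptions for illustration only, and, as noted above, a high share could still come from bad batches rather than from a bad host.

```python
import re

# Matches entries such as "Valid (8)" or "Invalid(0)" in a state summary line.
_STATE = re.compile(
    r"(All|In progress|Validation pending|Validation inconclusive|"
    r"Valid|Invalid|Error)\s*\((\d+)\)"
)


def parse_states(line):
    """Return a dict like {"Valid": 8, "Error": 31, ...} from a summary line."""
    return {name: int(count) for name, count in _STATE.findall(line)}


def failure_share(states):
    """Share of finished tasks (Valid + Invalid + Error) that went wrong."""
    bad = states.get("Invalid", 0) + states.get("Error", 0)
    finished = bad + states.get("Valid", 0)
    return bad / finished if finished else 0.0


def looks_unreliable(line, threshold=0.5):
    """Flag a host whose failure share exceeds a (hypothetical) threshold."""
    return failure_share(parse_states(line)) > threshold
```

With this threshold the first two hosts listed above would be flagged and the third (36 errors against 72 valid results, a share of one third) would not. Where to draw the line is exactly the open question.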
ID: 27347 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27348 - Posted: 9 Apr 2015, 9:57:09 UTC - in response to Message 27347.  

Right; I am working precisely on this, and I shall look at these
hosts soonest.

A good QUESTION now:
The question is: will the advantages of having any kind of host crunching for the project outweigh the disadvantages of all kinds of delays, wasted work, jams, etc. for the other hosts in the project, or not?

I pride myself on being "all" inclusive (well, Macs should come soon now,
and my colleagues are working on GPU). My problem is that I cannot (yet)
find the reasons for the Invalids, and even less so for the empty results
which are valid but wrong. The empty results should be fixed soon on my
side even if we don't know why they occur (BOINC client, Apache?).
Investigation can continue on the BOINC server even after the workaround.
I have been unable to determine any common factor (yet). So for the moment
I have stopped new work for hosts with too high a ratio of Invalid to Valid results.
I am working on the script to e-mail those host owners.
I have already had some feedback from one such owner asking why his hosts
are not getting new work and I am replying immediately, as collaboration
here could be very very useful in determining the problem(s).

So e-mail the owner, e-mail the other owners, and get this on the BOINC
Manager notices.

Overall the situation appears to be improving slowly. Eric.
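
A script for e-mailing affected host owners could start from something like the sketch below, which only composes the message with Python's standard `email.message.EmailMessage` and does not send anything. The function name, the sender address, the wording and the example figures are all hypothetical. Actually sending would need an SMTP server and the owners' real addresses, which are outside this sketch.

```python
from email.message import EmailMessage


def build_notice(owner_email, host_id, invalid, valid,
                 sender="project-admin@example.org"):
    """Compose (but do not send) a notice for one affected host owner."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = owner_email
    msg["Subject"] = f"Your host {host_id} is no longer receiving new work"
    msg.set_content(
        f"Host {host_id} has returned {invalid} invalid and {valid} valid "
        "results recently, so new work for it has been stopped for now.\n"
        "Please reply if you can help us investigate the cause.\n"
    )
    return msg
```

Keeping composition separate from sending makes it easy to print and review every message before anything goes out.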
ID: 27348 · Report as offensive     Reply Quote
[TA]Assimilator1
Joined: 29 Nov 13
Posts: 58
Credit: 4,010,807
RAC: 28
Message 27351 - Posted: 9 Apr 2015, 17:24:23 UTC - in response to Message 27348.  
Last modified: 9 Apr 2015, 17:25:12 UTC

Ah good to hear 1 of them replied! :)
ID: 27351 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27352 - Posted: 9 Apr 2015, 22:38:01 UTC - in response to Message 27351.  
Last modified: 9 Apr 2015, 23:06:29 UTC

Ah good to hear 1 of them replied! :)

One of us :)

We're all one team. Seeing 12k active users and 20k active hosts, it is amazing what a vast impact the project has.
So I urge everyone to check their error, invalid, and inconclusive tasks on a regular basis to spot any kind of irregularity, so it can be fixed ASAP.
ID: 27352 · Report as offensive     Reply Quote