21) Message boards : Number crunching : Host messing up tons of results (Message 27356)
Posted 10 Apr 2015 by alvin
Post:
Seems that every other task is giving me the invalid or inconclusive tasks, even with CPU not over clocked.


Bill,
I'd suggest running BOINC in compatibility mode and trying the 32-bit and 64-bit versions with different compatibility settings (Windows 7, 8, Vista, etc.) to see how the output changes.
Also, in the properties, set "Always Run as Administrator".
Make sure your BIOS is the latest version and that there are no fancy CPU or memory settings in it.
Finally, play with the CPU settings in BOINC and in the LHC project.
22) Message boards : Number crunching : Host messing up tons of results (Message 27353)
Posted 9 Apr 2015 by alvin
Post:
This host has a huge ratio of errors and non-empty results:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=10301941
and this
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=10339565


This one has a high rate of inconclusives: http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=10137504
23) Message boards : Number crunching : Host messing up tons of results (Message 27352)
Posted 9 Apr 2015 by alvin
Post:
Ah good to hear 1 of them replied! :)

One of us)

We're all one team. Seeing 12k active users and 20k active hosts, it is amazing what a vast impact the project has.
So I urge everyone to check their error, invalid and inconclusive tasks on a regular basis to spot any irregularities, so they can be fixed ASAP.
24) Message boards : Number crunching : Host messing up tons of results (Message 27347)
Posted 9 Apr 2015 by alvin
Post:
This host looks suspicious to me, as it has a huge ratio of errors to regular results:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9880631
State: All (46) · In progress (7) · Validation pending (0) · Validation inconclusive (0) · Valid (8) · Invalid (0) · Error (31)


or this
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=10297384
State: All (231) · In progress (4) · Validation pending (2) · Validation inconclusive (0) · Valid (6) · Invalid (0) · Error (219)


This host has 1/3 errors and 2/3 good results:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9990507
In progress (32) · Validation pending (7) · Validation inconclusive (1) · Valid (72) · Invalid (0) · Error (36)
We probably need to establish criteria to separate "good hosts" from "misbehaving hosts".

But it could still just be bad batches, incompatibility, power surges, or any other reason.
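
A minimal sketch of what such a criterion could look like, in Python. The 50% threshold, the 20-result minimum, and the choice to count errors plus invalids against all finished results are my assumptions for illustration, not anything the project uses. The counts are the ones quoted above for the three hosts.

```python
from dataclasses import dataclass


@dataclass
class HostCounts:
    valid: int
    invalid: int
    inconclusive: int
    error: int

    @property
    def finished(self) -> int:
        # Everything the host has reported back, validated or not.
        return self.valid + self.invalid + self.inconclusive + self.error


def bad_ratio(c: HostCounts) -> float:
    """Share of finished results that ended as Error or Invalid."""
    if c.finished == 0:
        return 0.0
    return (c.error + c.invalid) / c.finished


def classify(c: HostCounts, threshold: float = 0.5, min_tasks: int = 20) -> str:
    """Hypothetical rule: judge a host only once it has enough results."""
    if c.finished < min_tasks:
        return "too few results"
    return "suspect" if bad_ratio(c) >= threshold else "ok"


hosts = {
    "9880631": HostCounts(valid=8, invalid=0, inconclusive=0, error=31),
    "10297384": HostCounts(valid=6, invalid=0, inconclusive=0, error=219),
    "9990507": HostCounts(valid=72, invalid=0, inconclusive=1, error=36),
}

for host_id, counts in hosts.items():
    print(host_id, f"{bad_ratio(counts):.0%}", classify(counts))
```

With this rule the first two hosts come out as suspect and the third as ok, but a ratio alone cannot tell a broken host from one that simply received a bad batch.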


The question is: will the advantages of having any kind of host crunching for the project outweigh the disadvantages of all the delays, wasted work, jams, etc. for other hosts in the project, or not?
25) Message boards : Number crunching : Host messing up tons of results (Message 27339)
Posted 8 Apr 2015 by alvin
Post:
Now I'd like to know: are there any keys in host management to force a host to abandon tasks?
Say, this is one of my 30 error tasks:

max # of error/total/success tasks 3, 3, 3
errors Too many total results
Task Computer Sent Time reported or deadline Status Run time CPU time Credit Application
64654399 10351263 5 Apr 2015, 18:31:35 UTC 6 Apr 2015, 4:31:48 UTC Cancelled by server 0.00 0.00 --- SixTrack v451.07 (pni)
64654400 10298292 5 Apr 2015, 18:30:53 UTC 5 Apr 2015, 18:38:23 UTC Cancelled by server 0.00 0.00 --- SixTrack v451.07 (sse2)
64654401 10356797 5 Apr 2015, 18:34:12 UTC 5 Apr 2015, 18:37:15 UTC Abandoned 0.00 0.00 --- SixTrack v451.07 (pni)

So central has the power to either cancel or abandon a task with zero effort.

Whenever I see a "Cancelled by server" or "Abandoned" status, it definitely came from central, and therefore it could be applied to a whole bunch of tasks at once.
With a host blacklisted on your terms, it wouldn't get any new tasks. Voila!
So what are these keys to force a host to abandon ALL tasks?
26) Message boards : Number crunching : Host messing up tons of results (Message 27337)
Posted 8 Apr 2015 by alvin
Post:
My understanding is that once a host is on the blacklist, ALL its assigned tasks have to be redeployed to other hosts and recrunched.
As I see my 3 of 3 tasks related to 9996388 now fresh in the inconclusive list, I assume all these hundreds of new tasks in 9996388's inconclusive list still rely on central.
27) Message boards : Number crunching : Host messing up tons of results (Message 27335)
Posted 8 Apr 2015 by alvin
Post:
I swear I hit that button once! I swear!
Anyway, I think we need to check all our inconclusive tasks to uncover wrongdoing and build up a final list of unreliable hosts.
Once that's finished, we can see whether it stabilizes the system.

OK, say today I had 3 inconclusives, and they all happened again, surprise surprise, with our favorite 9996388.

I see the number of tasks has decreased again, down by approx. 1k, but the number of inconclusives has risen since the last check, from 4968 to 5331:

State: All (14485) · In progress (34) · Validation pending (2291) · Validation inconclusive (5331) · Valid (2928) · Invalid (3551) · Error (350)
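
To compare two snapshots of a host's results page without doing it by eye, the State line can be parsed into counts and diffed. A rough Python sketch: `parse_state` and `diff` are hypothetical helper names of mine, and the earlier snapshot is the one posted on 7 Apr for the same host.

```python
import re


def parse_state(line: str) -> dict[str, int]:
    """Turn 'All (14485) · In progress (34) · ...' into {'All': 14485, ...}."""
    pairs = re.findall(r"([A-Za-z ]+?)\s*\((\d+)\)", line)
    return {name.strip(): int(count) for name, count in pairs}


def diff(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Change in each counter between two snapshots (after minus before)."""
    return {key: after[key] - before[key] for key in after if key in before}


earlier = parse_state(
    "State: All (15847) · In progress (34) · Validation pending (3284) · "
    "Validation inconclusive (4968) · Valid (3164) · Invalid (4014) · Error (383)"
)
latest = parse_state(
    "State: All (14485) · In progress (34) · Validation pending (2291) · "
    "Validation inconclusive (5331) · Valid (2928) · Invalid (3551) · Error (350)"
)

changes = diff(earlier, latest)
for name, delta in changes.items():
    print(f"{name}: {delta:+d}")
```

For these two snapshots the total drops by 1362 while the inconclusives grow by 363, and the in-progress count stays at 34.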


How could that be if the host is blacklisted?
The increased number means that the host continues to spoil tasks all across the globe and that the blacklist doesn't work, as the host still has active tasks and a rising number of inconclusives.
My understanding of blacklisting is that the main server has to ignore everything that comes from that host, pretending it DOESN'T exist at all; otherwise we will suffer from its activity till Judgement Day.
28) Message boards : Number crunching : Host messing up tons of results (Message 27334)
Posted 8 Apr 2015 by alvin
Post:
I think we have to assume, almost by definition, that the owners of these machines don't pay much, if any, attention to BOINC - they may not even be running a version of BOINC which is capable of displaying notices, or the rogue results may be happening on a machine they don't regularly visit.



I think both ways work together: via BOINC Manager Notices we invite people to pay attention to their tasks and hosts on a regular basis, and propose ways to deal with possible issues.
By targeted emails we inform users that their particular hosts are being disabled. We also inform them about ways to deal with it.
29) Message boards : Number crunching : Host messing up tons of results (Message 27327)
Posted 7 Apr 2015 by alvin
Post:
As a side thought, let's take a look at this:
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=30042916

Task Computer Sent Time reported or deadline Status Run time (sec) CPU time (sec) Credit Application
63303862 10342642 28 Mar 2015 UTC 28 Mar 2015, 4:23:43 UTC Error while computing 0.00 0.00 --- SixTrack v451.07
63303863 10287799 28 Mar 2015 UTC 4 Apr 2015, 16:49:49 UTC Timed out - no response 0.00 0.00 --- SixTrack v451.07 (sse2)
63853277 10298282 31 Mar 2015, 4:27:24 UTC 1 Apr 2015, 8:06:20 UTC Completed, validation inconclusive 43,089.33 42,663.87 pending SixTrack v451.07 (sse2)
64614634 9996388 5 Apr 2015, 8:47:28 UTC 5 Apr 2015, 8:49:39 UTC Completed, validation inconclusive 2.25 0.30 pending SixTrack v451.07 (pni)
64872554 10343493 7 Apr 2015, 19:24:12 UTC 15 Apr 2015, 10:56:26 UTC In progress --- --- --- SixTrack v451.07 (sse2)

We see zeros, different versions and timeouts: the full set of all possible issues.
Could it be an x86 vs 64-bit compatibility issue? Could it be an sse2 vs pni compatibility issue?
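
One thing that can be checked from the table alone is how far apart the run times of the completed results are. A small Python sketch, assuming a made-up rule that a completed result is suspicious when it ran for less than 1% of the longest completed run in the same work unit. It does not answer the sse2 vs pni question; it only shows which result stands out. The in-progress task is left out because it has no run time yet.

```python
from dataclasses import dataclass


@dataclass
class Result:
    task: int
    host: int
    status: str
    run_time: float  # seconds
    variant: str     # "" when the page shows no suffix after the version


# Rows copied from the work unit page above.
results = [
    Result(63303862, 10342642, "Error while computing", 0.00, ""),
    Result(63303863, 10287799, "Timed out - no response", 0.00, "sse2"),
    Result(63853277, 10298282, "Completed, validation inconclusive", 43089.33, "sse2"),
    Result(64614634, 9996388, "Completed, validation inconclusive", 2.25, "pni"),
]


def suspiciously_short(rows: list[Result], fraction: float = 0.01) -> list[Result]:
    """Completed results that ran far shorter than the longest completed one."""
    completed = [r for r in rows if r.status.startswith("Completed")]
    if len(completed) < 2:
        return []  # nothing to compare against
    longest = max(r.run_time for r in completed)
    return [r for r in completed if r.run_time < fraction * longest]


for r in suspiciously_short(results):
    print(f"task {r.task} on host {r.host} ({r.variant}): {r.run_time} s")
```

Here the only result flagged is the 2.25-second one from host 9996388.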
30) Message boards : Number crunching : Host messing up tons of results (Message 27326)
Posted 7 Apr 2015 by alvin
Post:
Here is a draft of procedures and steps to resolve hosts returning inconclusive results for LHC@home


Possible reasons why a host returns inconclusive results:
1. Not the latest version of BOINC Manager. Please check and install the latest from http://boinc.berkeley.edu/download
2. The host has overheating issues with the CPU, the case or the video card.
To resolve, please
- try to set default CPU settings
- check that the CPU fan works correctly
- set an alarm warning for a certain CPU temperature limit, like 60C or 70C
- make sure the case fans are moving enough air and the temperature in the case is within limits.
3. Memory issues. Please remove some memory, replace modules, or change the timing settings. Set them to defaults.
4. Video card issues. Please update your drivers, check the onboard fan, and make sure the card has enough room for air circulation.
5. Power supply issues. The power supply might not be powerful enough to keep the CPU stable at 100%, and in conjunction with other BOINC tasks/projects the CPU and GPU together might draw too much power, so the power supply fails.
- Please try a different power supply unit,
- please reduce the allowed CPU percentage limits for all projects. This change can be made both in BOINC Manager and on each project's configuration page.
6. Project reset. Reinstall BOINC Manager, reset or remove the project and add it again.
7. Disk space. Check that your computer has enough disk space for the projects to run. Running several BOINC projects requires plenty of disk space, and any other project could easily fill all the available space, so that LHC@home is left with nothing.
8. Other CPU- and GPU-consuming tasks. Please make sure no other CPU- and GPU-hungry tasks are running simultaneously. Try to avoid heavy CPU and GPU load while crunching tasks for the LHC project.
9. Make sure no viruses, malware or adware are active. Please use the latest antivirus and keep it fully updated with the latest signatures. Use freeware like AVP, AVAST, Avira, etc. Use Malwarebytes for malware. Use Microsoft Security Essentials for Windows.
Check cnet.com and filehippo.com for the latest software.
10. Try to suspend other projects while test-crunching for LHC@home. Change the CPU settings, and change the share for multiprocessor systems: instead of 0, put some specific share like 30% or 50% or 80% and see the output.
11.




==========
Please add any other ideas and feel free to correct any of the proposed steps in this draft.
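
For the disk space step, a quick check could look like the Python sketch below. The path and the 1 GiB requirement are placeholders of mine: point `path` at your BOINC data directory, whose location depends on the installation, and pick a limit that suits your projects.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Free space, in GiB, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path: str = ".", required_gib: float = 1.0) -> bool:
    """True when at least `required_gib` GiB is free on that drive."""
    return free_gib(path) >= required_gib


# "." is only a stand-in for the BOINC data directory.
print(f"free: {free_gib('.'):.1f} GiB, enough: {has_enough_space('.', 1.0)}")
```

This only looks at the drive as a whole; BOINC's own disk usage limits in the computing preferences still apply on top of it.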
31) Message boards : Number crunching : Host messing up tons of results (Message 27325)
Posted 7 Apr 2015 by alvin
Post:
Eric,
It's also worth putting a message about this issue in BOINC Manager Notices, so every user of the total 120k across the world can check their hosts against your list and take appropriate action.
It's unfair to users to just silently block their hosts. We need to explain and make some proposals on how to fix it.

Also, the user needs to be able to resume the host in a testing phase, to make sure the host is working correctly before it returns to full production.

I think we need to write a draft of the fix procedure here and share it via the BOINC Manager Notice board (which was developed for that, right?)


It might be worth starting a new sticky thread, or keeping this one ONLY for host IDs and the procedure to fix or resolve them, with limited ability for the rest of us to comment, so that only admins fill it up and the lists of hosts are easy to get through.

Also, for all hosts involved, I think we need to establish special empty/limited training/testing tasks, so a host needs to be conclusive before you assign it back to production tasks.

These training tasks could be deployed in limited numbers to willing users; I'm ready to sacrifice some of my machine time to produce correlative results.

We are talking here about saving Eric's time, so he doesn't spend it fixing tons of small issues if that can be automated.

I remember Eric mentioned 4 classes of problems to resolve, so I think we can help where we are able to.
32) Message boards : Number crunching : Host messing up tons of results (Message 27324)
Posted 7 Apr 2015 by alvin
Post:
Let's see if the host ban works first.
So for now we have these numbers for that famous 9996388:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388
I'd expect to see all zeros in the very end I think.

State: All (15847) · In progress (34) · Validation pending (3284) · Validation inconclusive (4968) · Valid (3164) · Invalid (4014) · Error (383)
Application: All (15847) · SixTrack (15847) · sixtracktest (0)


Some 1k tasks went away, but we need to see them all cleared completely to make sure this ban works right.

And this is the state for DOWNLOADED tasks:
Task Work unit Sent Time reported or deadline Status Run time CPU time Credit Application
64841129 30742048 7 Apr 2015, 11:44:07 UTC 7 Apr 2015, 11:46:43 UTC Error while downloading 0.00 0.00 --- SixTrack v451.07
64840860 30741914 7 Apr 2015, 11:41:41 UTC 7 Apr 2015, 11:43:32 UTC Error while downloading 0.00 0.00 --- SixTrack v451.07
64840890 30741929 7 Apr 2015, 11:39:36 UTC 7 Apr 2015, 11:41:41 UTC Error while downloading 0.00 0.00 --- SixTrack v451.07
64840544 30741756 7 Apr 2015, 11:37:25 UTC 7 Apr 2015, 11:39:01 UTC Error while downloading 0.00 0.00 --- SixTrack v451.07
64839711 30741478 7 Apr 2015, 11:28:40 UTC 7 Apr 2015, 11:31:45 UTC Error while downloading 0.00 0.00 --- SixTrack v451.07 (sse2)
64839967 30741564 7 Apr 2015, 11:28:08 UTC 7 Apr 2015, 11:31:45 UTC Error while downloading 0.00 0.00 --- SixTrack v451.07 (sse2)
64839508 30741411 7 Apr 2015, 11:24:32 UTC 7 Apr 2015, 11:28:08 UTC Error while downloading 0.00 0.00 --- SixTrack v451.07
64839298 30741341 7 Apr 2015, 11:24:19 UTC 7 Apr 2015, 11:28:08 UTC Error while downloading 0.00 0.00 --- SixTrack v451.07
33) Message boards : Number crunching : Host messing up tons of results (Message 27314)
Posted 7 Apr 2015 by alvin
Post:
The same host 9996388 spoiled 3 tasks of 3 today again.
All tasks lasted 1-2 seconds, based on their statistics.

As it is definitely contactable and alive, could we either send a message to the owner, or block it from any further tasks and remove all existing ones? I see 17k tasks.

http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388

State: All (17205) · In progress (34) · Validation pending (4266) · Validation inconclusive (4662) · Valid (3220) · Invalid (4596) · Error (427)
Application: All (17205) · SixTrack (17205) · sixtracktest (0)
34) Message boards : Number crunching : Host messing up tons of results (Message 27310)
Posted 7 Apr 2015 by alvin
Post:
Tullio,
The forum page is very slow sometimes.
When you press "post reply" it takes a while with no movement, so anyone will think "oh, I missed that button" and click again, I bet.
35) Message boards : Number crunching : Host messing up tons of results (Message 27305)
Posted 7 Apr 2015 by alvin
Post:
Eric,
Is there any way to ban a particular host, not the user? In that case, would all tasks assigned to it be discarded indefinitely, whether they were calculated on the user's side or not?
In that case other hosts would just carry on, unaware of the broken host, and it would save us a fortune of time and energy)
36) Message boards : Number crunching : Host messing up tons of results (Message 27298)
Posted 7 Apr 2015 by alvin
Post:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388
Correct, this host spoiled 5 tasks of 5 for me yesterday and needs to be resolved.
Overclocking, I bet.
37) Message boards : Number crunching : Available work? (Message 27272)
Posted 4 Apr 2015 by alvin
Post:
Interestingly, I see a significant drop in "inconclusive" results in the last few days: only 2-4 per day, not scores as before.
38) Message boards : News : Server Intervention 10-Feb-2014 (Message 27147)
Posted 11 Feb 2015 by alvin
Post:
More than 200 "error while downloading" errors, mostly dated the evening of 10 Feb and the night and morning hours of 11 Feb (UTC).

But I also have more than 300 WUs dated 11 Feb and onwards, timed from the evening up to the current time, so it seems WUs are in the pipeline.
39) Message boards : Number crunching : Where did all the work go? (Message 27146)
Posted 11 Feb 2015 by alvin
Post:
More than 200 "error while downloading" errors, dated 10 Feb and the morning of 11 Feb UTC.
New WUs came later on 11 Feb with no issues.
Seems to be working OK now.
40) Message boards : Number crunching : Stats Export MIA (Message 27083)
Posted 26 Jan 2015 by alvin
Post:
Yes, usually it's a project issue.
Not a big deal until we have 500k+ tasks.

