Message boards : Number crunching : Host messing up tons of results

tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 27307 - Posted: 7 Apr 2015, 8:26:21 UTC
Last modified: 7 Apr 2015, 8:26:40 UTC

Eric, you are always posting two times. Why? (I am old too).
Tullio
ID: 27307 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27308 - Posted: 7 Apr 2015, 8:37:00 UTC - in response to Message 27305.  

There is no way I know of; I await my colleagues to fix
the null result problem, adjust outliers, and perhaps
ban the host. My bad WUs should be out of the way soon
and we shall be back to "normal". Eric.
ID: 27308 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27309 - Posted: 7 Apr 2015, 8:38:06 UTC - in response to Message 27308.  

There is no way I know of; I await my colleagues to fix
the null result problem, adjust outliers, and perhaps
ban the host. My bad WUs should be out of the way soon
and we shall be back to "normal". Eric.

"To blacklist a host, set its max_results_day field to -1."
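For anyone wondering what that one-line instruction amounts to, here is a minimal sketch. It assumes the host records live in a table called host with columns id and max_results_day; the in-memory SQLite table, the helper name blacklist_host and the starting value of 100 are stand-ins invented for illustration, not the project's real database or tooling.

```python
import sqlite3


def blacklist_host(conn, host_id):
    """Set max_results_day to -1 for one host; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE host SET max_results_day = -1 WHERE id = ?", (host_id,)
    )
    conn.commit()
    return cur.rowcount


# Stand-in for the project database (assumption): an in-memory SQLite table
# holding only the two columns this sketch needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, max_results_day INTEGER)")
conn.executemany("INSERT INTO host VALUES (?, ?)", [(9996388, 100), (10298282, 100)])

changed = blacklist_host(conn, 9996388)
rows = conn.execute("SELECT id, max_results_day FROM host ORDER BY id").fetchall()
print(changed, rows)
```

The parameterised query keeps the host ID out of the SQL string, which matters if the same change is ever scripted for a whole list of hosts.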
ID: 27309 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27310 - Posted: 7 Apr 2015, 9:35:15 UTC - in response to Message 27307.  
Last modified: 7 Apr 2015, 9:46:39 UTC

Tullio
The forum page is very slow sometimes.
When you press "post reply" it takes a while with nothing happening, so anyone would think "oh, I missed that button" and click again, I bet.
ID: 27310 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27311 - Posted: 7 Apr 2015, 11:41:53 UTC - in response to Message 27307.  

Because the server is so slow as others have remarked! Eric.
ID: 27311 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27312 - Posted: 7 Apr 2015, 11:43:03 UTC - in response to Message 27309.  

Great; except I have neither the tool nor the permission.
I'll get it done soonest though. Thanks a million. Eric.
ID: 27312 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27313 - Posted: 7 Apr 2015, 13:28:30 UTC - in response to Message 27312.  

Great; except I have neither the tool nor the permission.
I'll get it done soonest though. Thanks a million. Eric.

At least it gives you a better idea of the message to pass to Cerberus!
ID: 27313 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27314 - Posted: 7 Apr 2015, 14:00:25 UTC - in response to Message 27312.  

The same host, 9996388, spoiled 3 tasks out of 3 again today.
According to their statistics, all of the tasks lasted 1-2 seconds.

As it is definitely contactable and alive, could we either send a message to the owner, or block it from receiving any more tasks and remove all the existing ones? I see 17k tasks.

http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388

State: All (17205) · In progress (34) · Validation pending (4266) · Validation inconclusive (4662) · Valid (3220) · Invalid (4596) · Error (427)
Application: All (17205) · SixTrack (17205) · sixtracktest (0)
ID: 27314 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27315 - Posted: 7 Apr 2015, 14:44:29 UTC

OK; host has been set with max_results_day -1.
ID: 27315 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27317 - Posted: 7 Apr 2015, 18:34:01 UTC

I have now asked for another dozen or so hosts to be banned;
they all produce more invalids than valids. Eric.
(Trying to lock the stable door after the horses have bolted!)
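The "more invalids than valids" test is easy to automate. The sketch below is only an illustration: it parses a "State:" summary line of the kind posted earlier in this thread for host 9996388 and applies that one criterion. The function names are made up here, and nothing in it talks to the real server.

```python
import re


def parse_state_line(line):
    """Turn 'State: All (17205) · Valid (3220) · ...' into a dict of counts."""
    body = line.split(":", 1)[1]
    counts = {}
    for part in body.split("·"):
        match = re.fullmatch(r"\s*(.+?)\s*\((\d+)\)\s*", part)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def more_invalid_than_valid(counts):
    """The criterion from the post above: invalid results outnumber valid ones."""
    return counts.get("Invalid", 0) > counts.get("Valid", 0)


line = ("State: All (17205) · In progress (34) · Validation pending (4266) · "
        "Validation inconclusive (4662) · Valid (3220) · Invalid (4596) · Error (427)")
counts = parse_state_line(line)
print(counts["Valid"], counts["Invalid"], more_invalid_than_valid(counts))
```

A real filter would probably also want a minimum number of returned results, so that a new host with one unlucky task is not flagged straight away.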
ID: 27317 · Report as offensive     Reply Quote
Uffe F

Joined: 9 Jan 08
Posts: 66
Credit: 727,923
RAC: 0
Message 27318 - Posted: 7 Apr 2015, 19:05:46 UTC - in response to Message 27317.  

That sounds great. That will give a more steady throughput of full results you can use.
ID: 27318 · Report as offensive     Reply Quote
[TA]Assimilator1
Joined: 29 Nov 13
Posts: 58
Credit: 4,010,807
RAC: 28
Message 27319 - Posted: 7 Apr 2015, 19:42:25 UTC - in response to Message 27317.  

I have now asked for another dozen or so hosts to be banned;
they all produce more invalids than valids. Eric.
(Trying to lock the stable door after the horses have bolted!)


Better late than never ;)
ID: 27319 · Report as offensive     Reply Quote
Phil
Joined: 26 Jul 05
Posts: 63
Credit: 4,083,755
RAC: 0
Message 27321 - Posted: 7 Apr 2015, 19:55:18 UTC - in response to Message 27301.  
Last modified: 7 Apr 2015, 20:40:36 UTC

I have already checked and he is still suspended.......
I could not renew the suspension, and I tried. I
have to wait until current suspension, which does not
appear to be effective, expires. Eric.

I don't know the BOINC server nomenclature, but it seems you have banned him from using the forums rather than from doing more tasks, as shown here.
The mechanism shown on that page allows the server to restrict a host (right down to 1 job/day) until it starts returning useful results. Sorry, I don't know the full details; I hope someone else can help.

[edit]I recall there's a flag called reliable host that might help[/edit]
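To make the idea of a per-host daily quota concrete, here is a toy model. It is an assumption about how such a mechanism could behave (halting at one job per day on repeated errors, doubling back up on good results, and treating a negative quota as a ban), not a copy of the BOINC server code; the function names and the project maximum of 100 are invented for the example.

```python
def next_daily_quota(quota, result_ok, project_max):
    """Return the host's new daily quota after one reported result."""
    if quota < 0:
        # A negative quota marks a blacklisted host; results never lift it.
        return quota
    if result_ok:
        # Reward a good result by doubling, capped at the project maximum.
        return min(max(quota, 1) * 2, project_max)
    # Penalise a bad result by one, but never drop below one job per day.
    return max(quota - 1, 1)


def replay(quota, outcomes, project_max=100):
    """Apply a sequence of good (True) / bad (False) results to a quota."""
    for ok in outcomes:
        quota = next_daily_quota(quota, ok, project_max)
    return quota


print(replay(100, [False] * 150))     # host that only returns errors -> 1
print(replay(1, [True, True, True]))  # host that starts recovering -> 8
print(replay(-1, [True] * 10))        # blacklisted host stays at -1
```

The point of the asymmetry is that a broken host is throttled gradually but a repaired one gets back to full speed after only a handful of good results.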
ID: 27321 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27324 - Posted: 7 Apr 2015, 22:07:46 UTC - in response to Message 27321.  
Last modified: 7 Apr 2015, 22:37:23 UTC

Let's see if the host ban works first.
For now we have these numbers for that famous host 9996388:
http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=9996388
I'd expect to see all zeros in the end, I think.

State: All (15847) · In progress (34) · Validation pending (3284) · Validation inconclusive (4968) · Valid (3164) · Invalid (4014) · Error (383)
Application: All (15847) · SixTrack (15847) · sixtracktest (0)


Some 1k tasks went away, but we need to see them all cleared completely to be sure this ban works properly.
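To see where those tasks went, the two snapshots of host 9996388 posted in this thread can be compared state by state. This is just arithmetic on the numbers quoted above; the variable names are mine.

```python
# Counts copied from the two "State:" lines posted for host 9996388.
before = {"All": 17205, "In progress": 34, "Validation pending": 4266,
          "Validation inconclusive": 4662, "Valid": 3220, "Invalid": 4596,
          "Error": 427}
after = {"All": 15847, "In progress": 34, "Validation pending": 3284,
         "Validation inconclusive": 4968, "Valid": 3164, "Invalid": 4014,
         "Error": 383}

# Change per state between the two snapshots (negative means fewer tasks).
delta = {state: after[state] - before[state] for state in before}

for state, change in delta.items():
    print(f"{state:<24} {change:+d}")
```

The total drops by 1358, while "In progress" is unchanged at 34 and "Validation inconclusive" actually grows by 306, so the host's backlog is shrinking but is far from cleared.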

And this is the state of the DOWNLOADED tasks:
Task      Work unit  Sent                      Time reported or deadline  Status                   Run time  CPU time  Credit  Application
64841129  30742048   7 Apr 2015, 11:44:07 UTC  7 Apr 2015, 11:46:43 UTC   Error while downloading  0.00      0.00      ---     SixTrack v451.07
64840860  30741914   7 Apr 2015, 11:41:41 UTC  7 Apr 2015, 11:43:32 UTC   Error while downloading  0.00      0.00      ---     SixTrack v451.07
64840890  30741929   7 Apr 2015, 11:39:36 UTC  7 Apr 2015, 11:41:41 UTC   Error while downloading  0.00      0.00      ---     SixTrack v451.07
64840544  30741756   7 Apr 2015, 11:37:25 UTC  7 Apr 2015, 11:39:01 UTC   Error while downloading  0.00      0.00      ---     SixTrack v451.07
64839711  30741478   7 Apr 2015, 11:28:40 UTC  7 Apr 2015, 11:31:45 UTC   Error while downloading  0.00      0.00      ---     SixTrack v451.07 (sse2)
64839967  30741564   7 Apr 2015, 11:28:08 UTC  7 Apr 2015, 11:31:45 UTC   Error while downloading  0.00      0.00      ---     SixTrack v451.07 (sse2)
64839508  30741411   7 Apr 2015, 11:24:32 UTC  7 Apr 2015, 11:28:08 UTC   Error while downloading  0.00      0.00      ---     SixTrack v451.07
64839298  30741341   7 Apr 2015, 11:24:19 UTC  7 Apr 2015, 11:28:08 UTC   Error while downloading  0.00      0.00      ---     SixTrack v451.07
ID: 27324 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27325 - Posted: 7 Apr 2015, 22:14:53 UTC - in response to Message 27321.  
Last modified: 7 Apr 2015, 22:51:27 UTC

Eric
It is also worth putting a message about this issue in the BOINC Manager Notices, so that every one of the 120k users across the world can check their hosts against your list and take appropriate action.
It's unfair to users to just silently block their hosts. We need to explain and make some proposals for how to fix it.

Also, the user needs to be able to resume a host in a testing phase, to make sure the host is working correctly before it returns to full production.

I think we need to write a draft of the fix procedure here and share it via the BOINC Manager Notices board (which was developed for exactly that, right?).


It might be worth establishing a new sticky thread, or leaving this one, ONLY for host IDs and the procedure to fix or resolve them, with limited ability for the rest of us to comment: only admins would fill it in, making it easy to get through the lists of hosts.

Also, for all the hosts involved, I think we need to establish special empty/limited training/testing tasks, so that a host has to return conclusive results before you assign it back to production tasks.

These training tasks could be deployed in limited numbers to willing users, and I'm ready to sacrifice some of my machine time to produce results for comparison.

We are talking here about saving Eric's time, so that he doesn't spend it fixing tons of small issues that could be automated.

I remember Eric mentioned 4 classes of problems to resolve, so I think we can help where we are able to.
ID: 27325 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27326 - Posted: 7 Apr 2015, 23:10:10 UTC
Last modified: 7 Apr 2015, 23:21:30 UTC

Here is a draft of the procedures and steps to resolve hosts that return inconclusive results for LHC@home.


Possible reasons why a host returns inconclusive results:
1. Not the latest version of BOINC Manager. Please check and install the latest from http://boinc.berkeley.edu/download
2. The host has overheating issues with the CPU, the case or the video card.
To resolve, please
- try the default CPU settings
- check that the CPU fan works correctly
- set an alarm for a certain CPU temperature limit, like 60°C or 70°C
- make sure the case fans are moving enough air and the temperature in the case is within limits.
3. Memory issues. Please try removing some memory, replacing modules, or setting the memory timings back to their defaults.
4. Video card issues. Please update your drivers, check the onboard fan, and make sure the card has enough room for air circulation.
5. Power supply issues. The power supply might not be powerful enough to keep the CPU stable at 100%; in conjunction with other BOINC tasks/projects, the CPU and GPU together might draw too much power, so that the power supply fails.
- Please try a different power supply unit,
- please reduce the permitted CPU usage percentage for all projects. This change can be made both in BOINC Manager and in the preferences on each project's web page.
6. Project reset. Reinstall BOINC Manager, or reset or remove the project and add it again.
7. Disk space. Check that your computer has enough disk space for the projects to run. Running several BOINC projects requires plenty of disk space, and any other project could easily fill all the available space, leaving nothing for LHC@home.
8. Other CPU- and GPU-consuming tasks. Please make sure no other CPU- and GPU-hungry tasks are running simultaneously. Try to avoid heavy CPU and GPU load while crunching tasks for the LHC project.
9. Make sure no viruses or malware/adware are active. Please use an up-to-date antivirus with the latest signatures. Use freeware like AVP, AVAST, Avira, etc. Use Malwarebytes for malware. Use Microsoft Security Essentials on Windows.
Check cnet.com and filehippo.com for the latest software.
10. Try suspending other projects while test-crunching for LHC@home. Change the CPU settings and, on multiprocessor systems, change the resource share: instead of 0, put a specific share like 30%, 50% or 80% and see the output.
11.




==========
Please add any other ideas, and feel free to correct any of the proposed steps in this draft.
ID: 27326 · Report as offensive     Reply Quote
alvin
Joined: 12 Mar 12
Posts: 128
Credit: 20,013,377
RAC: 0
Message 27327 - Posted: 7 Apr 2015, 23:30:17 UTC
Last modified: 7 Apr 2015, 23:31:50 UTC

As a side thought, let's take a look at this:
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=30042916

Task      Computer  Sent                      Time reported or deadline  Status                              Run time (sec)  CPU time (sec)  Credit   Application
63303862  10342642  28 Mar 2015 UTC           28 Mar 2015, 4:23:43 UTC   Error while computing               0.00            0.00            ---      SixTrack v451.07
63303863  10287799  28 Mar 2015 UTC           4 Apr 2015, 16:49:49 UTC   Timed out - no response             0.00            0.00            ---      SixTrack v451.07 (sse2)
63853277  10298282  31 Mar 2015, 4:27:24 UTC  1 Apr 2015, 8:06:20 UTC    Completed, validation inconclusive  43,089.33       42,663.87       pending  SixTrack v451.07 (sse2)
64614634  9996388   5 Apr 2015, 8:47:28 UTC   5 Apr 2015, 8:49:39 UTC    Completed, validation inconclusive  2.25            0.30            pending  SixTrack v451.07 (pni)
64872554  10343493  7 Apr 2015, 19:24:12 UTC  15 Apr 2015, 10:56:26 UTC  In progress                         ---             ---             ---      SixTrack v451.07 (sse2)

We see zeros, different versions and timeouts: a full set of all the possible issues.
Could it be an x86 vs 64-bit compatibility issue? Could it be an sse2 vs pni compatibility issue?
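The 2.25 s result next to a 43,089.33 s one is the kind of mismatch a script could pick out. Below is a rough sketch using the rows from the work unit above. The 1% ratio and the function name are arbitrary choices of mine, and a short run is only a hint, not proof: some tasks may legitimately finish early.

```python
def suspicious_short_runs(results, ratio=0.01):
    """Return hosts whose completed run was far shorter than the longest one."""
    completed = [r for r in results if r["status"].startswith("Completed")]
    if len(completed) < 2:
        return []  # nothing to compare against
    longest = max(r["run_time"] for r in completed)
    return [r["host"] for r in completed if r["run_time"] < ratio * longest]


# Rows copied from work unit 30042916 (run time in seconds, None if not run yet).
results = [
    {"host": 10342642, "status": "Error while computing", "run_time": 0.00},
    {"host": 10287799, "status": "Timed out - no response", "run_time": 0.00},
    {"host": 10298282, "status": "Completed, validation inconclusive", "run_time": 43089.33},
    {"host": 9996388, "status": "Completed, validation inconclusive", "run_time": 2.25},
    {"host": 10343493, "status": "In progress", "run_time": None},
]

print(suspicious_short_runs(results))
```

Only completed results are compared, so the error and timeout rows (which report 0.00 anyway) and the task still in progress do not distort the check.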
ID: 27327 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27328 - Posted: 8 Apr 2015, 2:30:31 UTC - in response to Message 27327.  

Not a side thought; this is a very clear example of the biggest
problem AND it is a genuine production task (not one of my dud
wzero tasks). This must be a "new" host as it is not in my list of
a dozen or so bad hosts.

I do not believe there is a compatibility issue (but it is Windows 8)
and I never say never.

I am adding the host to the list as it has never produced a valid result! Just
192 failures with 0 CPU seconds. Eric.
ID: 27328 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27329 - Posted: 8 Apr 2015, 2:46:28 UTC - in response to Message 27328.  

I have added 10342642 and 10343493 to the banned list.
(10298282 costa seems to be OK :-).

This is ad hoc though. For the time being I will just send another post with my first thoughts. Eric.
ID: 27329 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27330 - Posted: 8 Apr 2015, 3:31:52 UTC - in response to Message 27325.  

First, as usual, many thanks to all who posted. It is great to have
this support. Priorities are RFP (Reliability, Functionality, Performance),
not forgetting Communication.

However, let us not forget that we are getting millions of tasks successfully
completed. It must be ten years now since we started but the quality of service
has deteriorated in the last year or so, even if the work done has really increased.

I am a volunteer too and feel responsible but have no authority. I don't have
all permissions like root password (and I don't want them as I make too
many mistakes!). I have tried to avoid being too involved in the BOINC
infrastructure itself as I am pretty much full time on SixTrack portability,
debugging compilers, maintaining the CERN customer SixTrack/SixDesk
scripts and utilities and researching the portability issues.
I also have to fight the limited CERN infrastructure and remember
we don't have a budget either. Still the CERN experiments are perhaps
jumping on the bandwagon, so we shall see.

Don't forget either we are pretty much unique in getting identical,
0 ULP, results on "any" PC on all flavours of Windows, Linux (and
soon Mac again), with different compilers, at all levels of standard
compliant optimisation.
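For readers unfamiliar with the term, "0 ULP" means two floating-point results agree down to the last bit. Here is a small sketch of how a ULP distance between two double-precision numbers can be measured; it is a generic illustration, not the check the SixTrack validator actually uses, and it ignores NaNs.

```python
import struct


def ordered_bits(x):
    """Map a float64 to an integer so that adjacent floats differ by exactly 1."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats are stored as sign + magnitude; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a, b):
    """Number of representable doubles between a and b (0 means equal values)."""
    return abs(ordered_bits(a) - ordered_bits(b))


print(ulp_distance(1.0, 1.0))        # 0: the same value
print(ulp_distance(0.1 + 0.2, 0.3))  # 1: off by one unit in the last place
```

Note that this treats +0.0 and -0.0 as 0 ULP apart; a strict bit-for-bit comparison would compare the packed bytes directly instead.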

I agree procedures must be automated, as no one has the time to spend all day
checking logs etc. Communication and information are vital; we must let clients
know what is going on, and perhaps use the boinc_dev etc. mailing lists, as
Berkeley support from Professor Anderson's team seems to be very, very good.

"BOINC Manager Notices
It's unfair to users to just silently block their hosts. We need to explain and make some proposals for how to fix it." AGREED.

"Also, the user needs to be able to resume a host in a testing phase, to make sure the host is working correctly before it returns to full production." AGREED, SixTrack is a great test.

"It might be worth establishing a new sticky thread, or leaving this one, ONLY for host IDs and the procedure to fix or resolve them, with limited ability for the rest of us to comment: only admins would fill it in, making it easy to get through the lists of hosts." AGREED, or something similar.

"These training tasks could be deployed in limited numbers to willing users, and I'm ready to sacrifice some of my machine time to produce results for comparison." THANKS again.

Now I think it is a question of priorities; when there is too much to do PRIORITISE.

Sorry to waffle, but it will take some time and collaboration to get this sorted.
I am a troubleshooter, a problem solver, at heart and HATE all the management issues,
which I thought I could avoid in retirement. They have to be sorted though. Eric.
ID: 27330 · Report as offensive     Reply Quote


©2024 CERN