Message boards : News : Status and Plans, Sunday 4th November

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 24924 - Posted: 4 Nov 2012, 15:08:44 UTC

First, the service continues to run well; the first intensity scan is nearing completion, with well over a million results in 15 studies successfully returned. Just a couple of hundred thousand more!
(Sadly no one study is complete but a couple are very close and I shall start post-processing and analysis soon. I am still reflecting on the thread "Number crunching; WU not being sent to another user".
This is not easy, trying to get studies complete, but keeping the system busy. I am the "feeder" and since in the end I need all the studies I am rather prioritising keeping WUs available.)

Just checked: we have over 80,000 (yes, eighty thousand) WUs active, and this is a new (recent) record.

Draft documentation of the User side is now available thanks to my colleague R. Demaria. If you are interested
[url=SixDesk Doc]http://sixtrack-ng.web.cern.ch/sixtrack-ng/[/url]
and I hope you can access it (otherwise I shall put a copy to LHC@home).

Right now I hope to try new executables with new physics on our test server and I might shortly appeal for some volunteers to help (and also to run a few more 10 million turn jobs). I do NOT want to risk the production service while it is running so smoothly.

Otherwise (At Last!) I shall start writing my paper on how to get identical results on ANY IEEE 754 hardware with ANY standard compiler at ANY level of optimisation. Thanks to all. Eric.
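
Some background on why identical results are not automatic: floating-point addition is not associative, so a compiler that reorders a sum at a higher optimisation level can change the last bits of the result. A minimal Python illustration of the effect, assuming IEEE 754 double precision (which Python floats use on common platforms); it is not SixTrack code and says nothing about how SixTrack deals with the problem.

```python
# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two orders disagree in the last bit of the result.
print(left_to_right == right_to_left)   # False
print(left_to_right, right_to_left)     # 0.6000000000000001 0.6
```

Over a million turns of tracking, a last-bit difference like this can grow until two hosts no longer return matching results.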

ID: 24924
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 24925 - Posted: 4 Nov 2012, 15:31:34 UTC - in response to Message 24924.  

1) SixDesk Doc is accessible here, but you have to swap the parameters to url= over :P

2) I've just come across a wee problemette - WU 4413334. I think the middle one should have been set to 'invalid' after the third user reported - I'm not sure whether it will be properly marked off as completed like this.

3) Stick my name on the list of volunteers for the new app and the 10M turn jobs. Host 9990937 is set up for quick turnround.
ID: 24925
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 24928 - Posted: 5 Nov 2012, 7:47:36 UTC - in response to Message 24925.  

Thanks Richard.

The correct URL is (I hope):
[url=http://sixtrack-ng.web.cern.ch/sixtrack-ng/]SixDesk Doc[/url]
I shall look at the WU you mention at the office.

I have also noticed a reduction in the average run time of the WUs. I
suspect this is because the study w13cbb has the highest bunch charge
and is therefore showing the onset of chaos and lost particles well
before the million turns are completed. Eric.
ID: 24928
mikey
Joined: 30 Oct 11
Posts: 26
Credit: 4,772,123
RAC: 6
Message 24929 - Posted: 5 Nov 2012, 14:56:19 UTC - in response to Message 24925.  


2) I've just come across a wee problemette - WU 4413334. I think the middle one should have been set to 'invalid' after the third user reported - I'm not sure whether it will be properly marked off as completed like this.


I think the problem is here, this seems to be a normal unit:
minimum quorum 2
initial replication 2

while this is the unit in question:
minimum quorum 2
initial replication 3

If a unit is INITIALLY sent to three PCs but only two are required for validation and ALL units are returned prior to the deadline, how does the Server side handle all three units? Yes, the numbers are MUCH different for the second PC than for the other two, but shouldn't BOINC have granted credits based on the first and second PCs and NOT used the third one, since the first two were NOT marked as invalid? OR does "inconclusive" mean the same thing to BOINC?

A further question is what would have happened to the third unit if the second was not "inconclusive"? Would it have been aborted? What if the pc was half way thru crunching it, or even only had seconds left to finish? One would HOPE that the third unit would have been allowed to finish and be returned and ALSO granted credits, IF it returned a valid result. Especially since the user was NOT at fault for receiving it.
ID: 24929
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 24930 - Posted: 5 Nov 2012, 15:32:08 UTC - in response to Message 24929.  

initial replication 3

No, don't worry about that. It's a well-known terminological inexactitude (mistake) in the BOINC server code.

Two tasks were sent out on 30 Oct 2012, an initial replication of two.

When they failed to agree, a third instance was created and sent out on 3 Nov 2012, to make a current replication of 3.

BOINC updates the number, but it doesn't update the word.
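
The sequence above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the BOINC server code: the `WorkUnit` class, its `report` method and the status strings are hypothetical names, and results are compared as plain strings.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WorkUnit:
    """Toy model of one workunit (hypothetical, not BOINC's data model)."""
    min_quorum: int = 2
    replication: int = 2  # tasks created so far; the page labels this "initial replication"
    outputs: list = field(default_factory=list)

    def report(self, output: str) -> str:
        """Record one returned result and run a simplified validation pass."""
        self.outputs.append(output)
        if len(self.outputs) < self.min_quorum:
            return "pending"
        # Valid as soon as min_quorum results agree with each other.
        _, count = Counter(self.outputs).most_common(1)[0]
        if count >= self.min_quorum:
            return "valid"
        # No agreement and nothing outstanding: create one tie-breaker task.
        if len(self.outputs) >= self.replication:
            self.replication += 1
        return "inconclusive"


wu = WorkUnit()
print(wu.report("result A"))  # pending
print(wu.report("result B"))  # inconclusive
print(wu.replication)         # 3
print(wu.report("result A"))  # valid
```

After the two disagreeing results the counter reads 3, although only two tasks were created at the start, which is the pattern shown on the WU page.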
ID: 24930
mikey
Joined: 30 Oct 11
Posts: 26
Credit: 4,772,123
RAC: 6
Message 24939 - Posted: 6 Nov 2012, 15:10:28 UTC - in response to Message 24930.  
Last modified: 6 Nov 2012, 15:18:20 UTC

Sort of...YES 2 were initially sent out, but the 1st unit errored out the same day and the 3rd unit was sent out 2 days AFTER the 2nd unit was returned to the Project. 2 units were sent out 30 Oct, 1st unit returned the same day, 1 unit returned 1 Nov. The 3rd unit was sent out 3 Nov and returned 4 Nov. This could be due to VERY slow Server responses or a Server 'glitch' that ended up causing the current situation. The replacement unit SHOULD have been sent out immediately after the 'inconclusive' unit was returned, NOT 3 days later.

COULD this be a part of the problems of not sending the bad units to another user? Is the Server NOT recognizing the invalid or inconclusive units properly and therefore NOT resending the units?
ID: 24939
Richard Haselgrove
Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 24941 - Posted: 6 Nov 2012, 18:29:38 UTC - in response to Message 24939.  

Sorry, not true (if we're looking at the same workunit). Both of the first two tasks returned 'success' status, so no problem could possibly be detected until the second report was received at 1 Nov 2012 | 6:44:33 UTC, and the validator was able to detect the mismatch.

After that, the third - tie-breaker - task 9891207 was created at 1 Nov 2012 | 6:44:48 UTC. 15 seconds for task creation isn't excessive: the delay between creation and distribution is a queue function, as we've discussed elsewhere.
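
The two timestamps quoted above can be checked with a short calculation. A minimal Python sketch, assuming only the times given in this thread; the variable names are illustrative, and since only the date of the send (3 Nov 2012) is mentioned, the queue wait is computed in whole days.

```python
from datetime import date, datetime, timezone

# Times quoted in this thread for WU 4413334 (all UTC).
second_report = datetime(2012, 11, 1, 6, 44, 33, tzinfo=timezone.utc)
tiebreaker_created = datetime(2012, 11, 1, 6, 44, 48, tzinfo=timezone.utc)

# Delay between the validator seeing the mismatch and the new task existing.
creation_delay_s = (tiebreaker_created - second_report).total_seconds()

# Only the send date is known, so measure the queue wait in days.
sent_on = date(2012, 11, 3)
queue_wait_days = (sent_on - tiebreaker_created.date()).days

print(f"task creation delay: {creation_delay_s:.0f} s")      # 15 s
print(f"time in send queue: about {queue_wait_days} days")   # about 2 days
```

The server reacted within seconds; the two days were spent waiting in the queue for a host to request work.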
ID: 24941
jujube
Joined: 25 Jan 11
Posts: 179
Credit: 83,858
RAC: 0
Message 24943 - Posted: 6 Nov 2012, 22:11:40 UTC - in response to Message 24941.  

Good analysis, Richard. Appears that is exactly the way it went down.
ID: 24943
mikey
Joined: 30 Oct 11
Posts: 26
Credit: 4,772,123
RAC: 6
Message 24950 - Posted: 9 Nov 2012, 14:42:47 UTC - in response to Message 24941.  

Ahh I see, what I am seeing is the queue delay, okay that makes sense, THANKS!
ID: 24950
[AF>FAH-Addict.net]toTOW
Joined: 9 Oct 10
Posts: 77
Credit: 3,671,357
RAC: 0
Message 24954 - Posted: 10 Nov 2012, 18:21:21 UTC

I'm ready for more huge WUs too :)

Like I said before, I think it would be great to keep them alongside the regular simulations, but in a separate "application", so that donors can select it in their preferences according to what they want to get.
ID: 24954
Hans Sveen
Joined: 2 Sep 04
Posts: 21
Credit: 4,038,144
RAC: 0
Message 24965 - Posted: 22 Nov 2012, 9:54:22 UTC - in response to Message 24924.  
Last modified: 22 Nov 2012, 9:54:48 UTC

Hello!
Good work, Thank You!
I'm ready to do some testing for the new science and executables, and I would also like some more of the LONG WUs!
If needed, I would even run 100 billion turn WUs!

Good luck!

Greetings from
Hans Sveen
Oslo, Norway


ID: 24965