Message boards : Number crunching : New server stuff
Markku Degerholm
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 3859 - Posted: 15 Oct 2004, 23:53:27 UTC
Last modified: 16 Oct 2004, 6:32:48 UTC

Some changes on the server side:
-more effective job transfer system from physicists to server
-download & upload directory hierarchy enabled
-each result must match with two others (better validator is still on todo-list)

The changes shouldn't affect users in any way... But as some changes to server code and configuration were needed, bad server behaviour may occur.

Download errors (MD5 checksum failure) and Fortran errors are still too common, but there are no fixes for them for the time being.

The good news is that about 100,000 work units are getting ready to be crunched, which means about 400,000 results will be generated. The bad news is that the servers will be under high load for some time... Let's hope they can hold up.
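
For reference, the ratio of results to work units and the rule that a result must match two others can be sketched roughly as below. This is only an illustration: the replication factor of 4 is inferred from the two figures above rather than from any server configuration, and `has_quorum` with its exact-equality comparison is a hypothetical stand-in, not the project's validator.

```python
from collections import Counter

# Assumption: 4 copies per work unit, inferred from 400000 / 100000.
RESULTS_PER_WORK_UNIT = 4
# A result plus the two others it must match.
QUORUM = 3


def total_results(work_units: int, copies: int = RESULTS_PER_WORK_UNIT) -> int:
    """Number of results generated when every work unit is sent out `copies` times."""
    return work_units * copies


def has_quorum(outputs: list[str], quorum: int = QUORUM) -> bool:
    """Return True if at least `quorum` of the returned outputs are identical.

    Exact string equality is a simplification chosen for this sketch.
    """
    if not outputs:
        return False
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count >= quorum


print(total_results(100_000))
print(has_quorum(["a", "a", "a", "b"]))
```

With four copies out and three agreeing, one bad or missing result still leaves the work unit valid.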

Markku Degerholm
LHC@home Admin
sysfried
Joined: 27 Sep 04
Posts: 282
Credit: 1,415,417
RAC: 0
Message 3860 - Posted: 16 Oct 2004, 0:17:50 UTC

My Opteron running Win 2003 Server beta x64 gets this error... I don't know whether it has to do with BOINC 4.13 or the current high server load... Maybe someone can help me?

LHC@home - 2004-10-16 02:09:16 - Master file fetch failed

Markku Degerholm
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 3861 - Posted: 16 Oct 2004, 0:25:07 UTC - in response to Message 3860.  

> My Opteron running win 2003 server beta x64 gets this error... i don't know
> whether it has to do with boinc 4.13 or the current high server load... maybe
> someone can help me?
>
> LHC@home - 2004-10-16 02:09:16 - Master file fetch failed

Does it repeat? I mean, are you able to get work at all?




Markku Degerholm
LHC@home Admin
Toby
Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 48
Message 3862 - Posted: 16 Oct 2004, 0:58:23 UTC

I'm seeing some download errors, but I think they are just a sign of high server load. It took about 10 minutes to download all the work my client had requested, and some of the downloads failed. Example:

LHC@home - 2004-10-15 19:44:41 - Giving up on download of v64lhc1000proten-45s8_1056.48_1_sixvf_72.zip: Downloaded file had wrong size: expected 276996, got 0
LHC@home - 2004-10-15 19:44:41 - MD5 computation error for v64lhc1000proten-45s8_1056.48_1_sixvf_72.zip: -108
LHC@home - 2004-10-15 19:44:41 - Checksum or signature error for v64lhc1000proten-45s8_1056.48_1_sixvf_72.zip
LHC@home - 2004-10-15 19:44:41 - Unrecoverable error for result v64lhc1000proten-45s8_1056.48_1_sixvf_72_3 (WU download error: couldn't get input files: v64lhc1000proten-45s8_1056.48_1_sixvf_72.zip: MD5 computation error)


I'm not sure why BOINC doesn't time out and try again later. Maybe the server is actually opening a connection but then not sending any data so the client thinks it has received something and tries to handle it like a successful download.
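
The log above (expected 276996 bytes, got 0) suggests a check along these lines: compare the size first and treat an empty or short body as something to retry later, and only then compare the MD5. This is a minimal sketch under that assumption. `check_download` and `backoff_delays` are hypothetical names, and this is not how the BOINC client is actually implemented.

```python
import hashlib


def check_download(data: bytes, expected_size: int, expected_md5: str) -> str:
    """Classify a downloaded payload as 'ok', 'retry', or 'corrupt'."""
    if len(data) != expected_size:
        # Empty or truncated body: probably a transient server problem,
        # so retrying later makes more sense than giving up.
        return "retry"
    if hashlib.md5(data).hexdigest() != expected_md5.lower():
        # Right size but wrong content.
        return "corrupt"
    return "ok"


def backoff_delays(attempts: int, base_seconds: float = 60.0,
                   cap_seconds: float = 3600.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... up to the cap."""
    return [min(base_seconds * 2 ** i, cap_seconds) for i in range(attempts)]


payload = b"example input file"
digest = hashlib.md5(payload).hexdigest()
print(check_download(payload, len(payload), digest))
print(check_download(b"", 276996, digest))
```

Checking the size before the checksum keeps a zero-byte response from being reported as an MD5 failure.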

I got 42 work units. Maybe 6 or 7 failed. Didn't keep exact count :)


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
para_doks
Joined: 2 Sep 04
Posts: 9
Credit: 316,683
RAC: 0
Message 3864 - Posted: 16 Oct 2004, 1:07:37 UTC - in response to Message 3859.  
Last modified: 16 Oct 2004, 1:16:59 UTC

> -each work unit must match with two others (better validator is still on
> todo-list)

I hope that means the credit system has been changed from "lowest of first two results" to "middle of three"
(or maybe something like "middle of first three", since new WUs are sent to four users).
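
The difference between the two credit rules can be shown with a small sketch. The function names and the claimed-credit values are made up for illustration; this is not the project's actual credit code.

```python
from statistics import median


def credit_lowest_of_first_two(claims: list[float]) -> float:
    """Grant the lower of the first two claimed credits."""
    if len(claims) < 2:
        raise ValueError("need at least two claims")
    return min(claims[:2])


def credit_middle_of_first_three(claims: list[float]) -> float:
    """Grant the median of the first three claimed credits."""
    if len(claims) < 3:
        raise ValueError("need at least three claims")
    return median(claims[:3])


# Hypothetical claimed credits, in the order the results came back.
claims = [12.0, 30.0, 15.0, 14.0]
print(credit_lowest_of_first_two(claims))
print(credit_middle_of_first_three(claims))
```

With the median, a single unusually low or high claim among the first three no longer decides the granted credit.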
[AF_FRANCE_ALPES] Philibert
Joined: 22 Sep 04
Posts: 3
Credit: 5,355
RAC: 0
Message 3868 - Posted: 16 Oct 2004, 6:57:22 UTC

Hello,

Received 19 WUs
OK 19/19

P4 3.0 GHz, 512 MB, XP Home SP2

Thank you

sysfried
Joined: 27 Sep 04
Posts: 282
Credit: 1,415,417
RAC: 0
Message 3870 - Posted: 16 Oct 2004, 7:29:21 UTC - in response to Message 3861.  

> > My Opteron running win 2003 server beta x64 gets this error... i don't know
> > whether it has to do with boinc 4.13 or the current high server load... maybe
> > someone can help me?
> >
> > LHC@home - 2004-10-16 02:09:16 - Master file fetch failed
>
> Does it repeat? I mean, are you able to get work at all?
>
> Markku Degerholm
> LHC@home Admin

I added S@H and CPN to BOINC and they didn't download anything at all... so I reinstalled BOINC from scratch, since there were no WUs left to crunch.

Now I have a few WUs :-) Thanks
Gaspode the UnDressed
Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 3875 - Posted: 16 Oct 2004, 9:33:38 UTC
Last modified: 16 Oct 2004, 9:36:10 UTC

> Bad news is that servers will be on high load for some time... Let's hope they can hold it.

Oops - several database connection limit messages in the last few minutes...
09:32 UTC, @439

Giskard - the first telepathic robot.


B-Roy
Joined: 1 Sep 04
Posts: 55
Credit: 21,144
RAC: 20
Message 3876 - Posted: 16 Oct 2004, 9:59:14 UTC

"Database overload - please hold connections"
At least the forums are working again.
STE\/E
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 3881 - Posted: 16 Oct 2004, 12:35:37 UTC

> It took about 10 minutes to download all the work my client had requested and some of them failed.

It took me about an hour to download, I guess, 100 WUs to each of 2 different PCs, over cable no less. 40 of them failed to download, though... 40 out of 200; I guess that's an acceptable failure rate. At least I got 160 of them through the pipeline anyway... :)
BarkerJr
Joined: 30 Sep 04
Posts: 21
Credit: 50,260
RAC: 0
Message 3883 - Posted: 16 Oct 2004, 14:30:39 UTC

Do all of these PHP connections use persistent connections? If not, why not?
sysfried
Joined: 27 Sep 04
Posts: 282
Credit: 1,415,417
RAC: 0
Message 3978 - Posted: 18 Oct 2004, 20:54:32 UTC - in response to Message 3859.  

> Some changes on the server side:

> Good news is that there are about 100000 work units getting ready to be
> crunched. This means that about 400000 results will be generated. Bad news is
> that servers will be on high load for some time... Let's hope they can hold
> it.
>
> Markku Degerholm
> LHC@home Admin
>

Dear Markku.

I wonder whether LHC will ever be able to generate enough WUs (no matter how much CPU power you put on the server side, there will be much more on the client side) to keep the project running with 50,000 or 500,000 participants.

We've seen 400,000 results generated, and they were downloaded in less than 1 day, if I'm not mistaken.

I hope you guys have more work (speaking in terms of WUs) for us in the next 2 years than we will be able to process! ;-) Or at least a matching amount! ;-)

Greetings,

Thorsten