Message boards : Number crunching : Server problems

Profile Markku Degerholm

Send message
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 3111 - Posted: 2 Oct 2004, 16:12:40 UTC
Last modified: 2 Oct 2004, 16:13:44 UTC

We had to disable even the message boards for a while; the system was really stuck. There is certainly a limit on how many database connections our server can handle at a time... But there was also a bug in the main page which unnecessarily made database queries each time it was loaded. We fixed that, and also tuned the database server, so now the forums are open again. Let's hope the system works better now.

Results and Pending credits pages have been disabled until we see that we have enough capacity to handle them.


Markku Degerholm
LHC@home Admin
ID: 3111 · Report as offensive     Reply Quote
Guido Alexander Waldenmeier

Send message
Joined: 2 Sep 04
Posts: 321
Credit: 10,607
RAC: 0
Message 3112 - Posted: 2 Oct 2004, 16:20:20 UTC
Last modified: 2 Oct 2004, 16:59:22 UTC

With 400-500 new users every day, you need a -Real-Server-, not a PC as server ;-)



ID: 3112 · Report as offensive     Reply Quote
Profile Alex

Send message
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 3121 - Posted: 2 Oct 2004, 18:36:15 UTC - in response to Message 3112.  

> With 400-500 new users every day, you need a -Real-Server-, not a PC as server ;-)


http://msteiner.home.cern.ch/msteiner/computercenter-visit/index.html
______________________________________________________________
Did your tech wear a static strap? No? Well, there ya go! :p
ID: 3121 · Report as offensive     Reply Quote
Guido Alexander Waldenmeier

Send message
Joined: 2 Sep 04
Posts: 321
Credit: 10,607
RAC: 0
Message 3122 - Posted: 2 Oct 2004, 18:42:07 UTC

Nice animals, Alex ;-) But are these the servers of the lhc@home project? I think NO.
http://guidowaldenmeier.de


ID: 3122 · Report as offensive     Reply Quote
Profile Markku Degerholm

Send message
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 3126 - Posted: 2 Oct 2004, 19:15:47 UTC - in response to Message 3122.  

> Nice animals, Alex ;-) But are these the servers of the lhc@home project? I think NO.

Well, those servers are running just a few floors below me. But no, they are not for LHC@home :( (but I think we have more computing power :)

Anyway, we will get another server soon.

Markku Degerholm
LHC@home Admin
ID: 3126 · Report as offensive     Reply Quote
Toby

Send message
Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 48
Message 3138 - Posted: 3 Oct 2004, 0:20:04 UTC

Are you planning on doing the same thing seti@home has done, i.e. setting up a replicated MySQL server to handle the XML stats and much of the website (and maybe other things)? Or is the bottleneck somewhere else right now? In hindsight it looks like there might be a design flaw in BOINC. I think too much is being handled by one database. The message boards (for example) should be in a completely separate database (IMHO) so that when the main database goes down, people can still get to the message boards and see what is going on. It would also take some load off the main DB. But then who am I to speak? Just a newly graduated programmer with no experience :)
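
As a rough illustration of the separation suggested above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `DatabaseRouter` class, the table names in `FORUM_TABLES`, and the use of in-memory SQLite as a stand-in for two independent MySQL servers. It is not how BOINC itself is organised.

```python
import sqlite3

# Hypothetical table names; the real BOINC schema may differ.
FORUM_TABLES = {"forum_post", "forum_thread"}


class DatabaseRouter:
    """Send forum queries to one database and everything else to another."""

    def __init__(self, main_db, forum_db):
        self._main_db = main_db
        self._forum_db = forum_db

    def for_table(self, table: str):
        # Forum tables live on their own connection, so they stay
        # reachable even when the main database is unavailable.
        return self._forum_db if table in FORUM_TABLES else self._main_db


# Two in-memory SQLite databases stand in for two separate servers.
main_db = sqlite3.connect(":memory:")
forum_db = sqlite3.connect(":memory:")
router = DatabaseRouter(main_db, forum_db)

forum_db.execute("CREATE TABLE forum_post (id INTEGER, body TEXT)")
forum_db.execute("INSERT INTO forum_post VALUES (1, 'Server problems')")

# Simulate the main database going down; the forum is still readable.
main_db.close()
rows = router.for_table("forum_post").execute(
    "SELECT body FROM forum_post"
).fetchall()
```

The point of the sketch is only that the forum pages never touch the main connection, so a stuck main database would not take the message boards down with it.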


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
ID: 3138 · Report as offensive     Reply Quote
STE\/E

Send message
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 3139 - Posted: 3 Oct 2004, 0:26:25 UTC

Results and Pending credits pages have been disabled until we see that we have enough capacity to handle them.
==========

Sigh, it's déjà vu all over again. Seems like I went through this with BOINC Seti & they still haven't got it right over there yet...

The problems didn't start until you went to 5000 users & will only get worse as you add more users... It's obvious the hardware can't handle the load...

ID: 3139 · Report as offensive     Reply Quote
EclipseHA

Send message
Joined: 18 Sep 04
Posts: 47
Credit: 1,886,234
RAC: 0
Message 3141 - Posted: 3 Oct 2004, 1:28:23 UTC - in response to Message 3139.  

> Results and Pending credits pages have been disabled until we see that we have
> enough capacity to handle them.
> ==========
>
> Sigh, it's Deja Vu all over again. Seems like I went through this with BOINC
> Seti & they still haven't got it right over there yet...

Let's hope we hear nothing about a SNAP Appliance! :)
ID: 3141 · Report as offensive     Reply Quote
Profile Alex

Send message
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 3142 - Posted: 3 Oct 2004, 2:59:01 UTC - in response to Message 3141.  


>
> Let's hope we hear nothing about a SNAP Appliance! :)
>
>

Their tech guy is wearing it.

What he's not wearing is a static strap.


______________________________________________________________
Did your tech wear a static strap? No? Well, there ya go! :p
ID: 3142 · Report as offensive     Reply Quote
Profile PeterHallgarten

Send message
Joined: 2 Sep 04
Posts: 14
Credit: 33,774
RAC: 0
Message 3144 - Posted: 3 Oct 2004, 4:48:07 UTC
Last modified: 3 Oct 2004, 9:06:44 UTC

Hi All,

As a person who works in the IT industry on projects of all sorts of scales, I know it is sometimes hard to fully spec the requirements of a system as it grows, i.e. how many servers to split the various components of the system over, and what the server requirements and specifications are.

We also have to realise that BOINC is new as well and its design is still evolving, and only by putting it through large-scale projects like Seti etc. will bugs be killed. I looked at Seti and they are spread over 5 servers and are processing:

"State            Approximate #results
Ready to send    1,262,222
In progress      3,966,013"

This is a huge amount of data to handle.

Reconfiguring running live systems is very complicated, and my hat is off to the technical support people at each of these projects.

73 de Peter VK3AVE


ID: 3144 · Report as offensive     Reply Quote
Heffed

Send message
Joined: 2 Sep 04
Posts: 71
Credit: 8,657
RAC: 0
Message 3145 - Posted: 3 Oct 2004, 5:04:31 UTC - in response to Message 3141.  

> Let's hope we hear nothing about a SNAP Appliance! :)

Didn't I read that SNAP is donating the hardware for the actual accelerator? ;)

Seriously though, I think a lot of the current load is these "chaotic" WUs they are currently testing. You can get a dozen of these bad boys and be done and asking for more in a minute.

Multiply by 5000+ users and that can't help matters any...

ID: 3145 · Report as offensive     Reply Quote
STE\/E

Send message
Joined: 2 Sep 04
Posts: 352
Credit: 1,393,150
RAC: 0
Message 3150 - Posted: 3 Oct 2004, 11:00:27 UTC
Last modified: 7 Oct 2004, 9:41:08 UTC

Haha ... funny picture, Alex ... You're right also, Heff: on my 3.4 only 2 out of the last 14 Tunescan WUs actually made a full run. The other 12 are only in the 1 min to 20 min range ... It doesn't take long to run through a mess of them at that rate ...




ID: 3150 · Report as offensive     Reply Quote
joe

Send message
Joined: 2 Sep 04
Posts: 24
Credit: 12,288
RAC: 0
Message 3152 - Posted: 3 Oct 2004, 12:21:40 UTC - in response to Message 3111.  

> ... But there was also bug in the main page which
> unnecessarily made database queries each time it was loaded. We fixed that,
> ...


The main page still tries to access the database :

Server Status

Up,
Warning: Too many connections in /shift/lxfsrk429/data01/boinc/projects/lhcathome/html/inc/db_ops.inc on line 11

Warning: MySQL Connection Failed: Too many connections in /shift/lxfsrk429/data01/boinc/projects/lhcathome/html/inc/db_ops.inc on line 11
Unable to connect to database - please try again laterToo many connections
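
The "Too many connections" warning means more clients asked for a connection than the MySQL server was configured to accept. One common way to soften this on the application side is to cap how many requests may hold a connection at once and turn the rest away quickly. The sketch below shows that idea in Python; `ConnectionGate`, its `limit`, and the timeout are hypothetical names and values chosen for illustration, not part of the LHC@home or BOINC code.

```python
import threading
from contextlib import contextmanager


class ConnectionGate:
    """Allow at most `limit` callers to hold a database connection at once."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def slot(self, timeout: float = 0.5):
        # Wait briefly for a free slot instead of piling onto the database.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("server busy, please try again later")
        try:
            yield
        finally:
            # Always hand the slot back, even if the query failed.
            self._slots.release()
```

A page handler would wrap its database work in `with gate.slot():` and show a "busy" notice when `TimeoutError` is raised. Note that this only limits connections within a single process; a site served by many separate web server processes would still need a limit on the database side or a shared pooling layer.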
ID: 3152 · Report as offensive     Reply Quote
Profile Markku Degerholm

Send message
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 3154 - Posted: 3 Oct 2004, 14:41:58 UTC - in response to Message 3152.  

>
> The main page still tries to access the database :

Yes, but not every time the page is loaded. The server status cell is generated by a cron process every 10 minutes or so, and if the database is stuck then it will show the message there too. It's not pretty, though - maybe it should check for the error and say that the system is under heavy load in this case.
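
A minimal sketch of that check, written in Python rather than the site's PHP: a script run from cron renders the status cell into a static file, and a database failure is replaced by a friendly notice instead of the raw warning. The function names, the `fetch_status` callable, and the wording of the messages are assumptions for illustration only.

```python
from typing import Callable


def render_status_cell(fetch_status: Callable[[], str]) -> str:
    """Return the text for the server status cell.

    fetch_status is a hypothetical callable that queries the database
    and returns a short status string; it may raise on failure.
    """
    try:
        return f"Server Status: {fetch_status()}"
    except Exception:
        # Hide the raw database warning from visitors.
        return "Server Status: the system is under heavy load, please try again later"


def write_status_file(path: str, fetch_status: Callable[[], str]) -> None:
    """Meant to be called from a cron job every 10 minutes or so."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_status_cell(fetch_status))
```

Because the main page only includes the generated file, it makes no database query of its own, whatever state the database is in.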

Markku Degerholm
LHC@home Admin
ID: 3154 · Report as offensive     Reply Quote
Profile Bruno G. Olsen & ESEA @ greenh...

Send message
Joined: 17 Sep 04
Posts: 52
Credit: 247,983
RAC: 0
Message 3155 - Posted: 3 Oct 2004, 14:47:04 UTC

Yup, there are server problems :D

LHC@home - 2004-10-03 14:57:33 - Scheduler RPC to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
LHC@home - 2004-10-03 14:57:33 - SCHEDULER_REPLY::parse(): bad first tag Content-type: text/plain
LHC@home - 2004-10-03 14:57:33 - Can't parse scheduler reply
LHC@home - 2004-10-03 15:17:26 - Scheduler RPC to http://lhcathome-sched1.cern.ch/scheduler/cgi failed
LHC@home - 2004-10-03 15:17:26 - No schedulers responded

Those are not single instances ;) I'm guessing too many connections ;)

Upload seems to work fine though.


ID: 3155 · Report as offensive     Reply Quote
Guido Alexander Waldenmeier

Send message
Joined: 2 Sep 04
Posts: 321
Credit: 10,607
RAC: 0
Message 3156 - Posted: 3 Oct 2004, 14:59:56 UTC

@cern staff
It's hard to say, but it would be best to stop creating new accounts and fix the errors.
Every day there are new users, and you have more trouble with the -Mickey Mouse MySQL-.
It's good for private homepages or a small company, but NOT for a great project with
1000 and more users ;-(
http://guidowaldenmeier.de


ID: 3156 · Report as offensive     Reply Quote
Profile Markku Degerholm

Send message
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 3158 - Posted: 3 Oct 2004, 15:34:36 UTC - in response to Message 3156.  

> @cern staff
> It's hard to say, but it would be best to stop creating new accounts and fix the errors.
> Every day there are new users, and you have more trouble with the -Mickey Mouse MySQL-.
> It's good for private homepages or a small company, but NOT for a great project with
> 1000 and more users ;-(
> http://guidowaldenmeier.de

Our user limit is at 5000 and stays there until we can serve more. Currently we have about 4800 registered users with 6700 active hosts. Problems started at about 4000 users, I think.

Having experience with many different databases, I don't know of any that performs outstandingly better than MySQL. Most likely the problem is with our storage system, because the CPU load stays below 50% at all times. We have a triple-controller SCSI system with RAID-0 striped disks, which should be good enough... but apparently not. Or maybe there is something else we are missing.

Markku Degerholm
LHC@home Admin
ID: 3158 · Report as offensive     Reply Quote
Profile Alex

Send message
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 3168 - Posted: 4 Oct 2004, 3:27:19 UTC
Last modified: 4 Oct 2004, 4:12:13 UTC



The seti@home guys had similar problems when they left 'beta' and went 'live'.

They had charts and graphs to show throughput on their network.
Seti's problems were hard to diagnose. There were a lot of people on the message boards with their own amateur theories (viruses, networks, various hardware issues, more memory, etc.).
The Seti guys first diagnosed their network as the culprit... then the problem came back... they had a few resets of servers too.

From the seti news boards:
http://setiweb.ssl.berkeley.edu/old_news.php

Maybe LHC is running into the same bugs that Seti ran into... (I'm thinking maybe their Aug 4th 'too many files in upload folder'.)

Whatever the problem is.. you guys will have a lot of things to check.
Best of luck.

ID: 3168 · Report as offensive     Reply Quote
Profile sysfried

Send message
Joined: 27 Sep 04
Posts: 282
Credit: 1,415,417
RAC: 0
Message 3172 - Posted: 4 Oct 2004, 8:52:29 UTC

I think it's weird that LHC staff can't decide whether to send out large WUs (vtune iirc) or standard ones.

My dual-CPU system has some 40 WUs ready to run... and that's approx. 2 days of work. Why not continue to send out large WUs? Hey, we could live with that...

And it would reduce load on the servers, because there would be fewer downloads and fewer WUs to handle...

just my 2 cents...


ID: 3172 · Report as offensive     Reply Quote
Profile Markku Degerholm

Send message
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 3179 - Posted: 4 Oct 2004, 9:57:13 UTC - in response to Message 3172.  

> I think it's weird that LHC staff can't decide whether to send out large WU's
> (vtune iirc) or standard ones.
>

The point is that those short WUs are needed just as much as the long ones. But we try to generate a mix of short and long work units such that the average is good and we get the short ones crunched as well. I think it will take a few more days before our physicists are able to start submitting those longer jobs.

Markku Degerholm
LHC@home Admin
ID: 3179 · Report as offensive     Reply Quote


©2024 CERN