Message boards : Number crunching : Server problems
Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0

We had to disable even the message boards for a while; the system was really bogged down. There is certainly a limit to how many database connections our server can handle at a time... But there was also a bug in the main page which unnecessarily made database queries every time it was loaded. We fixed that, and also tuned the database server, so now the forums are open again. Let's hope the system works better now.

The Results and Pending credits pages have been disabled until we see that we have enough capacity to handle them.

Markku Degerholm
LHC@home Admin
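(For context, the usual way to stop every page view from opening a database connection is to have a periodic job write the expensive numbers to a flat file and let the page simply include that file. A minimal PHP sketch of the idea; the path, credentials and query below are placeholders, not the actual LHC@home code.)

```php
<?php
// Hypothetical front-page cache: refresh the stats fragment at most once
// every 10 minutes instead of querying MySQL on every page view.

$cache_file = "/tmp/front_page_stats.html";   // assumed writable path
$max_age    = 600;                            // seconds

if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $max_age) {
    // Fresh enough: serve the cached fragment, no database connection needed.
    readfile($cache_file);
} else {
    // Stale or missing: rebuild it from the database (placeholder credentials).
    $link = @mysql_connect("localhost", "boinc_ro", "secret");
    if ($link && mysql_select_db("lhcathome", $link)) {
        $res  = mysql_query("SELECT COUNT(*) FROM user", $link);
        $row  = mysql_fetch_row($res);
        $html = "<p>Registered users: " . $row[0] . "</p>";
        mysql_close($link);
    } else {
        $html = "<p>Statistics temporarily unavailable.</p>";
    }
    if ($fp = @fopen($cache_file, "w")) {
        fwrite($fp, $html);
        fclose($fp);
    }
    echo $html;
}
?>
```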
Joined: 2 Sep 04 Posts: 321 Credit: 10,607 RAC: 0

400-500 new users every day; you need a real server, not a PC as the server ;-)
Joined: 2 Sep 04 Posts: 378 Credit: 10,765 RAC: 0

> 400-500 new users every day; you need a real server, not a PC as the server ;-)

http://msteiner.home.cern.ch/msteiner/computercenter-visit/index.html

______________________________________________________________
Did your tech wear a static strap? No? Well, there ya go! :p
Joined: 2 Sep 04 Posts: 321 Credit: 10,607 RAC: 0

Nice animals, Alex ;-) But are those the servers from the LHC@home project? I think not.

http://guidowaldenmeier.de
Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0

> Nice animals, Alex ;-) But are those the servers from the LHC@home project? I think not.

Well, those servers are running just a few floors below me. But no, they are not for LHC@home :( (but I think we have more computing power :)

Anyway, we will get another server soon.

Markku Degerholm
LHC@home Admin
Joined: 1 Sep 04 Posts: 137 Credit: 1,769,043 RAC: 2

Are you planning on doing the same thing SETI@home has done, i.e. setting up a replicated MySQL server to handle the XML stats and much of the website (and maybe other things)? Or is the bottleneck somewhere else right now?

In hindsight it looks like there might be a design flaw in BOINC. I think too much is being handled by one database. The message boards (for example) should be in a completely separate database (IMHO), so that when the main database goes down, people can still get to the message boards and see what is going on. It would also take some load off the main DB.

But then who am I to speak? Just a newly graduated programmer with no experience :)

--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
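(The split being suggested here is normally done by pointing read-mostly pages such as the XML stats and the forums at a replication slave, while the scheduler keeps writing to the master. A rough PHP sketch of that read/write split; the host names, credentials and query are made up for illustration, not the project's real setup.)

```php
<?php
// Hypothetical read/write split: writes stay on the master, read-heavy
// pages (stats exports, forums) use a replicated slave when it is reachable.
// Host names and credentials are placeholders.

$master = mysql_connect("db-master.example.org", "boinc", "secret");
$slave  = @mysql_connect("db-slave.example.org", "boinc_ro", "secret");

// Fall back to the master if the slave is down, rather than failing outright.
$read_link = $slave ? $slave : $master;
mysql_select_db("lhcathome", $read_link);

// Example read-only query served from the slave (table name is assumed).
$res = mysql_query("SELECT name, total_credit FROM team ORDER BY total_credit DESC LIMIT 10", $read_link);
while ($row = mysql_fetch_assoc($res)) {
    echo $row['name'] . ": " . $row['total_credit'] . "\n";
}
?>
```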
Joined: 2 Sep 04 Posts: 352 Credit: 1,748,908 RAC: 0

> Results and Pending credits pages have been disabled until we see that we have enough capacity to handle them.

Sigh, it's deja vu all over again. Seems like I went through this with BOINC Seti, and they still haven't got it right over there yet...

The problems didn't start until you went to 5000 users and will only get worse as you add more users... It's obvious the hardware can't handle the load...
Joined: 18 Sep 04 Posts: 47 Credit: 1,886,234 RAC: 0

> Sigh, it's deja vu all over again. Seems like I went through this with BOINC Seti, and they still haven't got it right over there yet...

Let's hope we hear nothing about a SNAP Appliance! :)
Joined: 2 Sep 04 Posts: 378 Credit: 10,765 RAC: 0

> Let's hope we hear nothing about a SNAP Appliance! :)

Their tech guy is wearing it. What he's not wearing is a static strap.

______________________________________________________________
Did your tech wear a static strap? No? Well, there ya go! :p
Joined: 2 Sep 04 Posts: 14 Credit: 33,774 RAC: 0

Hi all,

As a person who works in the IT industry on projects of all scales, it is sometimes hard to fully spec the requirements of a system that grows: how many servers do I split the various components of the system over, and what are the server requirements and specifications?

We also have to realise that BOINC is new and its design is still evolving; only by putting it through the paces of large-scale projects like Seti will the bugs be killed. I looked at Seti, and they are spread over 5 servers and are processing approximately:

Ready to send: 1,262,222 results
In progress: 3,966,013 results

This is a huge amount of data to handle. Reconfigurations of running live systems are very complicated, and my hat is off to the technical support people at each of these projects.

73 de Peter VK3AVE
Joined: 2 Sep 04 Posts: 71 Credit: 8,657 RAC: 0

> Let's hope we hear nothing about a SNAP Appliance! :)

Didn't I read that SNAP is donating the hardware for the actual accelerator? ;)

Seriously though, I think a lot of the current load is these "chaotic" WUs they are currently testing. You can get a dozen of these bad boys, finish them, and be asking for more in a minute. Multiply by 5000+ users and that can't help matters any...
Joined: 2 Sep 04 Posts: 352 Credit: 1,748,908 RAC: 0

Haha... funny picture, Alex...

You're right too, Heff: on my 3.4, only 2 out of the last 14 Tunescan WUs actually made a full run. The other 12 were only in the 1 min to 20 min range... It doesn't take long to run through a mess of them at that rate...
Joined: 2 Sep 04 Posts: 24 Credit: 12,288 RAC: 0

> ... But there was also a bug in the main page which unnecessarily made database queries every time it was loaded. We fixed that, ...

The main page still tries to access the database:

Server Status: Up
Warning: Too many connections in /shift/lxfsrk429/data01/boinc/projects/lhcathome/html/inc/db_ops.inc on line 11
Warning: MySQL Connection Failed: Too many connections in /shift/lxfsrk429/data01/boinc/projects/lhcathome/html/inc/db_ops.inc on line 11
Unable to connect to database - please try again later
Too many connections
Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0

> The main page still tries to access the database:

Yes, but not every time the page is loaded. The server status cell is generated by a cron process every 10 minutes or so, and if the database is stuck then it will show the error message there too. It's not pretty, though - maybe the script should check for the error and say that the system is under heavy load in that case.

Markku Degerholm
LHC@home Admin
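(That check could live in the cron job itself: try the connection once, and on failure write a friendlier placeholder into the cached status cell rather than letting the raw MySQL warning leak into the page. A hedged PHP sketch; the output path, credentials and wording are assumptions.)

```php
<?php
// Hypothetical cron script, run every ~10 minutes: regenerate the cached
// server-status cell, substituting a friendly message when MySQL refuses
// connections instead of showing the raw "Too many connections" warning.

$status_file = "/var/www/lhcathome/server_status.inc";   // assumed include path

$link = @mysql_connect("localhost", "boinc_ro", "secret");   // placeholder credentials
if ($link) {
    $html = "Server status: Up";
    mysql_close($link);
} else {
    $html = "Server status: under heavy load, please try again later";
}

if ($fp = @fopen($status_file, "w")) {
    fwrite($fp, $html);
    fclose($fp);
}
?>
```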
Joined: 17 Sep 04 Posts: 52 Credit: 247,983 RAC: 0

Yup, there are server problems :D

LHC@home - 2004-10-03 14:57:33 - Scheduler RPC to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
LHC@home - 2004-10-03 14:57:33 - SCHEDULER_REPLY::parse(): bad first tag Content-type: text/plain
LHC@home - 2004-10-03 14:57:33 - Can't parse scheduler reply
LHC@home - 2004-10-03 15:17:26 - Scheduler RPC to http://lhcathome-sched1.cern.ch/scheduler/cgi failed
LHC@home - 2004-10-03 15:17:26 - No schedulers responded

Those are not single instances ;) I'm guessing too many connections ;) Upload seems to work fine though.
Joined: 2 Sep 04 Posts: 321 Credit: 10,607 RAC: 0

@CERN staff: it's hard to say, but the best thing would be to stop creating new accounts and fix the errors first. New users arrive every day, and you will only have more trouble with the "Mickey Mouse" MySQL; it's fine for private home pages or a small company, but NOT for a big project with 1000 and more users ;-(

http://guidowaldenmeier.de
Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0

> @CERN staff: it's hard to say, but the best thing would be to stop creating new accounts and fix the errors first. New users arrive every day, and you will only have more trouble with the "Mickey Mouse" MySQL; it's fine for private home pages or a small company, but NOT for a big project with 1000 and more users ;-(

Our user limit is at 5000 and stays there until we can serve more. Currently we have about 4800 registered users with 6700 active hosts. The problems started at about 4000 users, I think.

Having experience with many different databases, I don't know of any outstanding performer compared to MySQL. Most likely the problem is with our storage system, because the CPU load stays below 50% all the time. We have a triple-controller SCSI system with RAID-0 striped disks, which should be good enough... but apparently not. Or maybe there is something else we are missing.

Markku Degerholm
LHC@home Admin
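(One cheap way to tell whether the database is really waiting on the disks or on something else is to watch MySQL's own lock counters while the load is high: if Table_locks_waited climbs quickly relative to Table_locks_immediate, the tables are serializing on table locks and a faster RAID won't help much. A small diagnostic sketch; credentials are placeholders.)

```php
<?php
// Hypothetical diagnostic: print MySQL's table-lock counters. A rapidly
// growing Table_locks_waited relative to Table_locks_immediate points at
// lock contention rather than slow storage.

$link = mysql_connect("localhost", "root", "secret");   // placeholder credentials
$res  = mysql_query("SHOW STATUS LIKE 'Table_locks%'", $link);
while ($row = mysql_fetch_row($res)) {
    echo $row[0] . " = " . $row[1] . "\n";
}
mysql_close($link);
?>
```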
Joined: 2 Sep 04 Posts: 378 Credit: 10,765 RAC: 0

The SETI@home guys had similar problems when they left 'beta' and went 'live'. They had charts and graphs to show throughput on their network.

Seti's problems were hard to diagnose. There were a lot of people on the message boards with their own amateur theories (viruses, networks, various hardware issues, more memory, etc.). The Seti guys first diagnosed their network as the culprit... then the problem came back... they had a few server resets too. From the Seti news pages: http://setiweb.ssl.berkeley.edu/old_news.php

Maybe LHC is running into the same bugs that Seti ran into... (I'm thinking maybe their Aug 4th "too many files in the upload folder" problem.) Whatever the problem is, you guys will have a lot of things to check. Best of luck.
Joined: 27 Sep 04 Posts: 282 Credit: 1,415,417 RAC: 0

I think it's weird that LHC staff can't decide whether to send out large WUs (vtune IIRC) or standard ones. My dual-CPU system has something like 40 WUs ready to run... and that's approx. 2 days of work.

Why not continue to send out large WUs? Hey, we could live with that... And it would reduce the load on the servers thanks to fewer downloads and fewer WUs to handle... Just my 2 cents...
Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0

> I think it's weird that LHC staff can't decide whether to send out large WUs (vtune IIRC) or standard ones.

The point is that those short WUs are needed just as much as the long ones. We try to generate a mix of short and long work units such that the average is good and the short ones get crunched as well. But I think it will take a few more days before our physicists are able to start submitting those longer jobs.

Markku Degerholm
LHC@home Admin