Server problems

Author	Message
Markku Degerholm Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0	Message 3111 - Posted: 2 Oct 2004, 16:12:40 UTC Last modified: 2 Oct 2004, 16:13:44 UTC
We had to disable even the message boards for a while because the system was really stuck. There is certainly a limit to how many database connections our server can handle at a time... But there was also a bug in the main page that unnecessarily made database queries each time it was loaded. We fixed that and also tuned the database server, so the forums are open again. Let's hope the system works better now.
The Results and Pending credits pages have been disabled until we see that we have enough capacity to handle them.
Markku Degerholm
LHC@home Admin

Guido Alexander Waldenmeier Joined: 2 Sep 04 Posts: 321 Credit: 10,607 RAC: 0	Message 3112 - Posted: 2 Oct 2004, 16:20:20 UTC Last modified: 2 Oct 2004, 16:59:22 UTC
With 400-500 new users every day, you need a -Real-Server-, not a PC as a server ;-)

Alex Joined: 2 Sep 04 Posts: 378 Credit: 10,765 RAC: 0	Message 3121 - Posted: 2 Oct 2004, 18:36:15 UTC - in response to Message 3112.
> With 400-500 new users every day, you need a -Real-Server-, not a PC as a server ;-)
http://msteiner.home.cern.ch/msteiner/computercenter-visit/index.html
______________________________________________________________
Did your tech wear a static strap? No? Well, there ya go! :p

Guido Alexander Waldenmeier Joined: 2 Sep 04 Posts: 321 Credit: 10,607 RAC: 0	Message 3122 - Posted: 2 Oct 2004, 18:42:07 UTC
Nice animals, Alex ;-) But are these the servers for the LHC@home project? I think NO.
http://guidowaldenmeier.de

Markku Degerholm Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0	Message 3126 - Posted: 2 Oct 2004, 19:15:47 UTC - in response to Message 3122.
> Nice animals, Alex ;-) But are these the servers for the LHC@home project? I think NO.
Well, those servers are running just a few floors below me. But no, they are not for LHC@home :( (but I think we have more computing power :)
Anyway, we will get another server soon.
Markku Degerholm
LHC@home Admin

Toby Joined: 1 Sep 04 Posts: 137 Credit: 1,783,308 RAC: 1,104	Message 3138 - Posted: 3 Oct 2004, 0:20:04 UTC
Are you planning on doing the same thing seti@home has done, i.e. setting up a replicated MySQL server to handle the XML stats and much of the website (and maybe other things)? Or is the bottleneck somewhere else right now?
In hindsight it looks like there might be a design flaw in BOINC. I think too much is being handled by one database. The message boards (for example) should be in a completely separate database (IMHO) so that when the main database goes down, people can still get to the message boards and see what is going on. It would also take some load off the main DB.
But then who am I to speak? Just a newly graduated programmer with no experience :)
--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
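To make the separate-database idea above concrete, here is a minimal sketch. It is not how BOINC or LHC@home is implemented: the table names, the function names, and the use of in-memory SQLite in place of two MySQL servers are all assumptions chosen so the example runs on its own. The only point it illustrates is that forum reads keep working when they never touch the main project database.

```python
import sqlite3


def open_databases():
    """Open two independent databases (hypothetical layout).

    In-memory SQLite stands in for two separate MySQL servers; the table
    names 'result' and 'post' are made up for this illustration.
    """
    main_db = sqlite3.connect(":memory:")
    forum_db = sqlite3.connect(":memory:")
    main_db.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, state TEXT)")
    forum_db.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT)")
    return main_db, forum_db


def latest_posts(forum_db, limit=10):
    """Read forum posts using only the forum connection."""
    rows = forum_db.execute(
        "SELECT body FROM post ORDER BY id DESC LIMIT ?", (limit,)
    )
    return [body for (body,) in rows]


if __name__ == "__main__":
    main_db, forum_db = open_databases()
    forum_db.execute("INSERT INTO post (body) VALUES ('Server problems')")
    main_db.close()  # simulate the main project database going away
    print(latest_posts(forum_db))  # the forum is still readable
```

The trade-off is that anything the forum needs from the main database (user names, credit) would have to be copied over or fetched separately, which is presumably part of why a single database was the simpler starting point.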

STE\/E Joined: 2 Sep 04 Posts: 352 Credit: 2,898,606 RAC: 105	Message 3139 - Posted: 3 Oct 2004, 0:26:25 UTC
Results and Pending credits pages have been disabled until we see that we have enough capacity to handle them.
==========
Sigh, it's Deja Vu all over again. Seems like I went through this with BOINC Seti & they still haven't got it right over there yet...
The problems didn't start until you went to 5000 users & will only get worse as you add more users... It's obvious the hardware can't handle the load...

EclipseHA Joined: 18 Sep 04 Posts: 47 Credit: 1,886,234 RAC: 0	Message 3141 - Posted: 3 Oct 2004, 1:28:23 UTC - in response to Message 3139.
> Results and Pending credits pages have been disabled until we see that we have
> enough capacity to handle them.
> ==========
>
> Sigh, it's Deja Vu all over again. Seems like I went through this with BOINC
> Seti & they still haven't got it right over there yet...
Let's hope we hear nothing about a SNAP Appliance! :)

Alex Joined: 2 Sep 04 Posts: 378 Credit: 10,765 RAC: 0	Message 3142 - Posted: 3 Oct 2004, 2:59:01 UTC - in response to Message 3141.
> Let's hope we hear nothing about a SNAP Appliance! :)
Their tech guy is wearing it. What he's not wearing is a static strap.
______________________________________________________________
Did your tech wear a static strap? No? Well, there ya go! :p

PeterHallgarten Joined: 2 Sep 04 Posts: 14 Credit: 33,774 RAC: 0	Message 3144 - Posted: 3 Oct 2004, 4:48:07 UTC Last modified: 3 Oct 2004, 9:06:44 UTC
Hi All,
As a person who works in the IT industry on projects of all sorts of scales, I know it is sometimes hard to fully spec the requirements for a system that grows, i.e. how many servers to split the various components of the system over, and what the server requirements and specifications are.
We also have to realise that BOINC is new as well. Its design is still evolving, and only by putting it through the process of large-scale projects like Seti etc. will the bugs be killed.
I looked at Seti: they are spread over 5 servers and are processing:
State            Approximate #results
Ready to send    1,262,222
In progress      3,966,013
This is a huge amount of data to handle.
Reconfiguring running live systems is very complicated, and my hat is off to the technical support people at each of these projects.
73 de Peter VK3AVE

Heffed Joined: 2 Sep 04 Posts: 71 Credit: 8,657 RAC: 0	Message 3145 - Posted: 3 Oct 2004, 5:04:31 UTC - in response to Message 3141.
> Let's hope we hear nothing about a SNAP Appliance! :)
Didn't I read that SNAP is donating the hardware for the actual accelerator? ;)
Seriously though, I think a lot of the current load is these "chaotic" WUs they are currently testing. You can get a dozen of these bad boys and be done and asking for more in a minute. Multiply by 5000+ users and that can't help matters any...
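The "multiply by 5000+ users" point above can be turned into a rough back-of-envelope estimate. The sketch below is only an upper bound under stated assumptions: the function names are made up, and it assumes every active host finishes its batch and contacts the scheduler at a steady interval, which real hosts will not do in lockstep.

```python
def requests_per_second(hosts, seconds_between_requests, active_fraction=1.0):
    """Average scheduler requests per second.

    hosts: number of hosts attached to the project
    seconds_between_requests: how often one busy host asks for more work
    active_fraction: share of hosts actually cycling that fast (assumption)
    """
    if seconds_between_requests <= 0:
        raise ValueError("seconds_between_requests must be positive")
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return hosts * active_fraction / seconds_between_requests


def results_per_second(hosts, seconds_between_requests,
                       work_units_per_request, active_fraction=1.0):
    """Average results reported per second (each one is database work)."""
    rate = requests_per_second(hosts, seconds_between_requests, active_fraction)
    return rate * work_units_per_request


if __name__ == "__main__":
    # Worst case from the post: 5000 hosts, a dozen short WUs done per minute.
    print(f"{requests_per_second(5000, 60):.1f} requests/s")
    print(f"{results_per_second(5000, 60, 12):.1f} results/s")
    # A milder guess: only 10% of hosts are hitting short WUs at any time.
    print(f"{requests_per_second(5000, 60, active_fraction=0.1):.1f} requests/s")
```

With all 5000 hosts cycling once a minute this comes to roughly 83 scheduler requests and 1000 reported results per second; with 10% of hosts it is roughly 8 requests per second. Either way, the short work units multiply the per-host load on the database.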

STE\/E Joined: 2 Sep 04 Posts: 352 Credit: 2,898,606 RAC: 105	Message 3150 - Posted: 3 Oct 2004, 11:00:27 UTC Last modified: 7 Oct 2004, 9:41:08 UTC
haha ... Funny picture, Alex ...
You're right also, Heff: on my 3.4, only 2 out of the last 14 Tunescan WUs actually made a full run. The other 12 are only in the 1 min to 20 min range ... It doesn't take long to run through a mess of them at that rate ...

joe Joined: 2 Sep 04 Posts: 24 Credit: 12,288 RAC: 0	Message 3152 - Posted: 3 Oct 2004, 12:21:40 UTC - in response to Message 3111.
> ... But there was also bug in the main page which
> unnecessarily made database queries each time it was loaded. We fixed that,
> ...
The main page still tries to access the database:
Server Status Up,
Warning: Too many connections in /shift/lxfsrk429/data01/boinc/projects/lhcathome/html/inc/db_ops.inc on line 11
Warning: MySQL Connection Failed: Too many connections in /shift/lxfsrk429/data01/boinc/projects/lhcathome/html/inc/db_ops.inc on line 11
Unable to connect to database - please try again later
Too many connections

Markku Degerholm Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0	Message 3154 - Posted: 3 Oct 2004, 14:41:58 UTC - in response to Message 3152.
> The main page still tries to access the database:
Yes, but not every time the page is loaded. The server status cell is generated by a cron process every 10 minutes or so, and if the database is stuck then the error message will show up there too. It's not pretty, though. Maybe it should check for the error and say that the system is under heavy load in this case.
Markku Degerholm
LHC@home Admin
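The suggestion above (generate the status cell from cron, and show a friendly message when the database check fails) could look roughly like the sketch below. This is an illustration only, not the project's actual PHP code: the function names, the message wording, and the `check_database` callable are all hypothetical, and the real check would be whatever connection attempt the status script already makes.

```python
import os


def render_status_cell(check_database):
    """Build the text for the status cell.

    check_database is a hypothetical callable that raises an exception
    when the database cannot be reached (e.g. "Too many connections").
    """
    try:
        check_database()
    except Exception:
        # Deliberately broad for this sketch: any failure becomes a
        # friendly notice instead of raw warnings on the main page.
        return "Server Status: Up, but under heavy load. Please try again later."
    return "Server Status: Up"


def write_status_cell(path, check_database):
    """Intended to run from cron; the main page only includes the file."""
    fragment = render_status_cell(check_database)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(fragment)
    # Replace the old file in one step so a page load never sees a
    # half-written fragment.
    os.replace(tmp_path, path)
```

Because the main page only reads the generated file, a stuck database costs one failed check per cron run rather than one connection attempt per page view.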

Bruno G. Olsen & ESEA @ greenh... Joined: 17 Sep 04 Posts: 52 Credit: 247,983 RAC: 0	Message 3155 - Posted: 3 Oct 2004, 14:47:04 UTC
Yup, there are server problems :D
LHC@home - 2004-10-03 14:57:33 - Scheduler RPC to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
LHC@home - 2004-10-03 14:57:33 - SCHEDULER_REPLY::parse(): bad first tag Content-type: text/plain
LHC@home - 2004-10-03 14:57:33 - Can't parse scheduler reply
LHC@home - 2004-10-03 15:17:26 - Scheduler RPC to http://lhcathome-sched1.cern.ch/scheduler/cgi failed
LHC@home - 2004-10-03 15:17:26 - No schedulers responded
Those are not single instances ;) I'm guessing too many connections ;) Upload seems to work fine though.

Guido Alexander Waldenmeier Joined: 2 Sep 04 Posts: 321 Credit: 10,607 RAC: 0	Message 3156 - Posted: 3 Oct 2004, 14:59:56 UTC
@CERN staff
It's hard to say, but the best thing is to stop creating new accounts and fix the errors. Every day there are new users and you have more trouble with the -Mickey Mouse MySQL-. It's good for private homepages or small companies, but NOT for a great project with 1000 and more users ;-(
http://guidowaldenmeier.de

Markku Degerholm Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0	Message 3158 - Posted: 3 Oct 2004, 15:34:36 UTC - in response to Message 3156.
> @CERN staff
> It's hard to say, but the best thing is to stop creating new accounts and fix the errors.
> Every day there are new users and you have more trouble with the -Mickey Mouse MySQL-.
> It's good for private homepages or small companies, but NOT for a great project
> with 1000 and more users ;-(
Our user limit is at 5000 and stays there until we can serve more. Currently we have about 4800 registered users with 6700 active hosts. Problems started at about 4000 users, I think.
Having experience with many different databases, I don't know of any that is an outstanding performer compared to MySQL. Most likely the problem is with our storage system, because the CPU load stays below 50% at all times. We have a triple-controller SCSI system with RAID-0 striped disks, which should be good enough... but apparently not. Or maybe there is something else we are missing.
Markku Degerholm
LHC@home Admin

Alex Joined: 2 Sep 04 Posts: 378 Credit: 10,765 RAC: 0	Message 3168 - Posted: 4 Oct 2004, 3:27:19 UTC Last modified: 4 Oct 2004, 4:12:13 UTC
The seti@home guys had similar problems when they left 'beta' and went 'live'. They had charts and graphs to show throughput on their network. Seti's problems were hard to diagnose. There were a lot of people on the message boards with their own amateur theories (viruses, networks, various hardware issues, more memory, etc.).
The Seti guys first diagnosed their network as the culprit... then the problem came back... they had a few server resets too.
From the Seti news boards: http://setiweb.ssl.berkeley.edu/old_news.php
Maybe LHC is running into the same bugs that Seti ran into... (I'm thinking maybe their Aug 4th 'too many files in upload folder'.)
Whatever the problem is, you guys will have a lot of things to check. Best of luck.

sysfried Joined: 27 Sep 04 Posts: 282 Credit: 1,415,417 RAC: 0	Message 3172 - Posted: 4 Oct 2004, 8:52:29 UTC
I think it's weird that LHC staff can't decide whether to send out large WUs (vtune iirc) or standard ones.
My dual-CPU system has some 40 WUs ready to run... and that's approx. 2 days of work.
Why not continue to send out large WUs? Hey, we could live with that... And it would reduce the load on the servers, since there would be fewer downloads and fewer WUs to handle...
Just my 2 cents...

Markku Degerholm Joined: 3 Sep 04 Posts: 212 Credit: 4,545 RAC: 0	Message 3179 - Posted: 4 Oct 2004, 9:57:13 UTC - in response to Message 3172.
> I think it's weird that LHC staff can't decide whether to send out large WUs
> (vtune iirc) or standard ones.
The point is that those short WUs are needed just as much as the long ones. But we try to generate a mix of short and long work units such that the average is good and the short ones get crunched as well. I think it will take a few more days before our physicists are able to start submitting those longer jobs.
Markku Degerholm
LHC@home Admin

LHC@home