Message boards : Cafe LHC : SETI technical news
Ed and Harriet Griffith
Joined: 18 Sep 04
Posts: 37
Credit: 4,051
RAC: 0
Message 8222 - Posted: 30 Jun 2005, 4:13:35 UTC

June 29, 2005 - 23:00 UTC
Addendum to the previous post:

The outage took a bit longer than expected - the database dump had to be restarted twice (we reorganized our backup method a little bit, which required some "debugging"). We did everything we set out to do except the UPS testing, so that will be postponed.

The machine "gates" wasn't working out as a splitter, so we went with "sagan" instead (even though it is still the classic SETI@home data server and therefore quite busy). Every little bit helps. Eventually we added "kosh" as well, as it wasn't doing much at the time.

June 29, 2005 - 19:00 UTC
Since we're in the middle of an outage, why not write up another general update?

The validators are still disabled. The only public effect is a delay in crediting results. No credit should be lost, as it is always granted to results that still exist in the database, and they aren't deleted until they are validated and assimilated. So various queues are building up, but that's about it.
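For the curious, here's a rough sketch (in Python, illustrative only; not actual BOINC code, and the state names are made up) of why that ordering protects credit: a result only becomes eligible for deletion after it has been both validated and assimilated, so results waiting on the disabled validators simply sit in the database.

from enum import Enum, auto

class ResultState(Enum):
    REPORTED = auto()      # returned by a host, waiting for the validator
    VALIDATED = auto()     # validator has checked it and credit has been granted
    ASSIMILATED = auto()   # assimilator has folded it into the science database

def eligible_for_deletion(state):
    """A result may only be purged once the whole pipeline is done with it."""
    return state is ResultState.ASSIMILATED

print(eligible_for_deletion(ResultState.REPORTED))   # False: nothing gets deleted early

With the validators disabled, results pile up in the first state; they stay in the database, so credit can still be granted whenever validation resumes.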

While this is an inconvenience for our users, repairing this program has taken a back seat to higher-priority items (some expected, some that appeared out of nowhere).

First and foremost, galileo crashed last night. We haven't yet fully diagnosed the cause (as we've been busy keeping to the scheduled outage for mundane but necessary items like database backups, rebooting servers to pick up new automounts, and UPS testing). At this point we think it is a CPU board failure, but the server is back up (and working as a scheduling server, but not much else). That's the bad news.

The good news is that arriving today (just in the nick of time) is a new/used E3500 identical to galileo (graciously donated by Patrick Jeski - thanks Patrick!). It should be arriving at the loading dock as I type this message. So at least we already have replacement parts on site. Whether or not we need these parts remains to be seen, but the extra server definitely creates a warm, fuzzy feeling.

With galileo failing, and other splitter machines buckling under the load of increased demand, we are slowly running out of work to send out. We tried to add the machine "gates," but due to its low RAM (and the fact that it is still serving a bunch of SETI classic cgi requests) it didn't work very well. We'll try to add more splitter power today after the outage.

One of our main priorities right now is ramping down all the remaining pieces of SETI classic and preparing for the final shutdown. This includes sending out a mass e-mail, converting all the cgi programs to prevent future editing (account updates, team creation, joining, etc.), and buffing up the BOINC servers as best we can before the dam breaks.

In addition, the air conditioning in our closet began failing again over the past week. While the machines didn't get as hot as before this time, facilities took a long look at the system and determined that there is indeed a gas leak (freon, or whatever they use besides freon these days). More gas was added, which should last the few weeks until the problem is fixed for good.

ID: 8222
Ed and Harriet Griffith
Joined: 18 Sep 04
Posts: 37
Credit: 4,051
RAC: 0
Message 8242 - Posted: 1 Jul 2005, 0:16:06 UTC

June 30, 2005 - 10:29 UTC
We are happy to announce the start of a large production run, with the first set of 27,000 million-turn jobs just submitted. We will soon decide whether to open up to more users. Many thanks for your patience while waiting for this good news!

ID: 8242
Ed and Harriet Griffith
Joined: 18 Sep 04
Posts: 37
Credit: 4,051
RAC: 0
Message 8243 - Posted: 1 Jul 2005, 0:20:00 UTC

June 30, 2005 - 17:00 UTC
Last night the upload/download server ran out of processes. The load was very heavy, which has adverse effects on apache: when the hourly apache restarts were issued (for log rotation), the old processes wouldn't die, and the new ones kept filling up the process table. By this morning we had over 7,000 httpd processes on the machine! Apparently some apache tuning is in order.
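For anyone wondering how we plan to catch this sort of pileup sooner, here's a minimal watchdog sketch (illustrative only; the threshold and the ps-based counting are assumptions, not our actual setup) that counts httpd processes and complains when they stack up:

import subprocess
import sys

THRESHOLD = 1000  # hypothetical limit; the real box hit 7000+ before anyone noticed

def count_httpd():
    """Count running httpd processes by name, using headerless ps output."""
    out = subprocess.run(["ps", "-e", "-o", "comm="],
                         capture_output=True, text=True, check=True).stdout
    return sum(1 for line in out.splitlines() if line.strip() == "httpd")

if __name__ == "__main__":
    n = count_httpd()
    if n > THRESHOLD:
        sys.stderr.write("WARNING: %d httpd processes running (limit %d)\n" % (n, THRESHOLD))
        sys.exit(1)
    print("httpd process count looks fine: %d" % n)

Run from cron every few minutes, something like this would have flagged the runaway restarts hours before the process table filled up.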

The process buildup itself went unnoticed, though the lack of server status page updates did get noticed. That page gets updated every 10 minutes (along with all kinds of internal-use BOINC status files). Once every few hours the whole system "skips a turn" due to some funny interaction with cron, but occasionally it stops altogether until somebody comes along and "kicks it" (i.e. removes some stale lock files).
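For those wondering what "kicking it" amounts to, here's a rough sketch of the lock-file pattern involved (hypothetical file name and timeout; the real status scripts differ). A cron run skips its turn if a recent lock exists, and clears the lock itself when it is obviously stale:

import os
import sys
import time

LOCK_FILE = "/tmp/status_update.lock"   # hypothetical path
STALE_AFTER = 60 * 60                   # treat a lock older than an hour as stale

def acquire_lock():
    """Return True if this run may proceed, False if a recent run holds the lock."""
    if os.path.exists(LOCK_FILE):
        age = time.time() - os.path.getmtime(LOCK_FILE)
        if age < STALE_AFTER:
            return False                 # skip a turn; another run is probably active
        os.remove(LOCK_FILE)             # the automatic "kick": clear the stale lock
    with open(LOCK_FILE, "w") as f:
        f.write(str(os.getpid()))
    return True

def release_lock():
    if os.path.exists(LOCK_FILE):
        os.remove(LOCK_FILE)

if __name__ == "__main__":
    if not acquire_lock():
        sys.exit(0)
    try:
        pass  # regenerate the status page and internal BOINC status files here
    finally:
        release_lock()

The manual "kick" is just the stale-lock removal in acquire_lock done by hand: delete the leftover lock file so the next cron run can proceed.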

So we noticed the status page was stale, "kicked" the whole system, and it started up again (temporarily). Everything looked okay, so we went to bed, only to realize the gravity of the problem in the morning: the status updates were hanging because they kept getting stuck trying to talk to the hosed upload/download server.

There was also a 2-hour lab-wide network outage during all this. Not sure what happened there, but that's out of our hands.


ID: 8243
