On Wed, 2005-10-26 at 13:38, Wes Williams wrote:
> Even with a primary UPS on the *entire PostgreSQL server*, does one still
> need, or even still recommend, a battery-backed cache on the RAID controller
> card? [ref SCSI 320, of course]
>
> If so, I'd be interested in knowing briefly why.

I'll tell you a quick little story.

We got a new server and aged out the old one. The new server was a dual P-IV 2800 with 2 gigs of RAM and a pair of 36 gig U320 drives in a RAID-1 mirror behind a battery-backed cache. This machine also had four 120 gig IDE drives for file storage, but the database was on the dual SCSIs under the RAID controller. I tested it with the power-off test (pull the plug mid-write and make sure every acknowledged commit survives; a rough sketch of that kind of test is at the end of this message), and it passed with flying colors, so we put it into production. Many other servers, including our Oracle servers, were never tested this way.

This machine had dual redundant power supplies with separate power cables running into two separate rails, each running off of a different UPS. The UPSes were fed by power conditioners, and on the other side of those was a switch to cut us over to diesel generators should the power go out. The UPSes were quite large; even with a hundred or so computers in the hosting center, there was about three hours of battery time before the diesel generator HAD to be up or we'd lose power.

Seems pretty solid, right? We're talking a multi-million-dollar hosting center, the kind with an ops center that looks like the deck of the Enterprise. Raised floors, everything.

Fast forward six months. An electrician working on the wiring in the ceiling above one of the power conditioners clips off a tiny piece of wire. Said tiny piece of wire drops into the power conditioner. Said power conditioner overloads and trips the other two power conditioners in the hosting center. This also blew out the master controller on the UPS setup, so it didn't come back up. The switch for the diesel generator would have cut over, but it was fried too. The UPSes, luckily, were the constant-on variety, so they took the hit for the computers on the other side of them; about half the UPSes were destroyed.

After about three hours, we had enough of the power jury-rigged to bring the systems back up. In a company with dozens and dozens of database servers, running everything from MySQL to Oracle to PostgreSQL to Ingres to MSSQL to Interbase to FoxPro, exactly one of them came up without any errors. You already know which one it was, or I wouldn't be writing this letter.

Power supplies fail, UPSes fail, hard drives fail, and RAID controllers and battery-backed caches fail. You can't remove every possibility of failure, but you can limit the number of things that can harm you should they fail. I do know that after that outage, I never once got shit for using PostgreSQL ever again from anybody.

The sad thing is, if any of those other machines had had battery-backed RAID controllers with local storage (many were running on NFS or SMB mounts), they would have been fine too. But many of the DBAs for those other databases had the same "who needs to worry about sudden power-off when we have UPSes and power conditioners" attitude. You can guess what optional feature suddenly seemed like a good idea for every new database server after that.
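For anyone who wants to run the same kind of power-off test themselves, here is a rough sketch of the idea. The details are my own assumptions, not the exact procedure from back then: it assumes Python 3 with psycopg2, a reachable database named in the DSN, and a throwaway table I'm calling plug_test created ahead of time with `CREATE TABLE plug_test (id bigint PRIMARY KEY)`.

import sys
import psycopg2

# Adjust the DSN for your setup; "plugtest" is a hypothetical database name.
conn = psycopg2.connect("dbname=plugtest")
cur = conn.cursor()

if sys.argv[1:] == ["check"]:
    # After the crash and restart: see how many committed rows survived.
    cur.execute("SELECT max(id) FROM plug_test")
    print("highest surviving id:", cur.fetchone()[0])
else:
    # Before the crash: insert and commit one row at a time, printing each
    # id only AFTER the server has acknowledged the commit. Pull the plug
    # while this loop is running.
    i = 0
    while True:
        i += 1
        cur.execute("INSERT INTO plug_test (id) VALUES (%s)", (i,))
        conn.commit()                     # the row is supposedly durable now
        print("committed:", i, flush=True)

Run the insert loop, yank the power cord mid-run, and note the last id it printed as committed. After the machine comes back up, run the script with "check": if the highest surviving id is lower than the last acknowledged one, a write cache somewhere lied about durability, and that's exactly the failure a battery-backed controller cache is there to prevent.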