Re: database corruption

"Ian Westmacott" <ianw@xxxxxxxxxxxxxx> · Fri, 15 Apr 2005 23:39:26 -0400

Hi Chris,

> I think it is important to figure out why this is happening.  I would 
> not want to run any production databases on systems that were failing 
> like this.

You and me both :)  (In our application, though, losing the last 5
minutes of transactions is not a total disaster; it is a disaster if
the database is unusable when it comes up.)

> 1)  Any other computers suffer random application crashes, power downs, 
> etc. in your building?

No, but more importantly, we have seen this failure happen in different
buildings (in different cities), on hardware of the same spec but
physically different units (at least three motherboards and power
supplies, six disks).  That's why it really feels like a bug or
configuration error.

> 2)  I take it there are no Raid controllers involved?

No.  But we get this error with and without software RAID, FWIW.

> 3)  RAM is non-ECC?

I'll have to double-check, but I think it is.

> 4)  Are the systems on UPS's?

Yes.

> If I could make a wild (and probably wrong) guess, I would wonder if 
> something external to the system (like electrical supply) was 
> introducing glitches into memory, causing bad data to be written.  I am 
> only mentioning it because I have implicated electrical supply in other 
> cases where rare computer failures were affecting many systems...

I would tend to agree, but this occurs on multiple systems in
multiple locations (but, oddly enough, we are having trouble reproducing
it in our lab).  And we have run memtest.

However, it is true that all the systems on which this has been seen
have the same spec power supply/UPS.  I would think, though, that a
power problem could cause errors at any time, whereas all of these
failures occur after a reboot (that is: no corruption, reboot,
immediate corruption).  I have stopped and started Postgres while the
application is running, without corruption.  (This smells like a dirty
buffer not being written to disk, which is why we focused on the
filesystem.)
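
To make the dirty-buffer suspicion concrete, here is a minimal Python
sketch of what "written to disk" has to mean for a single file.  It is
not taken from Postgres or from the application; `durable_write` is a
hypothetical helper, and it assumes a Linux system where a directory
can be opened and fsync'ed.  Even then, it only helps if the filesystem
and the drive actually honor the flush.

```python
import os


def durable_write(path, data):
    """Write bytes to path and ask the kernel to push them to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush this file's dirty pages

    # Also fsync the parent directory so the new directory entry is persisted
    # (assumption: Linux semantics, where directories can be fsync'ed).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If a write like this survives a plain stop/start but not a reboot, that
would point below the application, at the filesystem or the drive.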

Here are some further details:

- 865PE/G Neo2-P (MS-6728) ATX motherboard (and similar for IDE)
- 2x 512MB/400MHz DIMM RAM
- Intel Pentium 4/3.2GHz/1MB/800MHz CPU (hyperthreading enabled)
- 2x WD 250GB/7200RPM/8MB/SATA-150 on ICH5 SATA ports (also tested
  similar IDE drives), writethrough
- XFS and JFS (not seen on ext3, but not fully tested)
- either software RAID 0 on both drives, or one drive alone without RAID
- SuSE 9.1
- 2.6.6 kernel
- Postgres 7.4.2
- 300 TPS against DB containing 5-50GB data, no more than a dozen
  concurrent connections.
- tested with fsync on and off, and with fdatasync
- Postgres may be taken down (via init script) with connections open to
  it (in fact the application may aggressively try to re-establish the
  connection as it goes down).
- we have put syncs, sleeps and large dd writes to the disk in the
  shutdown scripts; none of them prevents the corruption.
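
One thing the shutdown script could check, sketched below in Python, is
whether the postmaster has really exited before the script moves on to
sync and reboot.  This is only a sketch under assumptions: it assumes
the init script might return while Postgres is still shutting down, and
it relies on the postmaster removing `postmaster.pid` from the data
directory on a clean exit.  `wait_for_postmaster_exit` and its
parameters are hypothetical names, not part of Postgres.

```python
import os
import time


def wait_for_postmaster_exit(data_dir, timeout=60.0, poll=0.5):
    """Return True once postmaster.pid is gone, False if timeout expires first."""
    pid_file = os.path.join(data_dir, "postmaster.pid")
    deadline = time.monotonic() + timeout
    while os.path.exists(pid_file):
        if time.monotonic() >= deadline:
            return False  # still running (or stale pid file): do not reboot yet
        time.sleep(poll)
    return True
```

A False result just before the reboot would suggest the machine is
going down while Postgres still has unwritten state.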

At this point, I'm really looking for fresh ideas.  Thanks,

	--Ian