Hi Chris,

> I think it is important to figure out why this is happening. I would
> not want to run any production databases on systems that were failing
> like this.

You and me both :) (In our application, though, it is not a total disaster to lose the last 5 minutes of transactions; it is a disaster if the database is unusable when it comes up.)

> 1) Any other computers suffer random application crashes, power downs,
> etc. in your building?

No, but more importantly, we have seen this failure happen in different buildings (in different cities), on same-spec but different hardware (at least three motherboards and power supplies, six disks). That's why it really feels like a bug or configuration error.

> 2) I take it there are no RAID controllers involved?

No. But we get this error with and without software RAID, FWIW.

> 3) RAM is non-ECC?

I'll have to double-check, but I think it is.

> 4) Are the systems on UPS's?

Yes.

> If I could make a wild (and probably wrong) guess, I would wonder if
> something external to the system (like electrical supply) was
> introducing glitches into memory, causing bad data to be written. I am
> only mentioning it because I have implicated electrical supply in other
> cases where rare computer failures were affecting many systems...

I would tend to agree, but this occurs on multiple systems in multiple locations (though, oddly enough, we are having trouble reproducing it in our lab). And we have run memtest. However, it is true that all the systems on which this has been seen have the same spec power supply/UPS.

I would think, though, that an electrical problem could cause errors at any time, whereas all of these failures occur after a reboot (that is: no corruption, reboot, immediate corruption). I have stopped and started Postgres while the application is running, without corruption. (It smells like a dirty buffer not being written to disk, which is why we focused on the filesystem.)
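One way to test the dirty-buffer theory independently of Postgres would be a small probe that writes checksummed data, forces it to disk before the reboot, and checks it afterwards. This is only a minimal sketch, assuming Python is available on the affected machines; `write_durable`, `verify` and the probe file path are hypothetical names for illustration, not part of our application or of Postgres.

```python
import hashlib
import os


def write_durable(path: str, payload: bytes) -> None:
    """Write a SHA-256 digest followed by payload, then fsync file and directory."""
    data = hashlib.sha256(payload).digest() + payload
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = 0
        while written < len(data):
            # os.write may write fewer bytes than requested, so loop
            written += os.write(fd, data[written:])
        os.fsync(fd)  # ask the kernel to flush file data and metadata
    finally:
        os.close(fd)
    # Also fsync the containing directory so the new entry is durable (POSIX)
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def verify(path: str) -> bool:
    """Return True if the stored digest matches the payload read back."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 32:
        return False  # too short to hold a SHA-256 digest
    return hashlib.sha256(data[32:]).digest() == data[:32]
```

The idea would be to call `write_durable` on the XFS or JFS volume just before the reboot and `verify` after it. If the probe file also comes back damaged, that would point below Postgres (filesystem, driver, or drive write cache); if it survives, that would not rule those out, but it would make Postgres's own shutdown path more interesting.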
Here are some further details:

- 865PE/G Neo2-P (MS-6728) ATX motherboard (and similar for IDE)
- 2x 512MB/400MHz DIMM RAM
- Intel Pentium 4/3.2GHz/1MB/800MHz CPU (hyperthreading enabled)
- 2x WD 250GB/7200RPM/8MB/SATA-150 on ICH5 SATA ports (also tested similar IDE drives), writethrough
- XFS and JFS (not seen on ext3, but not fully tested)
- either software RAID 0 on both drives, or one drive alone without RAID
- SuSE 9.1
- 2.6.6 kernel
- Postgres 7.4.2
- 300 TPS against DB containing 5-50GB data, no more than a dozen concurrent connections
- fsync (or not) and fdatasync
- Postgres may be taken down (via init script) with connections open to it (in fact, the application may aggressively try to re-establish the connection as it goes down)
- we have put syncs, sleeps and large dd writes to the disk in the shutdown scripts, none of which work

At this point, I'm really looking for fresh ideas.

Thanks,
--Ian