On 15/11/2009 11:57 AM, Laszlo Nagy wrote: > Ok, I'm getting confused here. There is the WAL, which is written > sequentially. If the WAL is not corrupted, then it can be replayed on > next database startup. Please somebody enlighten me! In my mind, fsync > is only needed for the WAL. If I could configure postgresql to put the > WAL on a real hard drive that has BBU and write cache, then I cannot > loose data. Meanwhile, product table data could be placed on the SSD > drive, and I sould be able to turn on write cache safely. Am I wrong? A change has been written to the WAL and fsync()'d, so Pg knows it's hit disk. It can now safely apply the change to the tables themselves, and does so, calling fsync() to tell the drive containing the tables to commit those changes to disk. The drive lies, returning success for the fsync when it's just cached the data in volatile memory. Pg carries on, shortly deleting the WAL archive the changes were recorded in or recycling it and overwriting it with new change data. The SSD is still merrily buffering data to write cache, and hasn't got around to writing your particular change yet. The machine loses power. Oops! A hole just appeared in history. A WAL replay won't re-apply the changes that the database guaranteed had hit disk, but the changes never made it onto the main database storage. Possible fixes for this are: - Don't let the drive lie about cache flush operations, ie disable write buffering. - Give Pg some way to find out, from the drive, when particular write operations have actually hit disk. AFAIK there's no such mechanism at present, and I don't think the drives are even capable of reporting this data. If they were, Pg would have to be capable of applying entries from the WAL "sparsely" to account for the way the drive's write cache commits changes out-of-order, and Pg would have to maintain a map of committed / uncommitted WAL records. Pg would need another map of tablespace blocks to WAL records to know, when a drive write cache commit notice came in, what record in what WAL archive was affected. It'd also require Pg to keep WAL archives for unbounded and possibly long periods of time, making disk space management for WAL much harder. So - "not easy" is a bit of an understatement here. You still need to turn off write caching. -- Craig Ringer -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance