Re: SSD + RAID

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Sun, 15 Nov 2009 16:46:56 +0800

On 15/11/2009 11:57 AM, Laszlo Nagy wrote:

> Ok, I'm getting confused here. There is the WAL, which is written
> sequentially. If the WAL is not corrupted, then it can be replayed on
> next database startup. Please somebody enlighten me! In my mind, fsync
> is only needed for the WAL. If I could configure postgresql to put the
> WAL on a real hard drive that has BBU and write cache, then I cannot
> loose data. Meanwhile, product table data could be placed on the SSD
> drive, and I sould be able to turn on write cache safely. Am I wrong?

A change has been written to the WAL and fsync()'d, so Pg knows it's hit
disk. It can now safely apply the change to the tables themselves, and
does so, calling fsync() to tell the drive containing the tables to
commit those changes to disk.

The drive lies, returning success for the fsync when it's just cached
the data in volatile memory. Pg carries on, shortly deleting the WAL
archive the changes were recorded in or recycling it and overwriting it
with new change data. The SSD is still merrily buffering data to write
cache, and hasn't got around to writing your particular change yet.

The machine loses power.

Oops! A hole just appeared in history. A WAL replay won't re-apply the
changes that the database guaranteed had hit disk, but the changes never
made it onto the main database storage.

Possible fixes for this are:

- Don't let the drive lie about cache flush operations, ie disable write
buffering.

- Give Pg some way to find out, from the drive, when particular write
operations have actually hit disk. AFAIK there's no such mechanism at
present, and I don't think the drives are even capable of reporting this
data. If they were, Pg would have to be capable of applying entries from
the WAL "sparsely" to account for the way the drive's write cache
commits changes out-of-order, and Pg would have to maintain a map of
committed / uncommitted WAL records. Pg would need another map of
tablespace blocks to WAL records to know, when a drive write cache
commit notice came in, what record in what WAL archive was affected.
It'd also require Pg to keep WAL archives for unbounded and possibly
long periods of time, making disk space management for WAL much harder.
So - "not easy" is a bit of an understatement here.

You still need to turn off write caching.

--
Craig Ringer

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance