On 15/11/2009 2:05 PM, Laszlo Nagy wrote:

>> A change has been written to the WAL and fsync()'d, so Pg knows it's hit
>> disk. It can now safely apply the change to the tables themselves, and
>> does so, calling fsync() to tell the drive containing the tables to
>> commit those changes to disk.
>>
>> The drive lies, returning success for the fsync when it's just cached
>> the data in volatile memory. Pg carries on, shortly deleting the WAL
>> archive the changes were recorded in or recycling it and overwriting it
>> with new change data. The SSD is still merrily buffering data to write
>> cache, and hasn't got around to writing your particular change yet.
>>
> All right. I believe you. In the current Pg implementation, I need to
> turn off the disk cache.

That's certainly my understanding. I've been wrong many times before :S

> #1. The user wants to change something, resulting in a write_to_disk(data) call.
> #2. The data is written into the WAL and fsync()-ed.
> #3. At this point the write_to_disk(data) call CAN RETURN; the user can
> continue his work (the WAL is already written, so changes cannot be lost).
> #4. Pg can continue writing the data onto the disk, and fsync() it.
> #5. Then the WAL archive data can be deleted.
>
> Now maybe I'm wrong, but between #3 and #5, the data to be written is
> kept in memory. This is basically a write cache, implemented in OS
> memory. We could really handle it like a write cache. E.g. everything
> would remain the same, except that we add some latency. We can wait some
> time after the last modification of a given block, and then write it out.

I don't know enough about the whole affair to give you a good explanation
(I tried, and it just showed me how much I didn't know), but here are a
few issues:

- Pg doesn't know the erase block sizes or positions. It can't group
writes up by erase block except by hoping that, within a given file,
writing in page order will get the blocks to the disk in roughly
erase-block order.
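As an aside, the #1-#5 ordering Laszlo describes can be sketched in a few
lines of toy Python. This is only an illustration of the ordering, not
Pg's actual code; the write_to_disk() function and file layout here are
invented for the sketch:

```python
import os
import tempfile

def write_to_disk(wal_path, table_path, data):
    """Toy sketch of the WAL-first ordering (steps #1-#5 above).
    Invented for illustration; nothing like PostgreSQL's real code."""
    # #2: append the change to the WAL and fsync() it first.
    with open(wal_path, "ab") as wal:
        wal.write(data)
        wal.flush()
        os.fsync(wal.fileno())
    # #3: the call could safely return to the user here -- the change is
    # durable in the WAL even if the table write below never happens.

    # #4: apply the change to the table file and fsync() it too.
    with open(table_path, "ab") as table:
        table.write(data)
        table.flush()
        os.fsync(table.fileno())

    # #5: only now is it safe to recycle the WAL segment -- and only if
    # the fsync() above really pushed the data to stable storage, which
    # is exactly the promise a lying write cache breaks.
    os.truncate(wal_path, 0)

d = tempfile.mkdtemp()
wal, tbl = os.path.join(d, "wal"), os.path.join(d, "table")
open(wal, "wb").close(); open(tbl, "wb").close()
write_to_disk(wal, tbl, b"change-1")
print(os.path.getsize(wal), os.path.getsize(tbl))  # prints "0 8"
```

The whole scheme leans on the fsync() in step #4 being truthful: if the
drive acks it from volatile cache, step #5 destroys the only durable copy.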
So your write caching isn't going to do anywhere near as good a job as
the SSD's own cache can.

- The only way to make this help the SSD out much would be to use a LOT
of RAM for the write cache and maintain a LOT of WAL archives. That's RAM
not being used for caching read data. The large number of WAL archives
also means incredibly long WAL replay times after a crash.

- You still need a reliable way to tell the SSD "really flush your cache
now" after you've flushed the changes from your huge chunks of WAL files
and are getting ready to recycle them.

I was thinking that write ordering would be an issue too, as some changes
in the WAL would hit the main disk before others that were earlier in the
WAL. However, I don't think that matters if full_page_writes is on. If
you replay from the start, you'll re-apply some changes with older
versions, but they'll be corrected again by a later WAL record. So
ordering during WAL replay shouldn't be a problem. On the other hand, the
INCREDIBLY long WAL replay times during recovery would be a nightmare.

> I don't think that any SSD drive has more than some
> megabytes of write cache.

The big, lots-of-$$ ones have HUGE battery-backed caches for exactly this
reason.

> The same amount of write cache could easily be
> implemented in OS memory, and then Pg would always know what hit the disk.

Really? How does Pg know in what order the SSD writes things out from its
cache?

--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
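P.S. The full_page_writes argument can be made concrete with a toy model
(invented for illustration -- this is nothing like Pg's real WAL record
format): when every record carries a complete page image, replay just
overwrites pages, so the final state doesn't depend on which cached
writes had already reached the platters before the crash.

```python
# Toy WAL: each record is (page_no, complete page image), as with
# full_page_writes = on. Hypothetical format, not PostgreSQL's.
wal_records = [
    (0, b"A1"),  # older image of page 0
    (1, b"B1"),
    (0, b"A2"),  # a later record overwrites page 0 again
]

def replay(pages, records):
    # Replaying from the start may re-apply stale page images, but each
    # later record simply overwrites them, so the result is the same no
    # matter what subset of writes was already on disk.
    for page_no, image in records:
        pages[page_no] = image
    return pages

# Crash case 1: none of the table writes made it to disk.
print(replay({}, wal_records))
# Crash case 2: a *later* write hit disk while an earlier one was still
# stuck in the drive's cache.
print(replay({0: b"A2"}, wal_records))
# Both print {0: b'A2', 1: b'B1'} -- the same state either way.
```

The same convergence holds for any starting subset of on-disk pages,
which is why ordering during replay shouldn't be the problem; the replay
*time* is.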