On 11/19/09 1:04 PM, "Greg Smith" <greg@xxxxxxxxxxxxxxx> wrote:

> That won't help. Once the checkpoint is done, the problem isn't just
> that the WAL segments are recycled. The server isn't going to use them
> even if they were there. The reason why you can erase/recycle them is
> that you're doing so *after* writing out a checkpoint record that says
> you don't have to ever look at them again. What you'd actually have to
> do is hack the server code to insert that delay after every fsync--there
> are none that you can cheat on and not introduce a corruption
> possibility. The whole WAL/recovery mechanism in PostgreSQL doesn't
> make a lot of assumptions about what the underlying disk has to actually
> do beyond the fsync requirement; the flip side to that robustness is
> that it's the one you can't ever violate safely.

Yeah, I guess it's not so easy. Having the system "hold" one extra
checkpoint's worth of segments and then, during recovery, always replay
that previous one plus the current might work, but I don't know if that
could cause corruption. I assume replaying a log twice won't, so
replaying the N-1 checkpoint, then the current one, might work. If so,
that would be a cool feature -- as long as the N-2 checkpoint is no
longer in the OS or I/O hardware caches when checkpoint N completes,
you're safe! It's probably more complicated though, especially with
respect to things like MVCC on DDL changes.

> Right. It's not used like the write-cache on a regular hard drive,
> where they're buffering 8MB-32MB worth of writes just to keep seek
> overhead down. It's there primarily to allow combining writes into
> large chunks, to better match the block size of the underlying SSD
> flash cells (128K). Having enough space for two full cells allows
> spooling out the flash write to a whole block while continuing to
> buffer the next one.
> This is why turning the cache off can tank performance so badly--you're
> going to be writing a whole 128K block no matter what if it's forced to
> disk without caching, even if it's just to write an 8K page to it.

As others mentioned, flash must erase a whole block at once, but it can
write sequentially to a block in much smaller chunks. I believe that MLC
and SLC differ a bit here; SLC can write smaller subsections of the
erase block. A little old but still very useful:
http://research.microsoft.com/apps/pubs/?id=63596

> That's only going to reach 1/16 of the usual write speed on single page
> writes. And that's why you should also be concerned at whether
> disabling the write cache impacts the drive longevity, lots of small
> writes going out in small chunks is going to wear flash out much faster
> than if the drive is allowed to wait until it's got a full sized block
> to write every time.

This is still a concern: even if the SLC cells are technically capable
of writing sequentially in smaller chunks, with the write cache off they
may not do so.

> The fact that the cache is so small is also why it's harder to catch the
> drive doing the wrong thing here. The plug test is pretty sensitive to
> a problem when you've got megabytes worth of cached writes that are
> spooling to disk at spinning hard drive speeds. The window for loss on
> a SSD with no seek overhead and only a moderate number of KB worth of
> cached data is much, much smaller. Doesn't mean it's gone though. It's
> a shame that the design wasn't improved just a little bit; a cheap
> capacitor and blocking new writes once the incoming power dropped is all
> it would take to make these much more reliable for database use. But
> that would raise the price, and not really help anybody but the small
> subset of the market that cares about durable writes.

Yup. There are manufacturers who claim no data loss on power failure;
hopefully these become more common.
http://www.wdc.com/en/products/ssd/technology.asp?id=1

I still contend it's a lot safer than a hard drive. I have not seen one
fail yet (out of about 150 heavy-use drive-years on X25-Ms). Any system
that does not have a battery-backed write cache will be faster and safer
with an SSD, write cache on, than with hard drives, write cache on. BBU
caching is not fail-safe either: batteries wear out, cards die or
malfunction.

If you need maximum data integrity, you will probably go with a
battery-backed cache RAID setup, with or without SSDs. If you don't go
that route, SSDs seem like the best option. The "middle ground" of
software RAID on hard drives with their write caches off doesn't seem
useful to me at all. I can't think of one use case that isn't better
served by a slightly cheaper array of disks with a hardware BBU card (if
the data is important or the data size is large) OR a set of SSDs (if
performance is more important than data safety).

>> 4: Yet another solution: The drives DO adhere to write barriers
>> properly. A filesystem that used these in the process of fsync() would
>> be fine too. So XFS without LVM or MD (or the newer versions of those
>> that don't ignore barriers) would work too.
>
> If I really trusted anything beyond the very basics of the filesystem to
> really work well on Linux, this whole issue would be moot for most of
> the production deployments I do. Ideally, fsync would just push out the
> minimum of what's needed, it would call the appropriate write cache
> flush mechanism the way the barrier implementation does when that all
> works, and life would be good. Alternately, you might even switch to
> using O_SYNC writes instead, which on a good filesystem implementation
> are both accelerated and safe compared to write/fsync (I've seen that
> work as expected on Veritas VxFS for example).
>
> We could all move to OpenSolaris where that stuff does work right...
;) I think a lot of what makes ZFS slower for some tasks is that it
correctly implements and uses write barriers...

> Meanwhile, in the actual world we live in, patches that make writes more
> durable by default are dropped by the Linux community because they tank
> performance for too many types of loads, I'm frightened to turn on
> O_SYNC at all on ext3 because of reports of corruption on the lists
> here, fsync does way more work than it needs to, and the way the
> filesystem and block drivers have been separated makes it difficult to
> do any sort of device write cache control from userland. This is why I
> try to use the simplest, best tested approach out there whenever
> possible.

Oh, I hear you :) At least ext4 looks like an improvement for the
RHEL6/CentOS6 timeframe. Checksums are handy.

Many of my systems don't need the highest data reliability, though. A
RAID 0 of X25-Ms will be much, much safer than the same thing on regular
hard drives, and faster. I'm putting a few of those in one system soon
(yes, the M version, and I won't put WAL on it). Two such drives kick
the crap out of anything else for the price when performance is most
important and the data is just a copy of something stored in a much
safer place than any single server. Previously on such systems, a
caching RAID card would be needed for performance, but without a BBU the
data-loss risk is very high (much higher than an SSD with caching on --
a 256K versus a 512MB cache!). And an SSD costs less than the RAID card.
As long as the total data size isn't too big, they work well. And even
then, some tablespaces can be put on a large HD, leaving the more
critical ones on the SSD.

I estimate the likelihood of complete data loss from a 2-SSD RAID 0 as
about the same as from a 4-disk RAID 5 of hard drives. There is a big
difference between a couple of corrupted files and a lost drive... I
have recovered postgres systems with corruption by reindexing and
restoring single tables from backups.
When one drive in a stripe is lost, or a pair in a RAID 10 goes down,
all is lost. I wonder -- has anyone seen an Intel SSD randomly die like
a hard drive?

I'm still trying to get an "M" to wear out by writing about 120GB a day
to it for a year. But rough calculations show that I'm likely years from
trouble... By then I'll have upgraded to the gen 3 or 4 drives.

> --
> Greg Smith 2ndQuadrant Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@xxxxxxxxxxxxxxx www.2ndQuadrant.com

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance