Re: BBU Cache vs. spindles

Greg Smith <greg@xxxxxxxxxxxxxxx> · Sun, 24 Oct 2010 12:53:13 -0400

James Mansion wrote:
> When I looked at the internals of TokyoCabinet for example, the design
> was flawed but would be 'fairly robust' so long as mmap'd pages that
> were dirtied did not get persisted until msync, and were then
> persisted atomically.

If TokyoCabinet presumes that's true and overwrites existing blocks with 
that assumption, it would land on my list of databases I wouldn't 
trust to hold my TODO list.  Flip off power to a server, and you have no 
idea what portion of the blocks sitting in the drive's cache actually 
made it to disk; that's not even guaranteed atomic to the byte level.  
Torn pages happen all the time unless you either a) put the entire write 
into a non-volatile cache before writing any of it, b) write and sync 
somewhere else first and then do a journaled filesystem pointer swap 
from the old page to the new one, or c) journal the whole write the way 
PostgreSQL does with full_page_writes and the WAL.  The discussion here 
veered off into whether (a) was sufficiently satisfied just by having a 
RAID controller with battery backup, and what I concluded from the dive 
into the details is that it definitely isn't unless the filesystem 
block size exactly matches the database one.  And even then, make sure 
you test heavily.
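
To make option (b) a bit more concrete, here is a rough Python sketch of
the "write and sync somewhere else, then swap" idea, done at the
whole-file level rather than per page.  The function name
replace_atomically is made up for this illustration; it is not what
TokyoCabinet or PostgreSQL actually do.  It assumes a POSIX filesystem
where rename is atomic, and it only helps if the drive and controller
really honor the flush requests behind fsync.

```python
import os
import tempfile


def replace_atomically(path, data):
    """Replace the file at `path` with `data` without overwriting in place.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new copy next to the old one, so the rename stays
    # within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Ask the OS to push the new copy to stable storage
            # before the old copy is touched at all.
            os.fsync(tmp.fileno())
        # Swap the name from the old copy to the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, leave the old file alone and drop the temp copy.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is durable (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The point of the ordering is that a crash at any step should leave
either the complete old file or the complete new one under the name,
never a mix of the two.  Whether that holds on a given stack is exactly
the sort of thing you need to test with real power pulls.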

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)