Re: BBU Cache vs. spindles

Bruce Momjian <bruce@xxxxxxxxxx> · Tue, 30 Nov 2010 22:07:18 -0500 (EST)

Greg Smith wrote:
> Kevin Grittner wrote:
> > I assume that we send a full
> > 8K to the OS cache, and the file system writes disk sectors
> > according to its own algorithm.  With either platters or BBU cache,
> > the data is persisted on fsync; why do you see a risk with one but
> > not the other?
> 
> I'd like a 10 minute argument please.  I started to write something to 
> refute this, only to clarify in my head the sequence of events that 
> leads to the most questionable result, where I feel a bit less certain 
> than I did before of the safety here.  Here is the worst case I believe 
> you're describing:
> 
> 1) Transaction is written to the WAL and sync'd; client receives 
> COMMIT.  Since full_page_writes is off, the data in the WAL consists 
> only of the delta of what changed on the page.
> 2) 8K database page is written to OS cache
> 3) PG calls fsync to force the database block out
> 4) OS writes first 4K block of the change to the BBU write cache.  Worst 
> case, this fills the cache, and it takes a moment for some random writes 
> to process before it has space to buffer again (makes this more likely 
> to happen, but it's not required to see the failure case here)
> 5) Sudden power interruption, second half of the page write is lost
> 6) Server restarts
> 7) That 4K write is now replayed from the battery's cache
> 
> At this point, you now have a torn 8K page, with 1/2 old and 1/2 new 

Based on this report, I think we need to update our documentation and
backpatch removal of text that says that BBU users can safely turn off
full-page writes.  Patch attached.

I think we have fallen into a trap I remember from the late 1990s, when
I assumed that a file system with 8k blocks would write to the disk
atomically in 8k segments, which of course it cannot.  My bet is that
even if you write to the kernel in 8k pages, and have an 8k file system,
the disk is still accessed via 512-byte blocks, even with a BBU.
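
To make the failure mode concrete, here is a toy in-memory sketch in
Python.  It is not real I/O and not PostgreSQL code; the 512-byte sector
size is the guess from the paragraph above, and write_page, is_torn and
sectors_written are hypothetical names made up for the illustration.
It models an 8k page written one sector at a time, with power lost
after the first 4k, as in Greg's sequence.

```python
SECTOR = 512   # assumed sector size, as guessed in the text above
PAGE = 8192    # PostgreSQL's default page size (8kB)


def write_page(disk, new_page, sectors_written):
    """Copy new_page onto disk one sector at a time, stopping early.

    sectors_written models how many sectors reached stable storage
    before the power failed (hypothetical parameter).
    """
    limit = min(sectors_written, PAGE // SECTOR)
    for i in range(limit):
        lo, hi = i * SECTOR, (i + 1) * SECTOR
        disk[lo:hi] = new_page[lo:hi]


def is_torn(page, old_page, new_page):
    """A page is torn if it matches neither the old nor the new image."""
    return page != old_page and page != new_page


old_page = b"O" * PAGE
new_page = b"N" * PAGE

# Power is lost after 8 of the 16 sectors (the first 4k) are written.
disk = bytearray(old_page)
write_page(disk, new_page, sectors_written=8)
torn = bytes(disk)
print("torn:", is_torn(torn, old_page, new_page))

# With full_page_writes on, recovery can overwrite the whole page from
# the full page image saved in the WAL, whatever state the page is in.
disk[:] = new_page
recovered = bytes(disk)
print("torn after restore:", is_torn(recovered, old_page, new_page))
```

The point of the sketch is only that any write unit smaller than the
page leaves room for a half-old, half-new page; whether a given
controller behaves this way depends on the hardware.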

-- 
  Bruce Momjian  <bruce@xxxxxxxxxx>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml	Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml	Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have a battery-backed disk
!    controller or file-system software that prevents partial page writes
!    (e.g., ZFS),  you can turn off this page imaging by turning off the
!    <xref linkend="guc-full-page-writes"> parameter.
    </para>
   </sect1>

--- 164,175 ----
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have file-system software
!    that prevents partial page writes (e.g., ZFS),  you can turn off
!    this page imaging by turning off the <xref
!    linkend="guc-full-page-writes"> parameter. Battery-Backed unit
!    (BBU) disk controllers do not prevent partial page writes unless
!    they guarantee that data is written to the BBU as full (8kB) pages.
    </para>
   </sect1>
