Re: BBU Cache vs. spindles

Bruce Momjian <bruce@xxxxxxxxxx> · Wed, 22 Dec 2010 21:12:23 -0500 (EST)

Bruce Momjian wrote:
> Greg Smith wrote:
> > Kevin Grittner wrote:
> > > I assume that we send a full
> > > 8K to the OS cache, and the file system writes disk sectors
> > > according to its own algorithm.  With either platters or BBU cache,
> > > the data is persisted on fsync; why do you see a risk with one but
> > > not the other?
> > 
> > I'd like a 10 minute argument please.  I started to write something to 
> > refute this, only to clarify in my head the sequence of events that 
> > leads to the most questionable result, where I feel a bit less certain 
> > than I did before of the safety here.  Here is the worst case I believe 
> > you're describing:
> > 
> > 1) Transaction is written to the WAL and sync'd; client receives 
> > COMMIT.  Since full_page_writes is off, the data in the WAL consists 
> > only of the delta of what changed on the page.
> > 2) 8K database page is written to OS cache
> > 3) PG calls fsync to force the database block out
> > 4) OS writes first 4K block of the change to the BBU write cache.  Worst 
> > case, this fills the cache, and it takes a moment for some random writes 
> > to process before it has space to buffer again (makes this more likely 
> > to happen, but it's not required to see the failure case here)
> > 5) Sudden power interruption, second half of the page write is lost
> > 6) Server restarts
> > 7) That 4K write is now replayed from the battery's cache
> > 
> > At this point, you now have a torn 8K page, with 1/2 old and 1/2 new 
> 
> Based on this report, I think we need to update our documentation and
> backpatch removal of text that says that BBU users can safely turn off
> full-page writes.  Patch attached.
> 
> I think we have fallen into a trap I remember from the late 1990s, where
> I was assuming that an 8k-block based file system would write to the
> disk atomically in 8k segments, which of course it cannot.  My bet is
> that even if you write to the kernel in 8k pages, and have an 8k file
> system, the disk is still accessed via 512-byte blocks, even with a BBU.

Doc patch applied.
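
For anyone who wants to see the failure mode in miniature, here is a rough
sketch of the torn-page sequence Greg described.  It is only a toy model,
not how any real controller or file system behaves: the "disk" is a byte
array, the 512-byte write unit is my guess from above, and the names
write_page() and is_torn() are made up for illustration.

```python
SECTOR = 512   # assumed device write unit (the 512-byte guess above)
PAGE = 8192    # PostgreSQL page size (8kB)


def write_page(disk: bytearray, offset: int, new_page: bytes,
               sectors_before_power_loss: int) -> None:
    """Copy new_page onto disk one sector at a time.

    Stops after sectors_before_power_loss sectors to mimic a power cut;
    the sectors not yet written keep their old contents.
    """
    assert len(new_page) == PAGE
    for i in range(PAGE // SECTOR):
        if i == sectors_before_power_loss:
            return  # power lost mid-page
        start = i * SECTOR
        disk[offset + start:offset + start + SECTOR] = \
            new_page[start:start + SECTOR]


def is_torn(page: bytes, old: bytes, new: bytes) -> bool:
    """A page is torn if it matches neither the old nor the new image."""
    return page != old and page != new


old_page = b"O" * PAGE
new_page = b"N" * PAGE

disk = bytearray(old_page)
# Only the first 4K (8 sectors) reaches stable storage before the power cut.
write_page(disk, 0, new_page, sectors_before_power_loss=8)
torn = bytes(disk)

print(is_torn(torn, old_page, new_page))     # half new, half old
print(torn[:4096] == new_page[:4096])        # first 4K is the new data
print(torn[4096:] == old_page[4096:])        # second 4K is still old
```

With full_page_writes off, the WAL holds only the delta for that page, so
replay has nothing to rebuild the stale second half from.  With it on, the
full page image in the WAL overwrites the torn page during recovery.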

-- 
  Bruce Momjian  <bruce@xxxxxxxxxx>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml	Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml	Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have a battery-backed disk
!    controller or file-system software that prevents partial page writes
!    (e.g., ZFS),  you can turn off this page imaging by turning off the
!    <xref linkend="guc-full-page-writes"> parameter.
    </para>
   </sect1>

--- 164,175 ----
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have file-system software
!    that prevents partial page writes (e.g., ZFS),  you can turn off
!    this page imaging by turning off the <xref
!    linkend="guc-full-page-writes"> parameter. Battery-Backed unit
!    (BBU) disk controllers do not prevent partial page writes unless
!    they guarantee that data is written to the BBU as full (8kB) pages.
    </para>
   </sect1>

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)