Greg Smith wrote:
> Kevin Grittner wrote:
> > I don't know at the protocol level; I just know that write barriers
> > do *something* which causes our controllers to wait for actual disk
> > platter persistence, while fsync does not
>
> It's in the docs now:
> http://www.postgresql.org/docs/9.0/static/wal-reliability.html
>
> FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce
> barriers on that type of drive.  Here's what the relevant portion of the
> ATAPI spec says:
>
> "This command is used by the host to request the device to flush the
> write cache.  If there is data in the write
> cache, that data shall be written to the media.  The BSY bit shall remain
> set to one until all data has been
> successfully written or an error occurs."
>
> SAS systems have a similar call named SYNCHRONIZE CACHE.
>
> The improvement I actually expect to arrive here first is a reliable
> implementation of O_SYNC/O_DSYNC writes.  Both SAS and SATA drives that
> are capable of doing Native Command Queueing support a write type called
> "Force Unit Access", which is essentially just like a direct write that
> cannot be cached.  When we get more kernels with reliable sync writing
> that maps under the hood to FUA, and can change wal_sync_method to use
> them, the need to constantly call fsync for every write to the WAL will
> go away.  Then the "blow out the RAID cache when barriers are on"
> behavior will only show up during checkpoint fsyncs, which will make
> things a lot better (albeit still not ideal).

Great information!  I have added the attached documentation patch to
explain the write-barrier/BBU interaction.  This will appear in the 9.0
documentation.

--
  Bruce Momjian  <bruce@xxxxxxxxxx>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +
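As an aside on the O_SYNC/O_DSYNC point above, here is a minimal C sketch
of the two write styles being compared.  It is purely illustrative -- not
what PostgreSQL itself does -- the file names are invented and error
checking is omitted for brevity:

    /*
     * Two ways to make an 8kB write durable: an explicit fsync after an
     * ordinary write, versus opening the file with O_DSYNC so each write
     * returns only once the data has been handed off as stable.  On
     * kernel/drive combinations where O_DSYNC maps to FUA writes, the
     * second path can avoid flushing the entire drive or controller cache.
     */
    #include <fcntl.h>
    #include <unistd.h>

    static char page[8192];

    int main(void)
    {
        /* Path 1: ordinary buffered write followed by an explicit fsync. */
        int fd1 = open("wal-test-1", O_WRONLY | O_CREAT, 0600);
        write(fd1, page, sizeof(page));
        fsync(fd1);                 /* push the data out of the OS cache */
        close(fd1);

        /* Path 2: O_DSYNC write; no separate fsync call is needed. */
        int fd2 = open("wal-test-2", O_WRONLY | O_CREAT | O_DSYNC, 0600);
        write(fd2, page, sizeof(page));
        close(fd2);
        return 0;
    }

Whether either path actually reaches the platters still depends on the
drive-cache and barrier behavior discussed above, which is exactly the
problem being documented.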
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.66
diff -c -c -r1.66 wal.sgml
*** doc/src/sgml/wal.sgml	13 Apr 2010 14:15:25 -0000	1.66
--- doc/src/sgml/wal.sgml	7 Jul 2010 13:55:58 -0000
***************
*** 48,68 ****
     some later time.  Such caches can be a reliability hazard because the
     memory in the disk controller cache is volatile, and will lose its
     contents in a power failure.  Better controller cards have
!    <firstterm>battery-backed</> caches, meaning the card has a battery that
     maintains power to the cache in case of system power loss.  After power
     is restored the data will be written to the disk drives.
    </para>
  
    <para>
     And finally, most disk drives have caches. Some are write-through
!    while some are write-back, and the
!    same concerns about data loss exist for write-back drive caches as
!    exist for disk controller caches.  Consumer-grade IDE and SATA drives are
!    particularly likely to have write-back caches that will not survive a
!    power failure, though <acronym>ATAPI-6</> introduced a drive cache
!    flush command (FLUSH CACHE EXT) that some file systems use, e.g. <acronym>ZFS</>.
!    Many solid-state drives (SSD) also have volatile write-back
!    caches, and many do not honor cache flush commands by default.
     To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
--- 48,74 ----
     some later time.  Such caches can be a reliability hazard because the
     memory in the disk controller cache is volatile, and will lose its
     contents in a power failure.  Better controller cards have
!    <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
!    the card has a battery that
     maintains power to the cache in case of system power loss.  After power
     is restored the data will be written to the disk drives.
    </para>
  
    <para>
     And finally, most disk drives have caches. Some are write-through
!    while some are write-back, and the same concerns about data loss
!    exist for write-back drive caches as exist for disk controller
!    caches.  Consumer-grade IDE and SATA drives are particularly likely
!    to have write-back caches that will not survive a power failure,
!    though <acronym>ATAPI-6</> introduced a drive cache flush command
!    (<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
!    <acronym>ZFS</>, <acronym>ext4</>.  (The SCSI command
!    <command>SYNCHRONIZE CACHE</> has long been available.)  Many
!    solid-state drives (SSD) also have volatile write-back caches, and
!    many do not honor cache flush commands by default.
!   </para>
! 
!   <para>
     To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
***************
*** 83,88 ****
--- 89,113 ----
    </para>
  
    <para>
+    Many file systems that use write barriers (e.g. <acronym>ZFS</>,
+    <acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or
+    <command>SYNCHRONIZE CACHE</> commands to flush data to the platters on
+    write-back-enabled drives.  Unfortunately, such write barrier file
+    systems behave suboptimally when combined with battery-backed unit
+    (<acronym>BBU</>) disk controllers.  In such setups, the synchronize
+    command forces all data from the BBU to the disks, eliminating much
+    of the benefit of the BBU.  You can run the utility
+    <filename>src/tools/fsync</> in the PostgreSQL source tree to see
+    if you are affected.  If you are affected, the performance benefits
+    of the BBU cache can be regained by turning off write barriers in
+    the file system or reconfiguring the disk controller, if that is
+    an option.  If write barriers are turned off, make sure the battery
+    remains active; a faulty battery can potentially lead to data loss.
+    Hopefully file system and disk controller designers will eventually
+    address this suboptimal behavior.
+   </para>
+ 
+   <para>
     When the operating system sends a write request to the storage
     hardware, there is little it can do to make sure the data has arrived
     at a truly non-volatile storage area.  Rather, it is the
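And as a footnote to the src/tools/fsync mention in the patch: the check
boils down to timing a loop of write-plus-fsync calls.  Below is only a
rough standalone sketch in that spirit -- it is not the actual utility,
the file name is invented, and error handling is minimal.  As a rule of
thumb, thousands of fsyncs per second suggest some cache (hopefully a
BBU) is absorbing the flushes, while a rate stuck in the low hundreds or
less suggests each flush is waiting on the platters:

    /*
     * Time a run of 8kB write+fsync calls and report the rate.
     * Loosely in the spirit of src/tools/fsync; not the real tool.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define LOOPS 1000

    int main(void)
    {
        char            buf[8192];
        struct timespec start, finish;
        double          secs;
        int             i;
        int             fd = open("fsync-test-file", O_WRONLY | O_CREAT, 0600);

        memset(buf, 0, sizeof(buf));
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < LOOPS; i++)
        {
            if (write(fd, buf, sizeof(buf)) != sizeof(buf) || fsync(fd) != 0)
                return 1;
            lseek(fd, 0, SEEK_SET);     /* rewrite the same block each pass */
        }
        clock_gettime(CLOCK_MONOTONIC, &finish);

        secs = (finish.tv_sec - start.tv_sec) +
               (finish.tv_nsec - start.tv_nsec) / 1e9;
        printf("%d fsyncs in %.3f seconds (%.0f per second)\n",
               LOOPS, secs, LOOPS / secs);
        close(fd);
        unlink("fsync-test-file");
        return 0;
    }

(On older glibc you may need to link with -lrt for clock_gettime.)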