Greg Smith wrote:
> Kevin Grittner wrote:
> > I don't know at the protocol level; I just know that write barriers
> > do *something* which causes our controllers to wait for actual disk
> > platter persistence, while fsync does not
>
> It's in the docs now:
> http://www.postgresql.org/docs/9.0/static/wal-reliability.html
>
> FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce
> barriers on that type of drive.  Here's what the relevant portion of the
> ATAPI spec says:
>
> "This command is used by the host to request the device to flush the
> write cache.  If there is data in the write
> cache, that data shall be written to the media.  The BSY bit shall remain
> set to one until all data has been
> successfully written or an error occurs."
>
> SAS systems have a similar call named SYNCHRONIZE CACHE.
>
> The improvement I actually expect to arrive here first is a reliable
> implementation of O_SYNC/O_DSYNC writes.  Both SAS and SATA drives that
> are capable of doing Native Command Queueing support a write type called
> "Force Unit Access", which is essentially just like a direct write that
> cannot be cached.  When we get more kernels with reliable sync writing
> that maps under the hood to FUA, and can change wal_sync_method to use
> them, the need to constantly call fsync for every write to the WAL will
> go away.  Then the "blow out the RAID cache when barriers are on"
> behavior will only show up during checkpoint fsyncs, which will make
> things a lot better (albeit still not ideal).

Great information!  I have added the attached documentation patch to
explain the write-barrier/BBU interaction.  This will appear in the 9.0
documentation.

--
  Bruce Momjian  <bruce@xxxxxxxxxx>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +
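As an aside on the O_SYNC/O_DSYNC point above, here is a minimal C sketch
of the two write styles being compared.  It is purely illustrative -- not
what PostgreSQL itself does -- the file names are invented and error
checking is omitted for brevity:

    /*
     * Two ways to make an 8kB write durable: an explicit fsync after an
     * ordinary write, versus opening the file with O_DSYNC so each write
     * returns only once the data has been handed off as stable.  On
     * kernel/drive combinations where O_DSYNC maps to FUA writes, the
     * second path can avoid flushing the entire drive or controller cache.
     */
    #include <fcntl.h>
    #include <unistd.h>

    static char page[8192];

    int main(void)
    {
        /* Path 1: ordinary buffered write followed by an explicit fsync. */
        int fd1 = open("wal-test-1", O_WRONLY | O_CREAT, 0600);
        write(fd1, page, sizeof(page));
        fsync(fd1);                 /* push the data out of the OS cache */
        close(fd1);

        /* Path 2: O_DSYNC write; no separate fsync call is needed. */
        int fd2 = open("wal-test-2", O_WRONLY | O_CREAT | O_DSYNC, 0600);
        write(fd2, page, sizeof(page));
        close(fd2);
        return 0;
    }

Whether either path actually reaches the platters still depends on the
drive-cache and barrier behavior discussed above, which is exactly the
problem being documented.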
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.66
diff -c -c -r1.66 wal.sgml
*** doc/src/sgml/wal.sgml	13 Apr 2010 14:15:25 -0000	1.66
--- doc/src/sgml/wal.sgml	7 Jul 2010 13:55:58 -0000
***************
*** 48,68 ****
     some later time.  Such caches can be a reliability hazard because the
     memory in the disk controller cache is volatile, and will lose its
     contents in a power failure.  Better controller cards have
!    <firstterm>battery-backed</> caches, meaning the card has a battery that
     maintains power to the cache in case of system power loss.  After power
     is restored the data will be written to the disk drives.
    </para>
  
    <para>
     And finally, most disk drives have caches. Some are write-through
!    while some are write-back, and the
!    same concerns about data loss exist for write-back drive caches as
!    exist for disk controller caches.  Consumer-grade IDE and SATA drives are
!    particularly likely to have write-back caches that will not survive a
!    power failure, though <acronym>ATAPI-6</> introduced a drive cache
!    flush command (FLUSH CACHE EXT) that some file systems use, e.g. <acronym>ZFS</>.
!    Many solid-state drives (SSD) also have volatile write-back
!    caches, and many do not honor cache flush commands by default.
     To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
--- 48,74 ----
     some later time.  Such caches can be a reliability hazard because the
     memory in the disk controller cache is volatile, and will lose its
     contents in a power failure.  Better controller cards have
!    <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
!    the card has a battery that
     maintains power to the cache in case of system power loss.  After power
     is restored the data will be written to the disk drives.
    </para>
  
    <para>
     And finally, most disk drives have caches. Some are write-through
!    while some are write-back, and the same concerns about data loss
!    exist for write-back drive caches as exist for disk controller
!    caches.  Consumer-grade IDE and SATA drives are particularly likely
!    to have write-back caches that will not survive a power failure,
!    though <acronym>ATAPI-6</> introduced a drive cache flush command
!    (<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
!    <acronym>ZFS</>, <acronym>ext4</>.  (The SCSI command
!    <command>SYNCHRONIZE CACHE</> has long been available.)  Many
!    solid-state drives (SSD) also have volatile write-back caches, and
!    many do not honor cache flush commands by default.
!   </para>
! 
!   <para>
     To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
***************
*** 83,88 ****
--- 89,113 ----
    </para>
  
    <para>
+    Many file systems that use write barriers (e.g. <acronym>ZFS</>,
+    <acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or
+    <command>SYNCHRONIZE CACHE</> commands to flush data to the platters on
+    write-back-enabled drives.  Unfortunately, such write barrier file
+    systems behave suboptimally when combined with battery-backed unit
+    (<acronym>BBU</>) disk controllers.  In such setups, the synchronize
+    command forces all data from the BBU to the disks, eliminating much
+    of the benefit of the BBU.  You can run the utility
+    <filename>src/tools/fsync</> in the PostgreSQL source tree to see
+    if you are affected.  If you are affected, the performance benefits
+    of the BBU cache can be regained by turning off write barriers in
+    the file system or reconfiguring the disk controller, if that is
+    an option.  If write barriers are turned off, make sure the battery
+    remains active; a faulty battery can potentially lead to data loss.
+    Hopefully file system and disk controller designers will eventually
+    address this suboptimal behavior.
+   </para>
+ 
+   <para>
     When the operating system sends a write request to the storage
     hardware, there is little it can do to make sure the data has arrived
     at a truly non-volatile storage area.  Rather, it is the
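And as a footnote to the src/tools/fsync mention in the patch: the check
boils down to timing a loop of write-plus-fsync calls.  Below is only a
rough standalone sketch in that spirit -- it is not the actual utility,
the file name is invented, and error handling is minimal.  As a rule of
thumb, thousands of fsyncs per second suggest some cache (hopefully a
BBU) is absorbing the flushes, while a rate stuck in the low hundreds or
less suggests each flush is waiting on the platters:

    /*
     * Time a run of 8kB write+fsync calls and report the rate.
     * Loosely in the spirit of src/tools/fsync; not the real tool.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define LOOPS 1000

    int main(void)
    {
        char            buf[8192];
        struct timespec start, finish;
        double          secs;
        int             i;
        int             fd = open("fsync-test-file", O_WRONLY | O_CREAT, 0600);

        memset(buf, 0, sizeof(buf));
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < LOOPS; i++)
        {
            if (write(fd, buf, sizeof(buf)) != sizeof(buf) || fsync(fd) != 0)
                return 1;
            lseek(fd, 0, SEEK_SET);     /* rewrite the same block each pass */
        }
        clock_gettime(CLOCK_MONOTONIC, &finish);

        secs = (finish.tv_sec - start.tv_sec) +
               (finish.tv_nsec - start.tv_nsec) / 1e9;
        printf("%d fsyncs in %.3f seconds (%.0f per second)\n",
               LOOPS, secs, LOOPS / secs);
        close(fd);
        unlink("fsync-test-file");
        return 0;
    }

(On older glibc you may need to link with -lrt for clock_gettime.)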