Bruce Momjian wrote:
> Greg Smith wrote:
> > Kevin Grittner wrote:
> > > I assume that we send a full
> > > 8K to the OS cache, and the file system writes disk sectors
> > > according to its own algorithm.  With either platters or BBU cache,
> > > the data is persisted on fsync; why do you see a risk with one but
> > > not the other?
> >
> > I'd like a 10 minute argument please.  I started to write something to
> > refute this, only to clarify in my head the sequence of events that
> > leads to the most questionable result, where I feel a bit less certain
> > than I did before of the safety here.  Here is the worst case I believe
> > you're describing:
> >
> > 1) Transaction is written to the WAL and sync'd; client receives
> > COMMIT.  Since full_page_writes is off, the data in the WAL consists
> > only of the delta of what changed on the page.
> > 2) 8K database page is written to OS cache
> > 3) PG calls fsync to force the database block out
> > 4) OS writes first 4K block of the change to the BBU write cache.  Worst
> > case, this fills the cache, and it takes a moment for some random writes
> > to process before it has space to buffer again (makes this more likely
> > to happen, but it's not required to see the failure case here)
> > 5) Sudden power interruption, second half of the page write is lost
> > 6) Server restarts
> > 7) That 4K write is now replayed from the battery's cache
> >
> > At this point, you now have a torn 8K page, with 1/2 old and 1/2 new
>
> Based on this report, I think we need to update our documentation and
> backpatch removal of text that says that BBU users can safely turn off
> full-page writes.  Patch attached.
>
> I think we have fallen into a trap I remember from the late 1990's where
> I was assuming that an 8k-block based file system would write to the
> disk atomically in 8k segments, which of course it cannot.  My bet is
> that even if you write to the kernel in 8k pages, and have an 8k file
> system, the disk is still accessed via 512-byte blocks, even with a BBU.

Doc patch applied.

--
  Bruce Momjian  <bruce@xxxxxxxxxx>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
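
For readers who want to see the failure sequence above in concrete terms, here is a minimal Python sketch of steps 2 through 7. It is a toy model, not PostgreSQL or controller code: the 4K transfer size comes from the scenario in the thread, while the names write_page and blocks_before_power_loss and the byte patterns are hypothetical and chosen only for illustration.

```python
PAGE_SIZE = 8192    # PostgreSQL page size discussed in the thread
BLOCK_SIZE = 4096   # assumed size of each write the OS hands to the BBU cache


def write_page(disk: bytearray, offset: int, new_page: bytes,
               blocks_before_power_loss: int) -> None:
    """Copy new_page to disk in BLOCK_SIZE chunks (hypothetical helper).

    Only the first blocks_before_power_loss chunks reach stable storage;
    the rest are lost, as in step 5 of the scenario.
    """
    assert len(new_page) == PAGE_SIZE
    for i in range(PAGE_SIZE // BLOCK_SIZE):
        if i >= blocks_before_power_loss:
            return  # power fails here; later chunks never arrive
        start = i * BLOCK_SIZE
        disk[offset + start:offset + start + BLOCK_SIZE] = \
            new_page[start:start + BLOCK_SIZE]


old_page = b"O" * PAGE_SIZE   # page contents before the transaction
new_page = b"N" * PAGE_SIZE   # page contents the transaction intended

disk = bytearray(old_page)
# The first 4K chunk is preserved by the battery-backed cache and replayed
# after restart; the second chunk was never handed to the controller.
write_page(disk, 0, new_page, blocks_before_power_loss=1)

first_half = bytes(disk[:BLOCK_SIZE])
second_half = bytes(disk[BLOCK_SIZE:])
is_torn = (first_half == new_page[:BLOCK_SIZE]
           and second_half == old_page[BLOCK_SIZE:])
print("torn page:", is_torn)
```

The point of the sketch is that the battery protects only what has already reached the controller, so the page on disk ends up matching neither the old nor the new version.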
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml	Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml	Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have a battery-backed disk
!    controller or file-system software that prevents partial page writes
!    (e.g., ZFS), you can turn off this page imaging by turning off the
!    <xref linkend="guc-full-page-writes"> parameter.
    </para>
   </sect1>

--- 164,175 ----
     <productname>PostgreSQL</> periodically writes full page images to
     permanent WAL storage <emphasis>before</> modifying the actual page on
     disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!    restore partially-written pages.  If you have file-system software
!    that prevents partial page writes (e.g., ZFS), you can turn off
!    this page imaging by turning off the <xref
!    linkend="guc-full-page-writes"> parameter.  Battery-Backed unit
!    (BBU) disk controllers do not prevent partial page writes unless
!    they guarantee that data is written to the BBU as full (8kB) pages.
    </para>
   </sect1>

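
The documentation text in the patch says that full page images let crash recovery restore partially-written pages. The following Python sketch illustrates one reason a delta-only WAL record may not be enough. It is a deliberately simplified model resting on stated assumptions: the page keeps an LSN in its first 8 bytes, redo skips a record when the page LSN is already at or past the record's LSN, and the record layout (a dict with lsn, delta, and full_page_image keys) and the function names are hypothetical, not PostgreSQL's actual structures.

```python
import struct

PAGE_SIZE = 8192
HALF = PAGE_SIZE // 2


def make_page(lsn: int, fill: bytes) -> bytearray:
    """Build a toy page: 8-byte LSN header followed by filler bytes."""
    page = bytearray(fill * PAGE_SIZE)
    struct.pack_into("<Q", page, 0, lsn)
    return page


def page_lsn(page: bytearray) -> int:
    return struct.unpack_from("<Q", page, 0)[0]


def redo(page: bytearray, record: dict) -> None:
    """Replay one toy WAL record against a page (simplified assumption)."""
    if record["full_page_image"] is not None:
        page[:] = record["full_page_image"]   # overwrite the whole page
        return
    if page_lsn(page) >= record["lsn"]:
        return   # page claims to be up to date, so the delta is skipped
    offset, data = record["delta"]
    page[offset:offset + len(data)] = data
    struct.pack_into("<Q", page, 0, record["lsn"])


old_page = make_page(lsn=100, fill=b"O")
new_page = bytearray(old_page)
new_page[6000:6003] = b"NEW"                 # change lands in the second half
struct.pack_into("<Q", new_page, 0, 200)     # header in the first half

# Torn write: first half (with the new LSN) arrived, second half did not.
torn = bytearray(new_page[:HALF] + old_page[HALF:])

delta_only = {"lsn": 200, "delta": (6000, b"NEW"), "full_page_image": None}
with_image = {"lsn": 200, "delta": (6000, b"NEW"),
              "full_page_image": bytes(new_page)}

delta_result = bytearray(torn)
redo(delta_result, delta_only)

image_result = bytearray(torn)
redo(image_result, with_image)

print("delta-only replay repaired page:", delta_result == new_page)
print("full-page image repaired page:  ", image_result == new_page)
```

In this model the torn page carries the new LSN in its header, so the delta is skipped and the stale second half survives, while the full page image overwrites the page regardless of its state.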
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance