On Mon, 31 Mar 2008, James Mansion wrote:
> Is it correct that POSIX requires that the updates to a single
> file are serialised in the filesystem layer?
Quoting from Lewine's "POSIX Programmer's Guide":
"After a write() to a regular file has successfully returned, any
successful read() from each byte position in the file that was modified by
that write() will return the data that was written by the write()...a
similar requirement applies to multiple write operations to the same file
position"
That's the "contract" that has to be honored. How your filesystem
actually implements this contract is none of a POSIX write() call's
business, so long as it does.
Multiple writers to the same file can still end up serialized somewhere
because of how this call is implemented, though, so you're correct that
this is a possible practical impact.
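To make the contract concrete, here is a minimal sketch in Python. It is my
own illustration, not anything from PostgreSQL: the function name and the
8 kB size are assumptions. It writes the same file position twice through
one descriptor and reads it back through a second one, without any sync call.

```python
import os
import tempfile

def read_after_write_demo():
    """Write via one descriptor, read via another descriptor on the same file."""
    fd, path = tempfile.mkstemp()
    try:
        os.pwrite(fd, b"A" * 8192, 0)        # first write at offset 0
        os.pwrite(fd, b"B" * 8192, 0)        # second write to the same position
        reader = os.open(path, os.O_RDONLY)  # independent open of the same file
        try:
            # No fsync has been issued; the read must still see the last write.
            return os.pread(reader, 8192, 0)
        finally:
            os.close(reader)
    finally:
        os.close(fd)
        os.unlink(path)

if __name__ == "__main__":
    print(read_after_write_demo() == b"B" * 8192)
```

Note that this only demonstrates visibility to readers. It says nothing
about whether either write has reached the disk.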
> So, if we have a number of dirty pages to write back to a single
> file in the database (whether a table or index) then we cannot
> pass these through the POSIX filesystem layer into the TCQ/NCQ
> system on the disk drive, so it can reorder them?
As long as the reordering mechanism also honors that any reads that come
after a write to a block reflect that write, they can be reordered. The
filesystem and drives are already doing elevator sorting and similar
mechanisms underneath you to optimize things. Unless you use a sync
operation or some sort of write barrier, you don't really know what has
happened.
> I have seen suggestions that on Solaris this can be relaxed.
There are some good notes in this area at:
http://www.solarisinternals.com/wiki/index.php/Direct_I/O and
http://www.solarisinternals.com/wiki/index.php/ZFS_Performance
It's clear that such relaxation has benefits with some of Oracle's
mechanisms as described. But amusingly, PostgreSQL doesn't even support
Solaris's direct I/O method right now unless you override the filesystem
mounting options, so you end up needing to split it out and hack at that
level regardless.
> I *assume* that PostgreSQL's lack of threads or AIO and the
> single bgwriter means that PostgreSQL 8.x does not normally
> attempt to make any use of such a relaxation but could do so if the
> bgwriter fails to keep up and other backends initiate flushes.
PostgreSQL writes transactions to the WAL. Once a transaction's WAL records
have reached disk, confirmed by a successful f[data]sync or a completed
synchronous write, that transaction is committed. Eventually the impacted items in the
buffer cache will be written as well. At checkpoint time, things are
reconciled such that all dirty buffers at that point have been written,
and now f[data]sync is called on each touched file to make sure those
changes have made it to disk.
Writes are assumed to be lost in some memory (kernel, filesystem or disk
cache) until they've been confirmed to be written to disk via the sync
mechanism. When a backend flushes a buffer out, as soon as the OS caches
that write the database backend moves on without being concerned about how
it's eventually going to get to disk one day. As long as the newly
written version comes back again if it's read, the database doesn't worry
about what's happening until it specifically asks for a sync that proves
everything is done. So if the backends or the background writer are
spewing updates out, they don't care that the OS doesn't guarantee the order
in which those writes hit disk before checkpoint time; only the synchronous
WAL writes need that guarantee.
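The sequence described above can be sketched as a toy model. This is not
PostgreSQL's code: the file names, the helper functions and the PAGE size
are all hypothetical, chosen only to show where the sync calls sit relative
to the ordinary writes.

```python
import os
import tempfile

PAGE = 8192  # assumed page size for this sketch

def commit(wal_fd, record):
    """Append a log record; treat it as committed only once fsync returns."""
    os.write(wal_fd, record)
    os.fsync(wal_fd)  # os.fdatasync(wal_fd) is an alternative where available

def flush_buffer(data_fd, page_no, page):
    """Hand a dirty page to the OS. No sync, so when it hits disk is unknown."""
    os.pwrite(data_fd, page, page_no * PAGE)

def checkpoint(touched_fds):
    """Sync every data file written since the previous checkpoint."""
    for fd in touched_fds:
        os.fsync(fd)

def demo():
    with tempfile.TemporaryDirectory() as workdir:
        wal_path = os.path.join(workdir, "wal")
        data_path = os.path.join(workdir, "table")
        wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        data_fd = os.open(data_path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            record = b"update page 1\n"
            commit(wal_fd, record)                 # durable before we continue
            flush_buffer(data_fd, 1, b"x" * PAGE)  # order on disk not guaranteed
            flush_buffer(data_fd, 0, b"y" * PAGE)
            checkpoint({data_fd})                  # now the data file is synced too
            return os.path.getsize(wal_path), os.path.getsize(data_path), len(record)
        finally:
            os.close(wal_fd)
            os.close(data_fd)

if __name__ == "__main__":
    print(demo())
```

The only places the sketch waits on the disk are commit() and checkpoint();
flush_buffer() returns as soon as the OS has accepted the write.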
Also note that it's usually the case that backends write a substantial
percentage of the buffers out themselves. You should assume that's the
case unless you've done some work to prove the background writer is
handling most writes (which is difficult to even know before 8.3, much
less tune for).
That's how I understand everything to work, at least. I will add the
disclaimer that I haven't looked at the archive recovery code much yet.
Maybe it has some expectation about general database write ordering
in order for the WAL replay mechanism to work correctly, but I can't imagine
how that could work.
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)