Re: POSIX file updates

On Mon, 31 Mar 2008, James Mansion wrote:

> Is it correct that POSIX requires that the updates to a single
> file are serialised in the filesystem layer?

Quoting from Lewine's "POSIX Programmer's Guide":

"After a write() to a regular file has successfully returned, any successful read() from each byte position in the file that was modified by that write() will return the data that was written by the write()...a similar requirement applies to multiple write operations to the same file position"

That's the "contract" that has to be honored. How your filesystem actually implements this contract is none of a POSIX write() call's business, so long as it does.

Depending on how that call is implemented, multiple writers to the same file can end up serialized somewhere along the way, so you're correct that this is a possible practical impact.
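To make that contract concrete, here is a minimal sketch in Python (my choice for illustration; the helper name write_then_read and the file name are made up). It writes through one descriptor and reads the same byte range back through a second one, with no sync call in between:

```python
import os
import tempfile


def write_then_read(path: str, offset: int, payload: bytes) -> bytes:
    """Write payload at offset via one descriptor, read it back via another."""
    writer = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    reader = os.open(path, os.O_RDONLY)
    try:
        # Loop because pwrite() may write fewer bytes than requested.
        written = 0
        while written < len(payload):
            written += os.pwrite(writer, payload[written:], offset + written)
        # No fsync here: the read-after-write guarantee does not depend on
        # the data having reached the disk.
        return os.pread(reader, len(payload), offset)
    finally:
        os.close(reader)
        os.close(writer)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.dat")  # hypothetical file name
        print(write_then_read(demo_path, 8192, b"page-one"))
```

The read sees the new bytes whether or not they have left the OS cache, which is all the quoted passage promises. It says nothing about durability.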

> So, if we have a number of dirty pages to write back to a single
> file in the database (whether a table or index) then we cannot
> pass these through the POSIX filesystem layer into the TCQ/NCQ
> system on the disk drive, so it can reorder them?

The writes can be reordered, as long as the reordering mechanism also guarantees that any read that comes after a write to a block reflects that write. The filesystem and drives are already doing elevator sorting and similar mechanisms underneath you to optimize things. Unless you use a sync operation or some sort of write barrier, you don't really know what has happened.

> I have seen suggestions that on Solaris this can be relaxed.

There are some good notes in this area at:

http://www.solarisinternals.com/wiki/index.php/Direct_I/O and http://www.solarisinternals.com/wiki/index.php/ZFS_Performance

It's clear that such relaxation has benefits with some of Oracle's mechanisms as described. But amusingly, PostgreSQL doesn't even support Solaris's direct I/O method right now unless you override the filesystem mounting options, so you end up needing to split it out and hack at that level regardless.

> I *assume* that PostgreSQL's lack of threads or AIO and the
> single bgwriter means that PostgreSQL 8.x does not normally
> attempt to make any use of such a relaxation but could do so if the
> bgwriter fails to keep up and other backends initiate flushes.

PostgreSQL writes transactions to the WAL. Once they have reached disk, confirmed by a successful f[data]sync or a completed synchronous write, that transaction is committed. Eventually the impacted items in the buffer cache will be written as well. At checkpoint time, things are reconciled such that all buffers that were dirty at that point have been written, and then f[data]sync is called on each touched file to make sure those changes have made it to disk.
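The commit rule can be sketched roughly like this. This is an illustration of the idea only, not PostgreSQL's actual code; the commit() helper, the record format and the file name are all invented for the example:

```python
import os
import tempfile

# fdatasync() is not available on every platform; fall back to fsync().
_sync = getattr(os, "fdatasync", os.fsync)


def commit(wal_fd: int, record: bytes) -> None:
    """Append one record to the log and return only once the sync succeeded."""
    view = memoryview(record)
    while view:
        # write() may be partial, so keep going until the record is out.
        view = view[os.write(wal_fd, view):]
    # Until this call returns, the record may exist only in some cache.
    _sync(wal_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wal_path = os.path.join(tmp, "wal.log")  # hypothetical log file
        fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            commit(fd, b"txn 1: update row 42\n")
        finally:
            os.close(fd)
```

The other route mentioned above, a completed synchronous write, would correspond to opening the log with a flag such as O_SYNC or O_DSYNC where the platform provides it, instead of calling the sync function after each append.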

Writes are assumed to be lost in some memory (kernel, filesystem or disk cache) until they've been confirmed to be written to disk via the sync mechanism. When a backend flushes a buffer out, as soon as the OS caches that write the database backend moves on without being concerned about how it's eventually going to get to disk one day. As long as the newly written version comes back again if it's read, the database doesn't worry about what's happening until it specifically asks for a sync that proves everything is done. So if the backends or the background writer are spewing updates out, they don't care that the OS doesn't guarantee the order in which those updates hit disk before checkpoint time; only the synchronous WAL writes depend on ordering.
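Here is a rough sketch of that "write now, prove it later" pattern. The BufferWriter class, the 8192-byte page size and the file names are assumptions made for the example, not how the backend is actually structured:

```python
import os
import tempfile

PAGE_SIZE = 8192  # assumed page size for this sketch


class BufferWriter:
    """Hands pages to the OS immediately, syncs only at checkpoint time."""

    def __init__(self) -> None:
        self.touched = set()  # files written since the last checkpoint

    def flush_page(self, path: str, page_no: int, data: bytes) -> None:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            done = 0
            while done < len(data):
                done += os.pwrite(fd, data[done:], page_no * PAGE_SIZE + done)
        finally:
            os.close(fd)
        # No sync: the OS and the drive may reorder these writes freely.
        self.touched.add(path)

    def checkpoint(self) -> list:
        """fsync every touched file; afterwards the writes are known durable."""
        synced = sorted(self.touched)
        for path in synced:
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        self.touched.clear()
        return synced


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table = os.path.join(tmp, "table.dat")  # hypothetical file names
        index = os.path.join(tmp, "index.dat")
        writer = BufferWriter()
        writer.flush_page(table, 3, b"t" * PAGE_SIZE)
        writer.flush_page(index, 0, b"i" * PAGE_SIZE)
        writer.flush_page(table, 1, b"u" * PAGE_SIZE)
        print(len(writer.checkpoint()), "files synced")
```

Nothing in flush_page() cares which page reaches the platters first. The only ordering point is the fsync loop in checkpoint().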

Also note that it's usually the case that backends write a substantial percentage of the buffers out themselves. You should assume that's the case unless you've done some work to prove the background writer is handling most writes (which is difficult to even know before 8.3, much less tune for).

That's how I understand everything to work, at least. I will add the disclaimer that I haven't looked at the archive recovery code much yet. Maybe it has some expectation about general database write ordering in order for the WAL replay mechanism to work correctly, though I can't imagine how that could work.

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
