On Wed, 18 Mar 2009, Marco Colombo wrote:
> If you fsync() after each write you want ordered, there can't be any "subsequent I/O" (unless there are many different processes concurrently writing to the file w/o synchronization).
Inside PostgreSQL, each of the database backend processes ends up writing blocks to the database disk, if they need to allocate a new buffer and the one they are handed is dirty. You can easily have several of those writing to the same 1GB underlying file on disk. So that prerequisite is there. The main potential for a problem here would be if a stray unsynchronized write from one of those backends happened in a way that wasn't accounted for by the WAL+checkpoint design. What I was suggesting is that the way that synchronization happens in the database provides some defense from running into problems in this area.
The way backends handle writes themselves is also why your suggestion about the database being able to utilize barriers isn't really helpful. Those writes trickle out all the time, and normally you don't even have to care about ordering them. The only time you do need to care is at checkpoint time, and there only a hard line is really practical: all writes up to that point, period. Trying to implement ordered writes for everything that happened before then would complicate the code base, which isn't going to happen for such a platform- and filesystem-specific feature, one that really doesn't offer much acceleration from the database's perspective.
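To make the "hard line" idea concrete, here is a toy model in Python. It is only a sketch of the general pattern, not PostgreSQL's actual code: the class name CheckpointSketch and its methods are made up for illustration. Writes go out unordered and unsynced, and the checkpoint simply fsyncs every file touched since the last one.

```python
import os
import tempfile


class CheckpointSketch:
    """Toy model (hypothetical names): unordered writes, one hard sync point."""

    def __init__(self):
        # Paths written since the last checkpoint.
        self.pending = set()

    def backend_write(self, path, offset, data):
        # A backend pushes out a dirty block; no fsync, no ordering guarantee.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        try:
            os.pwrite(fd, data, offset)
        finally:
            os.close(fd)
        self.pending.add(path)

    def checkpoint(self):
        # The hard line: every file written before this point gets fsync'd.
        flushed = sorted(self.pending)
        for path in flushed:
            fd = os.open(path, os.O_WRONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        self.pending.clear()
        return flushed
```

Nothing in backend_write cares which block reaches disk first; the only promise is that once checkpoint() returns, everything written before it has been handed to fsync.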
> only when the journal wraps around there's an (extremely) small window of vulnerability. You need to write a carefully crafted torture program to get any chance to observe that... such a program exists, and triggers the problem
Yeah, I've been following all that. The PostgreSQL WAL design works on ext2 filesystems with no journal at all. Some people even put their pg_xlog directory onto ext2 filesystems for best performance, relying on the WAL to be the journal. As long as fsync is honored correctly, the WAL writes should be re-writing already allocated space, which makes this category of journal mayhem not so much of a problem. But when I read about fsync doing unexpected things, that gets me more concerned.
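As a rough illustration of why re-writing already allocated space matters, here is a minimal Python sketch. It is not how PostgreSQL implements its WAL; SEGMENT_SIZE, BLOCK_SIZE, preallocate and overwrite_record are hypothetical names picked for the example. The file is zero-filled and fsync'd once, and later writes only overwrite bytes inside that space, so the file size does not change on the hot path.

```python
import os
import tempfile

SEGMENT_SIZE = 64 * 1024  # hypothetical; far smaller than a real WAL segment
BLOCK_SIZE = 8192         # hypothetical block size for the zero-fill loop


def preallocate(path, size=SEGMENT_SIZE):
    """Zero-fill the file and fsync it so its space is allocated up front."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        zeros = bytes(BLOCK_SIZE)
        for offset in range(0, size, BLOCK_SIZE):
            os.pwrite(fd, zeros, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
    # A real implementation would also need to sync the containing directory
    # so the new file's directory entry is durable; omitted here.


def overwrite_record(path, offset, payload):
    """Overwrite bytes inside the preallocated area, then fsync."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, payload, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
    # Return the size so a caller can confirm the file did not grow.
    return os.path.getsize(path)
```

Because overwrite_record never extends the file, the fsync there has no size change to push out, which is the case I'm describing above. All of it still depends on fsync being honored by the layers underneath.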
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD