Greg Smith wrote:
> On Wed, 18 Mar 2009, Marco Colombo wrote:
>
>> If you fsync() after each write you want ordered, there can't be any
>> "subsequent I/O" (unless there are many different processes
>> concurrently writing to the file w/o synchronization).
>
> Inside PostgreSQL, each of the database backend processes ends up
> writing blocks to the database disk, if they need to allocate a new
> buffer and the one they are handed is dirty. You can easily have
> several of those writing to the same 1GB underlying file on disk. So
> that prerequisite is there. The main potential for a problem here would
> be if a stray unsynchronized write from one of those backends happened
> in a way that wasn't accounted for by the WAL+checkpoint design.

Wow, that would be quite a bug. That's why I wrote "w/o synchronization".
"stray" + "unaccounted" + "concurrent" smells like the recipe for an
explosive to me :)

> What I
> was suggesting is that the way that synchronization happens in the
> database provides some defense from running into problems in this area.

I hope it's "full defence". If you have two processes doing
write(); fsync(); on the same file at the same time, either there are no
ordering requirements, or it will go boom sooner or later... fsync() works
inside a single process, but any system call may put the process to sleep,
and who knows when it will be woken up and what other processes have done
to that file in the meantime. I'm pretty confident that the PG code
protects access to shared resources with synchronization primitives.

Anyway, I was referring to WAL writes... due to the nature of a log, it's
hard to think of many unordered writes and of concurrent access w/o
synchronization. Inside a critical region, though, there can be more than
one write, and you may need to enforce an order among them, but nothing
more than that before the final fsync(). If so, userland-originated
barriers instead of full fsync()s may help with performance. But I'm
speculating.

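To make the write(); fsync(); pattern above concrete, here is a minimal sketch, not taken from PostgreSQL. The names append_record and write_all and the "COMMIT" marker are hypothetical, and an advisory flock() stands in for whatever synchronization primitive a real application would use (Unix only). Each fsync() is the ordering point: the payload is flushed before the marker is even written.

```python
import fcntl
import os


def write_all(fd: int, data: bytes) -> None:
    """Loop until every byte is written; os.write() may write fewer."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def append_record(path: str, payload: bytes,
                  commit_marker: bytes = b"COMMIT\n") -> None:
    """Append payload, then a (hypothetical) commit marker, in that order."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # Critical region: without this lock, another process could
        # interleave its own write(); fsync(); between our two steps.
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            write_all(fd, payload)
            os.fsync(fd)  # ordering point: payload flushed before marker
            write_all(fd, commit_marker)
            os.fsync(fd)  # final flush: the whole record is on its way out
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

The first fsync() is there only to enforce ordering, which is exactly the spot where a cheaper userland barrier, if one existed, could replace it. Whether the data actually survives a power loss still depends on the disk honoring the flush.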
> The way backends handle writes themselves is also why your suggestion
> about the database being able to utilize barriers isn't really helpful.
> Those trickle out all the time, and normally you don't even have to care
> about ordering them. The only time you do need to care, at checkpoint
> time, only a hard line is really practical--all writes up to that point,
> period. Trying to implement ordered writes for everything that happened
> before then would complicate the code base, which isn't going to happen
> for such a platform+filesystem specific feature, one that really doesn't
> offer much acceleration from the database's perspective.

I don't know the internals of WAL writing, so I can't really reply to that.

>> only when the journal wraps around there's a (extremely) small window
>> of vulnerability. You need to write a careful crafted torture program
>> to get any chance to observe that... such program exists, and triggers
>> the problem
>
> Yeah, I've been following all that. The PostgreSQL WAL design works on
> ext2 filesystems with no journal at all. Some people even put their
> pg_xlog directory onto ext2 filesystems for best performance, relying on
> the WAL to be the journal. As long as fsync is honored correctly, the
> WAL writes should be re-writing already allocated space, which makes
> this category of journal mayhem not so much of a problem. But when I
> read about fsync doing unexpected things, that gets me more concerned.

Well, that's highly dependent on your expectations :) I don't expect an
fsync to trigger a journal commit if metadata hasn't changed. That's
obviously true for metadata-only journals (which is most of them, with the
notable exception of ext3 in data=journal mode).

Yet, if you're referring to this
http://article.gmane.org/gmane.linux.file-systems/21373

well, that seems to me to be the same usual thing/bug: fsync() allows disks
to lie when it comes to caching writes. Nothing new under the sun.

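The point about re-writing already allocated space can be sketched as follows. This is an illustration under stated assumptions, not PostgreSQL's actual code: preallocate_segment, overwrite_in_place and SEGMENT_SIZE are hypothetical names, and the segment size is kept small for demonstration. The file is zero-filled once up front, so later writes land in blocks that already exist and do not change the file size or allocate new blocks.

```python
import os

SEGMENT_SIZE = 64 * 1024  # hypothetical size, kept small for the example


def preallocate_segment(path: str, size: int = SEGMENT_SIZE) -> None:
    """Create a zero-filled file so later writes only touch existing blocks."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        zeros = bytes(8192)
        remaining = size
        while remaining > 0:
            written = os.write(fd, zeros[:min(len(zeros), remaining)])
            remaining -= written
        os.fsync(fd)  # flush the data and the new file size
    finally:
        os.close(fd)
    # Flush the directory too, so the new directory entry is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def overwrite_in_place(path: str, offset: int, record: bytes) -> None:
    """Overwrite part of a preallocated segment and flush it."""
    fd = os.open(path, os.O_WRONLY)
    try:
        if offset < 0 or offset + len(record) > os.fstat(fd).st_size:
            raise ValueError("record would extend past the preallocated area")
        done = 0
        while done < len(record):
            done += os.pwrite(fd, record[done:], offset + done)
        os.fsync(fd)  # file size and block allocation are unchanged here
    finally:
        os.close(fd)
```

The refusal to write past the preallocated area is the design choice that matters: it keeps every later fsync() free of size or allocation changes, which is what makes the journal less of a concern in the scenario quoted above.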
Barriers don't change much, because they don't replace a flush. They're
about consistency, not durability. So even with full barrier support, an
fsync implementation needs to end up in a disk cache flush to be fully
compliant with its own semantics.

.TM.