On Tue, Aug 15, 2006 at 03:02:56PM -0400, Michael Stone wrote:
> On Tue, Aug 15, 2006 at 02:33:27PM -0400, mark@xxxxxxxxxxxxxx wrote:
> >>>Are 'we' sure that such a setup can't lose any data?
> >>Yes. If you check the archives, you can even find the last time this was
> >>discussed...
> >I looked last night (coincidence actually) and didn't find proof that
> >you cannot lose data.
> You aren't going to find proof, any more than you'll find proof that you
> won't lose data if you do use a journalling fs. (Because there isn't
> any.) Unfortunately, many people misunderstand what a metadata
> journal does for you, and overstate its importance in this type of
> application.

Yes, many people do. :-)

> >How do you deal with the file system structure being updated before the
> >data blocks are (re-)written?
> *That's what the postgres log is for.* If the latest xlog entries don't
> make it to disk, they won't be replayed; if they didn't make it to
> disk, the transaction would not have been reported as committed. An
> application that understands filesystem semantics can guarantee data
> integrity without metadata journaling.

No. This is not true.

Updating the file system structure (inodes, indirect blocks) touches a
different part of the disk than the actual data. If the file system
structure is modified, say, to extend a file to allow it to contain more
data, but the data itself is not written, then upon a restore, with a
system such as ext2, or ext3 with writeback, or xfs, it is possible that
the end of the file, even the postgres log file, will contain a random
block of data from the disk. If this random block of data happens to look
like a valid xlog block, it may be played back, and the database corrupted.

If the file system is only used for xlog data, the chance that it looks
like a valid block increases, would it not?

> >>The bottom line is that the only reason you need a metadata journalling
> >>filesystem is to save the fsck time when you come up. On a little
> >>partition like xlog, that's not an issue.
> >fsck isn't only about time to fix. fsck is needed, because the file system
> >is broken.
> fsck is needed to reconcile the metadata with the on-disk allocations.
> To do that, it reads all the inodes and their corresponding directory
> entries. The time to do that is proportional to the size of the
> filesystem, hence the comment about time. fsck is not needed "because
> the filesystem is broken", it's needed because the filesystem is marked
> dirty.

This is also wrong. fsck is needed because the file system is broken.

It takes time because it doesn't have a journal to help it, so it must
look through the entire file system and guess what the problems are.
There are classes of problems, such as the one I describe above, for which
fsck *cannot* guess how to solve the problem. There is not enough
information available for it to deduce that anything is wrong at all.

The probability is low, for sure - but then, the chance of a file system
failure is already low.

Betting on ext2 + postgresql xlog has not been confirmed to me as reliable.

Telling me that journalling is misunderstood doesn't prove to me that you
understand it. I don't mean to be offensive, but I won't accept what you
say, as it does not fit my understanding of how file systems work. :-)

Cheers,
mark

-- 
mark@xxxxxxxxx / markm@xxxxxx / markm@xxxxxxxxxx     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/
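
To make the stale-block concern above concrete, here is a minimal sketch in Python of how a replay routine might decide whether a block at the end of a log file is playable. It is an illustration only and is NOT PostgreSQL's actual xlog record format: the header layout and the names make_record, crc_ok and replayable are assumptions invented for this example. It shows that a checksum alone rejects arbitrary garbage, but accepts a leftover block that was once a genuine log record (the case of a partition used only for xlog), unless the record also carries the log position it was written for.

```python
import struct
import zlib

# Hypothetical record header (an assumption, not PostgreSQL's layout):
# 8-byte log position, 4-byte payload length, 4-byte CRC32.
HEADER = struct.Struct("<QII")


def make_record(position: int, payload: bytes) -> bytes:
    """Build a record whose CRC covers position, length and payload."""
    crc = zlib.crc32(struct.pack("<QI", position, len(payload)) + payload)
    return HEADER.pack(position, len(payload), crc) + payload


def crc_ok(block: bytes) -> bool:
    """Checksum-only validation: does the block look like *some* record?"""
    if len(block) < HEADER.size:
        return False
    position, length, crc = HEADER.unpack_from(block)
    payload = block[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        return False
    return zlib.crc32(struct.pack("<QI", position, length) + payload) == crc


def replayable(block: bytes, expected_position: int) -> bool:
    """Stricter validation: valid checksum AND written for this position."""
    if not crc_ok(block):
        return False
    position, _, _ = HEADER.unpack_from(block)
    return position == expected_position


# Arbitrary leftover bytes fail the checksum test.
garbage = b"\x5a" * 64

# A block left over from an older log file passes the checksum test,
# because it really was a valid record once.
stale = make_record(4096, b"old transaction")
fresh = make_record(8192, b"new transaction")

print(crc_ok(garbage))           # checksum-only check on garbage
print(crc_ok(stale))             # checksum-only check on a stale record
print(replayable(stale, 8192))   # position-aware check on a stale record
print(replayable(fresh, 8192))   # position-aware check on the real record
```

Whether a given database's log format is strict enough to rule out every such stale block is exactly the question raised above; the sketch only shows why "looks like a valid block" depends on what the validity check covers.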