On Tue, Sep 08, 2009 at 09:00:50PM -0700, Curt Wohlgemuth wrote:
> >
> > In ext3 and ext4, metadata blocks (such as
> > extent tree blocks), aren't stored in the page cache.
>
> Hmm.  You're saying that in the absence of a journal, all metadata
> writes go direct to disk?  Where should I look for this in the code?

Sorry, let me be more precise.  All metadata writes, regardless of
whether a journal is present or not, are written via the buffer head
(bh) abstraction.  They have to be, because that's how we do our
journalling; the jbd/jbd2 layer is built on top of the bh I/O request
layer, and even when a journal is not present, we are still doing our
metadata I/O via the submit_bh and ll_rw_block interfaces.

It used to be the case (in Linux 2.4) that the buffer cache was stored
separately from the page cache.  In Linux 2.6, the buffer cache is
implemented on top of the page cache, so technically, the metadata
blocks are stored in the page cache; however, they are only *accessed*
via the buffer cache abstraction.

> The problem is that I've seen this in real life.  And the patch below
> seems to fix it.  (Unfortunately, I haven't been able to recreate this
> in a simple example, after several days work.  I've only seen this in
> a *very* small number of cases on heavily loaded machines.)

I believe that you have a problem.  The problem is you have a dirty bh
which is getting written out after the block gets reallocated for use
as a data block.  But a bforget() call should fix the problem just as
well.  In fact, I think the real fix should be this.

commit 1b58b00e02893b4bbab2b5f137316b82feadac52
Author: Theodore Ts'o <tytso@xxxxxxx>
Date:   Wed Sep 9 11:18:42 2009 -0400

    ext4: Use bforget() in no journal mode when in ext4_journal_forget()

    When ext4 is using a journal, a metadata block which is deallocated
    must be passed into the journal layer so it can be "revoked".  The
    jbd2_journal_forget() function is also responsible for calling
    bforget().
    Without a journal, ext4_journal_forget() must call bforget(), to
    avoid a race from a dirty metadata block getting written back after
    it has been reallocated and reused for another inode's data block.

    Signed-off-by: "Theodore Ts'o" <tytso@xxxxxxx>

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index eb27fd0..d4f4b39 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -44,7 +44,7 @@ int __ext4_journal_forget(const char *where, handle_t *handle,
 						  handle, err);
 	}
 	else
-		brelse(bh);
+		bforget(bh);
 	return err;
 }

						- Ted
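
For illustration only, the race described in the commit message can be
sketched in userspace C.  This is not kernel code and does not use the
real buffer_head API: toy_bh, toy_disk, toy_brelse(), toy_bforget(),
toy_writeback(), and toy_free_and_reuse() are all hypothetical names,
and the "new owner writes straight to disk" step is a deliberate
simplification.  The only point it tries to show is that dropping a
reference leaves the dirty flag set, while forgetting the buffer clears
it before writeback can run.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Toy stand-in for one on-disk block (hypothetical, not a kernel type). */
struct toy_disk {
	char block[BLOCK_SIZE];
};

/* Toy stand-in for a cached buffer shadowing that block. */
struct toy_bh {
	char data[BLOCK_SIZE];
	bool dirty;
	int refcount;
};

/* Modeled on brelse(): drop a reference, leave the dirty flag alone. */
static void toy_brelse(struct toy_bh *bh)
{
	assert(bh->refcount > 0);
	bh->refcount--;
}

/* Modeled on bforget(): discard the pending write, then drop the ref. */
static void toy_bforget(struct toy_bh *bh)
{
	bh->dirty = false;
	toy_brelse(bh);
}

/* Toy background writeback: flush the buffer only if it is still dirty. */
static void toy_writeback(struct toy_bh *bh, struct toy_disk *disk)
{
	if (bh->dirty) {
		memcpy(disk->block, bh->data, BLOCK_SIZE);
		bh->dirty = false;
	}
}

/*
 * Free a dirty metadata block, reuse it for file data, then let
 * writeback run.  Returns true if the new data ended up clobbered.
 */
static bool toy_free_and_reuse(bool use_bforget)
{
	static const char new_data[] = "new-file-data";
	struct toy_disk disk;
	struct toy_bh bh = { .dirty = true, .refcount = 1 };

	memset(&disk, 0, sizeof(disk));
	memcpy(bh.data, "old-metadata", sizeof("old-metadata"));

	/* The metadata block is deallocated. */
	if (use_bforget)
		toy_bforget(&bh);
	else
		toy_brelse(&bh);

	/* The block is reallocated and the new owner's data reaches disk. */
	memcpy(disk.block, new_data, sizeof(new_data));

	/* Later, writeback notices the stale buffer. */
	toy_writeback(&bh, &disk);

	return memcmp(disk.block, new_data, sizeof(new_data)) != 0;
}
```

In this sketch, toy_free_and_reuse(false) takes the brelse()-style path
and reports the new data as clobbered, while toy_free_and_reuse(true)
takes the bforget()-style path and leaves it intact.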