On Fri, Nov 13, 2009 at 03:46:09PM -0800, Curt Wohlgemuth wrote: > So when a directory is removed with "rm -rf foo" , as the files are deleted, > the directory block(s) are marked dirty. But when the directory blocks > themselves are freed up, bforget() isn't called for their bufferheads, and > so they remain dirty in the page cache, and can be written down later, after > their blocks have been reused. Well, it should be happening as part of the call to ext4_forget(). It looks like it's happening on the code paths that release the blocks. This is critically important if journalling is enabled, since we have to call jbd2_journal_revoke() on directory blocks before they can be reused as data blocks. Hmm, if the buffer was part of a transaction, we don't call __bforget() on it in jbd2_journal_forget(). So I can see how you might be seeing a problem with journalling enabled, but I'm puzzled why you were also seeing a problem without journalling. So to help debug what is going on, I tried adding the a new tracepoint to ext4_forget(). I've attached it to the end of this message. Using it, I can confirm that ext4_forget() is getting called for directories. I do see a problem though that we're not checking to see if the inode is a directory; in that case, we need to treat it as if it were metadata, and call ext4_journal_revoke() instead of ext4_journal_forget(). Otherwise we could have the problem that after a crash, on a journal replay, we might end up replaying the directory block after it has been reallocated and used as a data block. (Hmm.... I think this might be a problem for ext3 as well.) I am very puzzled why you are seeing a problem in no journal mode, though. It looks like the right thing should be happening. Is the 8MB data file is getting written via direct I/O or buffered I/O? - Ted diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 554c679..13de1dd 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -89,6 +89,7 @@ int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode, might_sleep(); + trace_ext4_forget(inode, is_metadata, blocknr); BUFFER_TRACE(bh, "enter"); jbd_debug(4, "forgetting bh %p: is_metadata = %d, mode %o, " diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index d09550b..b390e1f 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -907,6 +907,32 @@ TRACE_EVENT(ext4_mballoc_free, __entry->result_len, __entry->result_logical) ); +TRACE_EVENT(ext4_forget, + TP_PROTO(struct inode *inode, int is_metadata, __u64 block), + + TP_ARGS(inode, is_metadata, block), + + TP_STRUCT__entry( + __field( dev_t, dev ) + __field( ino_t, ino ) + __field( umode_t, mode ) + __field( int, is_metadata ) + __field( __u64, block ) + ), + + TP_fast_assign( + __entry->dev = inode->i_sb->s_dev; + __entry->ino = inode->i_ino; + __entry->mode = inode->i_mode; + __entry->is_metadata = is_metadata; + __entry->block = block; + ), + + TP_printk("dev %s ino %lu mode %d is_metadata %d block %llu", + jbd2_dev_to_name(__entry->dev), (unsigned long) __entry->ino, + __entry->mode, __entry->is_metadata, __entry->block) +); + #endif /* _TRACE_EXT4_H */ /* This part must be outside protection */ -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html