On Mon 25-07-11 17:58:03, Amir Goldstein wrote: > On Mon, Jul 25, 2011 at 5:26 PM, Jan Kara <jack@xxxxxxx> wrote: > > Hello Amir, > > > > On Sat 23-07-11 16:21:55, Amir Goldstein wrote: > >> On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@xxxxxxx> wrote: > >> > When journalling data for an inode (either because it is a symlink or > >> > because the filesystem is mounted in data=journal mode), > >> > ext4_evict_inode() can discard unwritten data by calling > >> > truncate_inode_pages(). This is because we don't mark the buffer / page > >> > dirty when journalling data but only add the buffer to the running > >> > transaction and thus mm does not know there are still unwritten data. > >> > > >> > Fix the problem by carefully tracking transaction containing inode's > >> > data, committing this transaction, and writing uncheckpointed buffers > >> > when inode should be reaped. > >> > > >> > Signed-off-by: Jan Kara <jack@xxxxxxx> --- fs/ext4/inode.c | 29 > >> > +++++++++++++++++++++++++++++ 1 files changed, 29 insertions(+), 0 > >> > deletions(-) > >> > > >> > This is ext4 version of an ext3 fix I sent a while ago. It received > >> > only light testing but I figured you might want get the patch earlier > >> > rather than later given the merge window is open. > >> > > >> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e3126c0..019995b > >> > 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -190,6 +190,33 @@ > >> > void ext4_evict_inode(struct inode *inode) > >> > > >> > trace_ext4_evict_inode(inode); if (inode->i_nlink) { + > >> > /* + * When journalling data dirty buffers > >> > are tracked only in the + * journal. So although mm > >> > thinks everything is clean and + * ready for reaping the > >> > inode might still have some pages to + * write in the > >> > running transaction or waiting to be + * checkpointed. > >> > Thus calling jbd2_journal_invalidatepage() + * (via > >> > truncate_inode_pages()) to discard these buffers can + * > >> > cause data loss. Also even if we did not discard these + > >> > * buffers, we would have no way to find them after the inode + > >> > * is reaped and thus user could see stale data if he tries to + > >> > * read them before the transaction is checkpointed. So > >> > be + * careful and force everything to disk here... We > >> > use + * ei->i_datasync_tid to store the newest > >> > transaction + * containing inode's data. + > >> > * + * Note that directories do not have this problem > >> > because they + * don't use page cache. + > >> > */ + if (ext4_should_journal_data(inode) && + > >> > (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) { + > >> > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; + > >> > tid_t commit_tid = EXT4_I(inode)->i_datasync_tid; + > >> > + jbd2_log_start_commit(journal, commit_tid); + > >> > jbd2_log_wait_commit(journal, commit_tid); + > >> > filemap_write_and_wait(&inode->i_data); + > >> > } truncate_inode_pages(&inode->i_data, 0); > >> > goto no_delete; } @@ -1863,6 +1890,7 @@ static int > >> > ext4_journalled_write_end(struct file *file, if (new_i_size > > >> > inode->i_size) i_size_write(inode, pos+copied); > >> > ext4_set_inode_state(inode, EXT4_STATE_JDATA); + > >> > EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid; if > >> > (new_i_size > EXT4_I(inode)->i_disksize) { > >> > ext4_update_i_disksize(inode, new_i_size); ret2 = > >> > ext4_mark_inode_dirty(handle, inode); @@ -2571,6 +2599,7 @@ static int > >> > __ext4_journalled_writepage(struct page *page, > >> > write_end_fn); if (ret == 0) ret = err; + > >> > EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid; > >> > err = ext4_journal_stop(handle); if (!ret) ret > >> > = err; -- 1.7.1 > >> > > >> Patch looks correct to me, but I am uncomfortable with i_datasync_tid > >> being treated differently in journalled write - that is, being set on > >> different places in the write paths. > >> > >> How about setting i_datasync_tid in a more generic place like > >> ext4_{,da_}write_begin()? I know it's a bit redundant to setting dirty > >> pages, but at least this way i_datasync_tid can be checked in all journal > >> modes and have a consistent meaning. > > Well, I kept the meaning that i_datasync_tid is ID of a transaction that > > must be committed for a data of an inode to be safely on disk. It is true > > that in data=journal mode, we need to update this number differently than > > in other journaling modes but that's not important I think. Currently, we > > just force commit in data=journal mode in every case and thus we do not > > really care about the value of i_datasync_tid for fsync. In future we could > > be more clever and avoid transaction commits for fsync in data=journal mode > > in some cases. So in fact I'd say the code is now *more* consistent than > > it used to be. The only thing that isn't quite consistent is that I didn't > > bother with updating i_sync_tid because we currently do not use it. If > > people want, that might be a useful cleanup which I can do. > > > >> Perhaps we can even use i_datasync_tid to optimize away things like > >> fiemap checks for dirty pages. > > Umm, I'm not sure which checks do you mean... > > I thought that ext4_ext_fiemap_cb() looks for dirty pages to display as delayed > allocation extents and that this lookup can be avoided if we know that the inode > data is not dirty, but I could have been wrong. No, that won't really work. We don't want to update i_datasync_tid for delayed allocation write. We don't really have a valid tid to store in that call path. BTW, you made me look at the checks in ext4_ext_fiemap_cb() and that code is just broken. It can oops in a number of ways when it races with page reclaim. I'll write to Yongqiang who seems to an author of this... Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html