On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@xxxxxxx> wrote:
> When journalling data for an inode (either because it is a symlink or
> because the filesystem is mounted in data=journal mode), ext4_evict_inode()
> can discard unwritten data by calling truncate_inode_pages(). This is
> because we don't mark the buffer / page dirty when journalling data but only
> add the buffer to the running transaction and thus mm does not know there
> are still unwritten data.
>
> Fix the problem by carefully tracking transaction containing inode's data,
> committing this transaction, and writing uncheckpointed buffers when inode
> should be reaped.
>
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/ext4/inode.c |   29 +++++++++++++++++++++++++++++
>  1 files changed, 29 insertions(+), 0 deletions(-)
>
> This is ext4 version of an ext3 fix I sent a while ago. It received only
> light testing but I figured you might want get the patch earlier rather than
> later given the merge window is open.
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e3126c0..019995b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -190,6 +190,33 @@ void ext4_evict_inode(struct inode *inode)
>
>  	trace_ext4_evict_inode(inode);
>  	if (inode->i_nlink) {
> +		/*
> +		 * When journalling data dirty buffers are tracked only in the
> +		 * journal. So although mm thinks everything is clean and
> +		 * ready for reaping the inode might still have some pages to
> +		 * write in the running transaction or waiting to be
> +		 * checkpointed. Thus calling jbd2_journal_invalidatepage()
> +		 * (via truncate_inode_pages()) to discard these buffers can
> +		 * cause data loss. Also even if we did not discard these
> +		 * buffers, we would have no way to find them after the inode
> +		 * is reaped and thus user could see stale data if he tries to
> +		 * read them before the transaction is checkpointed. So be
> +		 * careful and force everything to disk here... We use
> +		 * ei->i_datasync_tid to store the newest transaction
> +		 * containing inode's data.
> +		 *
> +		 * Note that directories do not have this problem because they
> +		 * don't use page cache.
> +		 */
> +		if (ext4_should_journal_data(inode) &&
> +		    (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) {
> +			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> +			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
> +
> +			jbd2_log_start_commit(journal, commit_tid);
> +			jbd2_log_wait_commit(journal, commit_tid);
> +			filemap_write_and_wait(&inode->i_data);
> +		}
>  		truncate_inode_pages(&inode->i_data, 0);
>  		goto no_delete;
>  	}
> @@ -1863,6 +1890,7 @@ static int ext4_journalled_write_end(struct file *file,
>  	if (new_i_size > inode->i_size)
>  		i_size_write(inode, pos+copied);
>  	ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> +	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>  	if (new_i_size > EXT4_I(inode)->i_disksize) {
>  		ext4_update_i_disksize(inode, new_i_size);
>  		ret2 = ext4_mark_inode_dirty(handle, inode);
> @@ -2571,6 +2599,7 @@ static int __ext4_journalled_writepage(struct page *page,
>  				write_end_fn);
>  	if (ret == 0)
>  		ret = err;
> +	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>  	err = ext4_journal_stop(handle);
>  	if (!ret)
>  		ret = err;
> --
> 1.7.1
>

Hi Jan,

Patch looks correct to me, but I am uncomfortable with i_datasync_tid
being treated differently in journalled write - that is, being set in
different places in the write paths.

How about setting i_datasync_tid in a more generic place like
ext4_{,da_}write_begin()?
I know it's a bit redundant with marking pages dirty, but at least this way
i_datasync_tid can be checked in all journal modes and have a consistent
meaning.

Perhaps we can even use i_datasync_tid to optimize away things like fiemap
checks for dirty pages.

Just a thought.

Amir.
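
To make the suggestion above a bit more concrete, here is a rough userspace
sketch of the idea, not kernel code and not tested against ext4. Every name
prefixed with model_ is hypothetical and exists only for this illustration;
the real journal, handle and inode structures are far richer. The sketch
assumes a single common write hook records the handle's transaction ID, so
that the eviction path can compare it against the newest committed
transaction no matter which write path was taken.

```c
/* Userspace model only: none of these types or helpers are kernel APIs. */
#include <assert.h>
#include <stdbool.h>

typedef unsigned int tid_t;

struct model_journal {
	tid_t running_tid;	/* transaction currently accepting updates */
	tid_t committed_tid;	/* newest transaction that has been committed */
};

struct model_handle {
	tid_t tid;		/* transaction this handle joined */
};

struct model_inode {
	bool wrote_data;	/* set once any write went through the hook */
	tid_t i_datasync_tid;	/* newest transaction holding this inode's data */
};

/* Wraparound-tolerant "x is newer than y" comparison on transaction IDs. */
static bool model_tid_gt(tid_t x, tid_t y)
{
	return (int)(x - y) > 0;
}

/* Join the running transaction (hypothetical stand-in for starting a handle). */
static struct model_handle model_journal_start(const struct model_journal *j)
{
	struct model_handle h = { .tid = j->running_tid };
	return h;
}

/*
 * Hypothetical common hook, standing in for a generic write_begin-style
 * helper: every write path records the transaction ID in one place.
 */
static void model_write_begin(struct model_inode *inode,
			      const struct model_handle *h)
{
	inode->wrote_data = true;
	inode->i_datasync_tid = h->tid;
}

/* Commit everything up to and including tid, then open a new transaction. */
static void model_commit(struct model_journal *j, tid_t tid)
{
	assert(!model_tid_gt(tid, j->running_tid));
	if (model_tid_gt(tid, j->committed_tid))
		j->committed_tid = tid;
	if (!model_tid_gt(j->running_tid, tid))
		j->running_tid = tid + 1;
}

/* True if the inode's newest data transaction has not been committed yet. */
static bool model_needs_commit(const struct model_journal *j,
			       const struct model_inode *inode)
{
	return inode->wrote_data &&
	       model_tid_gt(inode->i_datasync_tid, j->committed_tid);
}

/* Eviction path: force a commit only when needed; report whether we did. */
static bool model_evict(struct model_journal *j, const struct model_inode *inode)
{
	if (!model_needs_commit(j, inode))
		return false;
	model_commit(j, inode->i_datasync_tid);
	return true;
}
```

With a single recording point, model_evict() never has to ask which write
path produced the data, which is the "consistent meaning" argued for above.
Whether the real write_begin paths have a suitable handle at that point is
left open here.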