Re: [PATCH] ext4: Fix data corruption in inodes with journalled data

Jan Kara <jack@xxxxxxx> · Mon, 25 Jul 2011 17:47:36 +0200

On Mon 25-07-11 17:58:03, Amir Goldstein wrote:
> On Mon, Jul 25, 2011 at 5:26 PM, Jan Kara <jack@xxxxxxx> wrote:
> >  Hello Amir,
> >
> > On Sat 23-07-11 16:21:55, Amir Goldstein wrote:
> >> On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@xxxxxxx> wrote:
> >> > When journalling data for an inode (either because it is a symlink or
> >> > because the filesystem is mounted in data=journal mode),
> >> > ext4_evict_inode() can discard unwritten data by calling
> >> > truncate_inode_pages(). This is because we don't mark the buffer / page
> >> > dirty when journalling data but only add the buffer to the running
> >> > transaction and thus mm does not know there is still unwritten data.
> >> >
> >> > Fix the problem by carefully tracking the transaction containing the
> >> > inode's data, committing this transaction, and writing uncheckpointed
> >> > buffers when the inode should be reaped.
> >> >
> >> > Signed-off-by: Jan Kara <jack@xxxxxxx>
> >> > ---
> >> >  fs/ext4/inode.c |   29 +++++++++++++++++++++++++++++
> >> >  1 files changed, 29 insertions(+), 0 deletions(-)
> >> >
> >> >  This is the ext4 version of an ext3 fix I sent a while ago. It has
> >> > received only light testing, but I figured you might want to get the
> >> > patch earlier rather than later given the merge window is open.
> >> >
> >> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> > index e3126c0..019995b 100644
> >> > --- a/fs/ext4/inode.c
> >> > +++ b/fs/ext4/inode.c
> >> > @@ -190,6 +190,33 @@ void ext4_evict_inode(struct inode *inode)
> >> >
> >> >  	trace_ext4_evict_inode(inode);
> >> >  	if (inode->i_nlink) {
> >> > +		/*
> >> > +		 * When journalling data dirty buffers are tracked only in the
> >> > +		 * journal. So although mm thinks everything is clean and
> >> > +		 * ready for reaping the inode might still have some pages to
> >> > +		 * write in the running transaction or waiting to be
> >> > +		 * checkpointed. Thus calling jbd2_journal_invalidatepage()
> >> > +		 * (via truncate_inode_pages()) to discard these buffers can
> >> > +		 * cause data loss. Also even if we did not discard these
> >> > +		 * buffers, we would have no way to find them after the inode
> >> > +		 * is reaped and thus user could see stale data if he tries to
> >> > +		 * read them before the transaction is checkpointed. So be
> >> > +		 * careful and force everything to disk here... We use
> >> > +		 * ei->i_datasync_tid to store the newest transaction
> >> > +		 * containing inode's data.
> >> > +		 *
> >> > +		 * Note that directories do not have this problem because they
> >> > +		 * don't use page cache.
> >> > +		 */
> >> > +		if (ext4_should_journal_data(inode) &&
> >> > +		    (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) {
> >> > +			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> >> > +			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
> >> > +
> >> > +			jbd2_log_start_commit(journal, commit_tid);
> >> > +			jbd2_log_wait_commit(journal, commit_tid);
> >> > +			filemap_write_and_wait(&inode->i_data);
> >> > +		}
> >> >  		truncate_inode_pages(&inode->i_data, 0);
> >> >  		goto no_delete;
> >> >  	}
> >> > @@ -1863,6 +1890,7 @@ static int ext4_journalled_write_end(struct file *file,
> >> >  	if (new_i_size > inode->i_size)
> >> >  		i_size_write(inode, pos+copied);
> >> >  	ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> >> > +	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> >> >  	if (new_i_size > EXT4_I(inode)->i_disksize) {
> >> >  		ext4_update_i_disksize(inode, new_i_size);
> >> >  		ret2 = ext4_mark_inode_dirty(handle, inode);
> >> > @@ -2571,6 +2599,7 @@ static int __ext4_journalled_writepage(struct page *page,
> >> >  				write_end_fn);
> >> >  	if (ret == 0)
> >> >  		ret = err;
> >> > +	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> >> >  	err = ext4_journal_stop(handle);
> >> >  	if (!ret)
> >> >  		ret = err;
> >> > --
> >> > 1.7.1
> >> >
> >> Patch looks correct to me, but I am uncomfortable with i_datasync_tid
> >> being treated differently in journalled write - that is, being set in
> >> different places in the write paths.
> >>
> >> How about setting i_datasync_tid in a more generic place like
> >> ext4_{,da_}write_begin()?  I know it's a bit redundant to setting dirty
> >> pages, but at least this way i_datasync_tid can be checked in all journal
> >> modes and have a consistent meaning.
> >  Well, I kept the meaning that i_datasync_tid is the ID of the transaction
> > that must be committed for the data of an inode to be safely on disk. It is
> > true that in data=journal mode we need to update this number differently
> > than in other journaling modes, but I don't think that's important.
> > Currently, we just force a commit in data=journal mode in every case and
> > thus we do not really care about the value of i_datasync_tid for fsync. In
> > the future we could be more clever and avoid transaction commits for fsync
> > in data=journal mode in some cases.  So in fact I'd say the code is now
> > *more* consistent than it used to be.  The only thing that isn't quite
> > consistent is that I didn't bother with updating i_sync_tid because we
> > currently do not use it. If people want, that might be a useful cleanup
> > which I can do.
> >
> >> Perhaps we can even use i_datasync_tid to optimize away things like
> >> fiemap checks for dirty pages.
> >  Umm, I'm not sure which checks you mean...
> 
> I thought that ext4_ext_fiemap_cb() looks for dirty pages to display as delayed
> allocation extents and that this lookup can be avoided if we know that the inode
> data is not dirty, but I could have been wrong.
  No, that won't really work. We don't want to update i_datasync_tid for
a delayed allocation write. We don't really have a valid tid to store in
that call path.
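
To make the role of the tid concrete, here is a toy user-space model of
what the patch does at eviction time. It is only a sketch and not kernel
code: every toy_* name is made up for illustration, tid wraparound is
ignored, and the checkpointing step (filemap_write_and_wait() in the patch)
is not modelled at all. The point is just that a journalled write records
the tid of the transaction holding the data, and eviction must commit up to
that tid before the pages are dropped.

```c
/* Toy user-space model; hypothetical names, not the kernel implementation. */
#include <assert.h>
#include <stdbool.h>

typedef unsigned int toy_tid_t;

struct toy_journal {
	toy_tid_t running_tid;   /* transaction currently accepting updates */
	toy_tid_t committed_tid; /* newest transaction that has committed */
};

struct toy_inode {
	bool journals_data;      /* stands in for ext4_should_journal_data() */
	toy_tid_t datasync_tid;  /* stands in for i_datasync_tid */
	int cached_pages;        /* pages still sitting in the page cache */
	int pages_lost;          /* pages dropped before their commit */
};

/*
 * Journalled write: the data joins the running transaction and the page is
 * not marked dirty, so the only record of pending data is the stored tid.
 */
static void toy_journalled_write(struct toy_journal *j, struct toy_inode *ino)
{
	ino->cached_pages++;
	ino->datasync_tid = j->running_tid;
}

/* Models starting a commit of 'tid' and waiting for it (no wraparound). */
static void toy_commit_up_to(struct toy_journal *j, toy_tid_t tid)
{
	if (j->committed_tid < tid)
		j->committed_tid = tid;
	/* A committed transaction can no longer be the running one. */
	if (j->running_tid <= j->committed_tid)
		j->running_tid = j->committed_tid + 1;
}

/*
 * Eviction. With force_commit == false this models the old behaviour, where
 * the pages are truncated no matter what state the journal is in.
 */
static void toy_evict(struct toy_journal *j, struct toy_inode *ino,
		      bool force_commit)
{
	if (force_commit && ino->journals_data)
		toy_commit_up_to(j, ino->datasync_tid);

	/* Data whose transaction has not committed is lost on truncate. */
	if (ino->cached_pages > 0 && ino->datasync_tid > j->committed_tid)
		ino->pages_lost += ino->cached_pages;
	ino->cached_pages = 0;
}
```

Starting from a journal with running_tid = 1 and committed_tid = 0, one
toy_journalled_write() followed by toy_evict() with force_commit set to
false counts a lost page, while the same sequence with force_commit set to
true does not. In this model a delayed allocation write would simply not
touch datasync_tid.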

BTW, you made me look at the checks in ext4_ext_fiemap_cb() and that code
is just broken. It can oops in a number of ways when it races with page
reclaim. I'll write to Yongqiang who seems to be an author of this...

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR