Re: [External] Re: [PATCH 1/2] ext4: use ext4_ext_remove_space() for fast commit replay delete range

Ritesh Harjani <riteshh@xxxxxxxxxxxxx> · Fri, 4 Feb 2022 02:44:16 +0530

On 22/02/02 09:40PM, Xin Yin wrote:
> On Wed, Feb 2, 2022 at 4:34 AM Ritesh Harjani <riteshh@xxxxxxxxxxxxx> wrote:
> >
> > Hello Xin,
> >
> > Sorry about revisiting this thread so late :(
> > Recently when I was working on one of the fast_commit issue, I got interested
> > in looking into some of those recent fast_commit fixes.
> >
> > Hence some of these queries.
> >
> > On 21/12/23 11:23AM, Xin Yin wrote:
> > > For now ,we use ext4_punch_hole() during fast commit replay delete range
> > > procedure. But it will be affected by inode->i_size, which may not
> > > correct during fast commit replay procedure. The following test will
> > > failed.
> > >
> > > -create & write foo (len 1000K)
> > > -falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
> > > -create & fsync bar
> > ^^^^ do you mean "fsync foo" or is this actually a new file create and fsync
> > bar?
> bar is a new created file, it is the brother file of foo , it would be
> like this.
> ./foo ./bar
>
> >
> >
> > > -falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
> > > -fsync foo
> > > -crash before a full commit
> > >
> > > After the fast_commit reply procedure, the range 400K-500K will not be
> > > removed. Because in this case, when calling ext4_punch_hole() the
> > > inode->i_size is 0, and it just retruns with doing nothing.
> >
> > I tried looking into this, but I am not able to put my head around that when
> > will the inode->i_size will be 0?
> >
> > So, what I think should happen is when you are doing falocate/fsync foo in your
> > above list of operations then, anyways the inode i_disksize will be updated
> > using ext4_mark_inode_dirty() and during replay phase inode->i_size will hold
> > the right value no?
> yes, the inode->i_size hold the right value and ext4_fc_replay_inode()
> will update inode to the final state, but during replay phase
> ext4_fc_replay_inode() usually is the last step,  so before this the
> inode->i_size may not correct.
>
> >
> > Could you please help understand when, where and how will inode->i_size will be
> > 0?
> I didn't check why inode->i_size is 0, in this case. I just think

Ok, so I now know why the inode->i_size is 0 during replay phase (for file foo).
This is because inode->i_disksize is not really updated until after the
ext4_writepages() kicks in, which in this case, won't happen (for file foo)
when we are doing fsync on file bar. And hence fsync on file bar won't also
not ensure the delalloc blocks for file foo get's written out.

In fact this above information was something that I was assuming it all wrong.
Earlier I was of the opinion that fast_commit still pushes _all_ the dirty
pagecache data of other files to disk too (which is incorrect) and the only
performance gains happens via less writes to disk (since we write less metadata
on disk).

But I think what really happens is -
In case of fast_commit when fsync is called on any file (say bar), apart from that
file's (bar) dirty data, it only writes the necessary required metadata information
of the blocks of others files (in this case file foo) which are already allocated.
(which in this case was due to fzero operation).
It does not actually allocate the delalloc blocks due to buffered writes of any
other file (other than for file on which fsync is called).

This happens in
ext4_fc_perform_commit() -> ext4_fc_submit_inode_data_all() ->
jbd2_submit_inode_data -> jbd2_journal_submit_inode_data_buffers() ->
generic_writepages() -> using writepage() which won't do block allocation for
delalloc blocks.

So that above is what should give the major performance boost with fast_commit
in case of multiple file writes doing fsync. :)

@Jan/Harshad - could you please confirm if above is correct?

> inode->i_size should not affect the behavior of the replay phase.
> Another case is inode->i_size may not include unwritten blocks , and
> if a file has unwritten blocks at bottom, we can not use
> ext4_punch_hole() to remove the unwritten blocks beyond i_size during
> the replay phase.

Right. So then yes, we should not depend on inode->i_size during replay phase,
since it might have an entry in fast_commit area which is still only partially
correct (or in some transient state w.r.t i_disksize).

>
> >
> > Also - it would be helpful if you have some easy reproducer of this issue you
> > mentioned.
> The attached test code can reproduce this issue, hope it helps.

Thanks, yes it did help.

-ritesh

>
>
> >
> > -ritesh
> >
> > >
> > > Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
> > > to remove blocks of inode directly.
> > >
> > > Signed-off-by: Xin Yin <yinxin.x@xxxxxxxxxxxxx>
> > > ---
> > >  fs/ext4/fast_commit.c | 13 ++++++++-----
> > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > > index aa05b23f9c14..3deb97b22ca4 100644
> > > --- a/fs/ext4/fast_commit.c
> > > +++ b/fs/ext4/fast_commit.c
> > > @@ -1708,11 +1708,14 @@ ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl,
> > >               }
> > >       }
> > >
> > > -     ret = ext4_punch_hole(inode,
> > > -             le32_to_cpu(lrange.fc_lblk) << sb->s_blocksize_bits,
> > > -             le32_to_cpu(lrange.fc_len) <<  sb->s_blocksize_bits);
> > > -     if (ret)
> > > -             jbd_debug(1, "ext4_punch_hole returned %d", ret);
> > > +     down_write(&EXT4_I(inode)->i_data_sem);
> > > +     ret = ext4_ext_remove_space(inode, lrange.fc_lblk,
> > > +                             lrange.fc_lblk + lrange.fc_len - 1);
> > > +     up_write(&EXT4_I(inode)->i_data_sem);
> > > +     if (ret) {
> > > +             iput(inode);
> > > +             return 0;
> > > +     }
> > >       ext4_ext_replay_shrink_inode(inode,
> > >               i_size_read(inode) >> sb->s_blocksize_bits);
> > >       ext4_mark_inode_dirty(NULL, inode);
> > > --
> > > 2.20.1
> > >