Re: [RFC PATCH] jbd2: fix a potential assertion failure due to improperly dirtied buffer

Stephen Zhang <starzhangzsd@xxxxxxxxx> · Thu, 8 Aug 2024 11:05:26 +0800

Jan Kara <jack@xxxxxxx> 于2024年8月7日周三 20:07写道：
>
> On Wed 07-08-24 16:10:50, Stephen Zhang wrote:
> > Jan Kara <jack@xxxxxxx> 于2024年8月6日周二 21:40写道：
> > > On Sat 20-07-24 14:23:56, zhangshida wrote:
> > > > From: Shida Zhang <zhangshida@xxxxxxxxxx>
> > > >
> > > > On an old kernel version(4.19, ext3, journal=data, pagesize=64k),
> > > > an assertion failure will occasionally be triggered by the line below:
> > >
> > > OK, just out of curiosity, why are you using data=journal mode? It doesn't
> > > really get that much testing and the performance is quite bad...
> > >
> >
> > It is used by one of our customers. It's more like a historical issue:
> > About 12 years ago, they used data=journal mode for the benefit of user
> > data consistency brought by the mode.
> > Time goes by, they attempted to change, say, maybe change it to ext4
> > at least, but found it is no more stable than it was under ext3...
> > And yeah, they decided to just leave the thing as it was and keep the system
> > under that state until now...
>
> I see, thanks for sharing. I was asking because we are mostly trying to
> steer away people from using data=journal mode and deprecate it because it
> adds a lot of complexity into the code without significant benefit.
>

Yeah. Though I am not an experienced file system developer, I was thinking
the philosophy behind the data=journal design sometimes.

In essence, unlike the metadata, the user data could be dirtied in an unexpected
and uncontrollable way.

For example, calls like __block_write_begin() is a function that could dirty
the user data, but __block_write_begin() is beyond ext4 maintainer's control.
We cannot tell the block layer maintainer, ‘Hey, we want to trace the dirty user
data in ext4, can we add some special code for ext4 in __block_write_begin?’:P

Whilst for metadata, each time the dirting of each piece of metadata is managed
by ext4's internal code logic.

The uncontrollable dirting of user data is the root cause of all problems.

> > > > jbd2_journal_commit_transaction
> > > > {
> > > > ...
> > > > J_ASSERT_BH(bh, !buffer_dirty(bh));
> > > > /*
> > > > * The buffer on BJ_Forget list and not jbddirty means
> > > > ...
> > > > }
> > > >
> > > > AFAIC, that's how the problem works:
> > > > --------
> > > > journal_unmap_buffer
> > > > jbd2_journal_invalidatepage
> > > > __ext4_journalled_invalidatepage
> > > > ext4_journalled_invalidatepage
> > > > do_invalidatepage
> > > > truncate_inode_pages_range
> > > > truncate_inode_pages
> > > > truncate_pagecache
> > > > ext4_setattr
> > > > --------
> > > >
> > > > First try to truncate and invalidate the page.
> > > > Sometimes the buffer and the page won't be freed immediately.
> > > > the buffer will be sent to the BJ_Forget list of the currently
> > > > committing transaction. Maybe the transaction knows when and how
> > > > to free the buffer better.
> > > > The buffer's states now: !jbd_dirty !mapped !dirty
> > > >
> > > > Then jbd2_journal_commit_transaction(）will try to iterate over the
> > > > BJ_Forget list:
> > > > --------
> > > > jbd2_journal_commit_transaction()
> > > >       while (commit_transaction->t_forget) {
> > > >       ...
> > > >       }
> > > > --------
> > > >
> > > > At this exact moment, another write comes:
> > > > --------
> > > > mark_buffer_dirty
> > > > __block_write_begin_int
> > > > __block_write_begin
> > > > ext4_write_begin
> > > > --------
> > > > it sees a unmapped new buffer, and marks it as dirty.
> > >
> > > This should not happen. When ext4_setattr() truncates the file, we do not
> > > allow reallocating these blocks for other purposes until the transaction
> >
> > ext4_setattr() will try to free it by adding it to the BJ_Forget list
> > for further processing.
> > Put it more clearly,
> > when ext4_setattr() truncates the file, the buffer is not fully freed
> > yet. It's half-freed.
> > Furthermore,
> > Because the buffer is half-freed, the reallocating thing won't need to happen.
> > Now,
> > under that scenario, can we redirty the half-freed buffer on the BJ_Forget list?
> > The answer may be 'yes'.
> >
> > redirty it by the following code:
> > ext4_block_write_begin
> >     if (!buffer_mapped(bh)) { // check 1
> >          _ext4_get_block(inode, block, bh, 1);
> >         (buffer_new(bh)) { // check 2
> >              if (folio_test_uptodate(folio)) { // check 3
> >                  mark_buffer_dirty(bh);
>
> <snip>
>
> I see, right. It is not that the block would get reused. It is just that
> the buffer_head on the file's tail page gets reused and this causes issues.
> In fact, the problem is with ext4_block_write_begin() (and
> __block_write_begin_int()) that they call mark_buffer_dirty() on a
> journalled buffer before calling jbd2_journal_get_write_access() (which
> would remove the buffer from BJ_Forget list). This is what ultimately
> confuses the commit code.
>
> > For another proof, there is indeed a small window where the buffer could be
> > seen dirty.
> > Have a look at the code and comment in do_journal_get_write_access:
> > ----------------
> > int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> > struct buffer_head *bh)
> > {
> > ...
> > /*
> > * __block_write_begin() could have dirtied some buffers. Clean
> > * the dirty bit as jbd2_journal_get_write_access() could complain
> > * otherwise about fs integrity issues. Setting of the dirty bit
> > * by __block_write_begin() isn't a real problem here as we clear
> > * the bit before releasing a page lock and thus writeback cannot
> > * ever write the buffer.
> > */
> > if (dirty)
> > clear_buffer_dirty(bh); // clear the dirty immdiately in case some bad
> > things happen
>
> OK, it was even me adding that comment 14 years ago ;) I already forgot
> about this nuance.
>

That's quite a long time. Jan, you are a great developer. Cheers for you.:P

> So I agree with your analysis now. But still don't like adding hacks to
> jbd2 to acommodate for this oddity of data=journal mode. Since we already
> have ext4_block_write_begin() implementation anyway, we should be able to
> tweak it to do the right thing for data=journal mode inodes...
>
> So we could replace uses of __block_write_begin() with
> ext4_block_write_begin() and then call do_journal_get_write_access() in
> ext4_block_write_begin() for inodes with journalled data after the buffer
> is mapped with get_block().
>
> From the part:
>                                 if (folio_test_uptodate(folio)) {
>                                         clear_buffer_new(bh);
>                                         set_buffer_uptodate(bh);
>                                         mark_buffer_dirty(bh);
>                                         continue;
>                                 }
>
> we can actually remove the clear_buffer_new() and mark_buffer_dirty() bits
> because they will be done by block_commit_write() or
> folio_zero_new_buffers() and they are superfluous and somewhat odd here
> anyway.
>
> And the call to folio_zero_new_buffers() from ext4_block_write_begin()
> needs to call ext4_journalled_zero_new_buffers() for inodes where data is
> journalled.
>
> Will you try to implement this or should I look into it?
>

Yeah, Thank you for giving me the opportunity to work on something truly
meaningful. All I can do until now is some small cleanups. And doing cleanups
all the time is annoyable to the maintainers and frustrating to me. I
will try my best.

So basically, we should:
1.Trace the user data dirting in ext4_block_write_begin().
2.Replace the uncontrollable __block_write_begin with ext4_block_write_begin().
3.Remove some superfluous things.

Cheers,
Stephen.

>                                                                 Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR