On 2021/4/9 9:47, Theodore Ts'o wrote:
On Wed, Apr 07, 2021 at 09:41:57AM +0800, yebin wrote:
If call ext4_ext_insert_extent failed but new extent already inserted, we just
update "ex->ee_len = orig_ex.ee_len", this will lead to extent overlap, then
cause bug on when cache extent.
How did this happen in the first place? It sounds like if the extent
was already inserted, that would be casue there was an on-disk file
system corruption, no?
In that case, shouldn't we call ext4_error() to declare the file
system has an inconsistency, so it can be fixed by fsck?
We inject IO fault when runing fsstress, JBD detect IO error then trigger
JBD abort. At the same time,
if ext4_ext_insert_extent already insert new exntent then call
ext4_ext_dirty to dirty metadata , but
JBD already aborted , ext4_ext_dirty will return error.
In ext4_ext_dirty function call ext4_ext_check_inode check extent if ok, if
not, trigger BUG_ON and
also print extent detail information.
In this particular case, skipping the "ex->ee_len = orig_ex.ee_len"
may avoid the BUG_ON. But it's not clear that this is always the
right thing to do. The fundamental question is what should we do we
run into an error while we are in the middle of making changes to
on-disk and in-memory data structures?
In the ideal world, we should undo the changes that we were in the
middle of making before we return an error. That way, the semantics
are very clear; on success, the function has made the requested change
to the file system. If the function returns an error, then no changes
should be made.
That was the reasoning behind resetting ex->ee_len to orig_ex.ee_len
in the fix_extent_len inside ext4_split_extent_at(). Unofrtunately,
ext4_ext_insert_extent() does *not* always follow this convention, and
that's because it would be extremely difficult for it to do so --- the
mutations that it makes can be quite complex, including potentially
increasing the height of the extent tree.
However, I don't think your fix is by any means the ideal one, because
the much more common way that ext4_ext_insert_extent() is when it
needs to insert a new leaf node, or need to increase the height of the
extent tree --- and in it returns an ENOSPC failure. In that case, it
won't have made any changes changes in the extent tree, and so having
ext4_split_extent_at() undo the change to ex->ee_len is the right
thing to do.
Having blocks get leaked when there is an ENOSPC failure, requiring
fsck to be run --- and without giving the user any warning that this
has happened is *not* a good way to fail. So I don't think the
proposed patch is the right way to go.
A better way to go would be to teach ext4_ext_insert_extent() so if
there is a late failure, that it unwinds the leaf node back to its
original state (at least from a semantic value). Since the extent
leaf node could have been split, and/or adjacent extent entries may
have been merged, what it would need to do is to remember the starting
block number and length, and make whatever changes are necessaries to
the extent entries in that leaf node corresponding to that starting
block number and length.
If you don't want to do that, then a "do no harm" fix would be
something like this:
...
} else if (err == -EROFS) {
return err;
} else if (err)
goto fix_extent_len;
Thanks for your advice. I will send v2 patch.
So in the journal abort case, when err is set to EROFS, we don't try
to reset the length, since in theory the file system is read-only
already anyway. However, in the ENOSPC case, we won't end up silently
leaking blocks that will be lost until the user somehow decides to run
fsck.
There are still times when this doesn't get things completely right
(e.g., what if we get a late ENOMEM error versus an early ENOMEM
failure), where the only real fix is to make ext4_ext_insert_extent()
obey the convention that if it returns an error, it must not result in
any user-visible state change.
Cheers,
- Ted
.