Re: [PATCH] xfs: fix incorrect usage of xfs_btree_check_block

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 17 Nov 2022 11:07:22 -0800

On Thu, Nov 17, 2022 at 10:13:46AM -0600, Eric Sandeen wrote:
> On 11/7/22 7:50 PM, Guo Xuenan wrote:
> > On 2022/11/8 0:58, Darrick J. Wong wrote:
> >> On Thu, Nov 03, 2022 at 07:37:09PM +0800, Guo Xuenan wrote:
> >>> xfs_btree_check_block contains a tag XFS_ERRTAG_BTREE_CHECK_{L,S}BLOCK,
> >>> it is a fault injection tag, better not use it in the macro ASSERT.
> >>>
> >>> Since with XFS_DEBUG setting up, we can always trigger assert by `echo 1
> >>>> /sys/fs/xfs/${disk}/errortag/btree_chk_{s,l}blk`.
> >>> It's confusing and strange.
> >> Please be more specific about how this is confusing or strange.
> > I meant in current code, the ASSERT will alway happen,when we
> > `echo 1 > /sys/fs/xfs/${disk}/errortag/btree_chk_{s,l}blk`.
> > xfs_btree_islastblock
> >   ->ASSERT(block && xfs_btree_check_block(cur, block, level, bp) == 0);
> >     ->xfs_btree_check_{l/s}block
> >       ->XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BTREE_CHECK_{S,L}BLOCK)
> > we can use error injection to trigger this ASSERT.
> 
> Hmmm...
> 
> > I think ASERRT macro and error injection are to find some effective problems,
> > not to create some kernel panic.
> 
> You can avoid a panic by turning XFS_ASSERT_FATAL off in Kconfig, or
> at runtime by setting fs.xfs.bug_on_assert to 0, but ...
> 
> > So, putting the error injection function in
> > ASSERT is a little strange.
> 
> Ok, so I think the argument is that in the default config, setting this error
> injection tag will immediately result in a system panic, which probably isn't
> what we want.  Is my understanding correct?
> 
> But in the bigger picture, isn't this:
> 
> ASSERT(block && xfs_btree_check_block(cur, block, level, bp) == 0);
> 
> putting a disk corruption check into an ASSERT? That in itself seems a bit
> suspect.  However, the ASSERT was all introduced in:
> 
> commit 27d9ee577dccec94fb0fc1a14728de64db342f86
> Author: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> Date:   Wed Nov 6 08:47:09 2019 -0800
> 
>     xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock
>     
>     Coverity points out that xfs_btree_islastblock doesn't check the return
>     value of xfs_btree_check_block.  Since the question "Does the cursor
>     point to the last block in this level?" only makes sense if the caller
>     previously performed a lookup or seek operation, the block should
>     already have been checked.
>     
>     Therefore, check the return value in an ASSERT and turn the whole thing
>     into a static inline predicate.
>     
>     Coverity-id: 114069
>     Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
>     Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> 
> which seems to imply that we really should not get here with a corrupt block during
> normal operation.
> 
> Perhaps the error tag can get set after the block "should already have been checked"
> but before this test in the ASSERT?

What I want to know here is *how* do we get to the point of tripping
this assertion via debugknob?  Won't the lookup or seek operation have
already checked the block and failed with EFSCORRUPTED?  And shouldn't
that be enough to stop whatever code calls xfs_btree_islastblock?  If
not, how do we get there?

Seriously, I don't want to burn more time discussing where and how to
fail on debugging knobs when there are all these other **far more
serious** corruption and deadlock problems that people are trying to get
merged.

Tell me specifically how to make the system fail.  "It's confusing and
strange" is not good enough.

--D

> -Eric
>