Re: [PATCH] common/fuzzy: don't attempt online scrubbing unless supported

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 27 Apr 2020 10:03:30 -0700

On Fri, Apr 24, 2020 at 01:03:30PM +0200, Anthony Iliopoulos wrote:
> On Tue, Apr 21, 2020 at 08:37:17AM -0700, Darrick J. Wong wrote:
> > On Tue, Apr 21, 2020 at 01:36:42PM +0200, Anthony Iliopoulos wrote:
> > > Many xfs metadata fuzzing tests invoke xfs_scrub to detect online errors
> > > even when _scratch_xfs_fuzz_metadata is invoked with "offline". This
> > > causes those tests to fail with output mismatches on kernels that don't
> > > enable CONFIG_XFS_ONLINE_SCRUB. Bypass scrubbing when not supported.
> > > 
> > > Signed-off-by: Anthony Iliopoulos <ailiop@xxxxxxxx>
> > > ---
> > >  common/fuzzy | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/common/fuzzy b/common/fuzzy
> > > index 988203b1..83ddc3e8 100644
> > > --- a/common/fuzzy
> > > +++ b/common/fuzzy
> > > @@ -238,7 +238,7 @@ __scratch_xfs_fuzz_field_test() {
> > >  	if [ $res -eq 0 ]; then
> > >  		# Try an online scrub unless we're fuzzing ag 0's sb,
> > >  		# which scrub doesn't know how to fix.
> > > -		if [ "${repair}" != "none" ]; then
> > > +		if _supports_xfs_scrub "${SCRATCH_MNT}" "${SCRATCH_DEV}"; then
> > 
> > Huh?
> > 
> > This changes the behavior of the repair=none fuzz tests, which mutate
> > filesystems and then write to them without running any checking tools
> > whatsoever to see if we can trip over the mutations with only regular
> > filesystem operations.  Whereas before, we'd skip xfs_scrub, now we
> > always run it if it's supported.
> 
> oops...right, we want to let the verifiers catch the errors here.
> 
> Speaking of which, I've been staring at the scripts but it's not clear
> to me how the repair=none fuzz tests are expected to function. Many of
> those tests corrupt AG metadata headers (e.g. the AGI lsn), which means
> mount bails out with an error. But the golden output doesn't account for
> that, so those tests will fail (e.g. xfs/456).

None of the dangerous_fuzz* tests have golden output.  I've thought
about posting the ones that are in my dev tree, but there's probably
not much point until I/we finish fixing the things that repair misses.
Ditto with scrub.

TBH I wrote all these field fuzzers solely as a means to fuzz things
systematically, and didn't think too hard about using them as regression
tests.  Back when I introduced the dangerous_norepair tests, it was
success enough if the kernel didn't crash.  I /think/ we've tightened
things up enough that it's time to move on to more careful checking for
runtime errors.

> Further, for things like inode metadata fuzzing where the fs is usually
> mountable, the tests will always succeed irrespective of the verifiers
> firing or not (e.g. xfs/465).
> 
> I'd assume all those repair=none tests would need to check dmesg for
> metadata corruptions, so something like:
> 
> --- a/common/fuzzy
> +++ b/common/fuzzy
> @@ -254,3 +254,3 @@ __scratch_xfs_fuzz_field_test() {
>  		__scratch_xfs_fuzz_unmount
> -	else
> +	elif [ "${repair}" != "none" ]; then
>  		(>&2 echo "re-mount failed ($res) with ${field} = ${fuzzverb}.")

Hmm, seems reasonable.  The _scratch_fuzz_modify helper probably needs
to be modified to complain if the fs writes actually succeed.

> --- a/tests/xfs/456
> +++ b/tests/xfs/456
> @@ -43,2 +43,5 @@ echo "Done fuzzing AGI"
> 
> +_check_dmesg_for "Metadata corruption detected" || \
> +	_fail "Missing metadata corruption messages!"

I'd put this at the end of __scratch_xfs_fuzz_field_test, since there
are dozens of tests that use this function, not just xfs/456.

The long term goal is to make all the corruption bailouts report
themselves to the online health system so that userspace can query the
filesystem status (via xfs_spaceman) without having to scrape dmesg.

--D

> +
>  # success, all done
> 
> If this makes sense, I'll send a separate patch to address it, and fix
> all repair=none tests as above.
> 
> > The open-coded repair conditionals scattered through this funciton
> > probably ought to be refactored into helpers, e.g.
> > 
> > __fuzz_want_scrub_check() {
> > 	local repair="$1"
> > 	local field="$2"
> > 
> > 	test "${repair}" != "none" && \
> > 		_supports_xfs_scrub "${SCRATCH_MNT}" "${SCRATCH_DEV}" && \
> > 		[ "${field}" != "sb 0" ]
> > }
> > 
> > if [ $res -eq 0 ]; then
> > 	# Try an online scrub...
> > 	if __fuzz_want_scrub_check "${repair}" "${field}"; then
> > 		_scratch_scrub -n -a 1 -e continue 2>&1
> > 		...
> 
> Will do that and send a v2, thanks for the review!