Re: [PATCH] xfs/larp: Make test failures debuggable

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 7 May 2022 07:48:59 +1000

On Sat, May 07, 2022 at 04:02:12AM +0800, Zorro Lang wrote:
> On Fri, May 06, 2022 at 06:08:08PM +0000, Catherine Hoang wrote:
> > > On May 6, 2022, at 9:40 AM, Zorro Lang <zlang@xxxxxxxxxx> wrote:
> > > 
> > > On Fri, May 06, 2022 at 09:14:42AM -0700, Darrick J. Wong wrote:
> > >> On Fri, May 06, 2022 at 05:51:41PM +1000, Dave Chinner wrote:
> > >>> From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >>> 
> > >>> Md5sum output for attributes created combined program output and
> > >>> attribute values. This adds variable path names to the md5sum, so
> > >>> there's no way to tell if the md5sum is actually correct for the
> > >>> given attribute value that is returned as it's not constant from
> > >>> test to test. Hence we can't actually say that the output is correct
> > >>> because we can't reproduce exactly what we are hashing easily.
> > >>> 
> > >>> Indeed, the last attr test in series (node w/ replace) had an
> > >>> invalid md5sum. The attr value being produced after recovery was
> > >>> correct, but the md5sum output match was failing. Golden output
> > >>> appears to be wrong.
> > >>> 
> > >>> Fix this issue by seperately dumping all the attributes on the inode
> > >>> via a list operation to indicate their size, then dump the value of
> > >>> the test attribute directly to md5sum. This means the md5sum for
> > >>> the attributes using the same fixed values are all identical, so
> > >>> it's easy to tell if the md5sum for a given test is correct. We also
> > >>> check that all attributes that should be present after recovery are
> > >>> still there (e.g. checks recovery didn't trash innocent bystanders).
> > >>> 
> > >>> Further, the attribute replace tests replace an attribute with an
> > >>> identical value, hence there is no way to tell if recovery has
> > >>> resulted in the original being left behind or the new attribute
> > >>> being fully recovered because both have the same name and value.
> > >>> When replacing the attribute value, use a different sized value so
> > >>> it is trivial to determine that we've recovered the new attribute
> > >>> value correctly.
> > >>> 
> > >>> Also, the test runs on the scratch device - there is no need to
> > >>> remove the testdir in _cleanup. Doing so prevents post-mortem
> > >>> failure analysis because it burns the dead, broken corpse to ash and
> > >>> leaves no way of to determine cause of death.
> > >>> 
> > >>> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > >>> ---
> > >>> 
> > >>> Hi Catherine,
> > >>> 
> > >>> These are all the mods I needed to make to be able to understand the
> > >>> test failures I was getting as I debugged the new LARP recovery
> > >>> algorithm I've written.  You'll need to massage the test number in
> > >>> this patch to apply it on top of your patch.
> > >>> 
> > >>> I haven't added any new test cases yet, nor have I done anything to
> > >>> manage the larp sysfs knob, but we'll need to do those in the near
> > >>> future.
> > >>> 
> > >>> Zorro, can you consider merging this test in the near future?  We're
> > >>> right at the point of merging the upstream kernel code and so really
> > >>> need to start growing the test coverage of this feature, and this
> > >>> test should simply not-run on kernels that don't have the feature
> > >>> enabled....
> > >>> 
> > >>> Cheers,
> > >>> 
> > >>> Dave.
> > >>> ---
> > >>> 
> > >>> tests/xfs/600     |  20 +++++-----
> > >>> tests/xfs/600.out | 109 ++++++++++++++++++++++++++++++++++++------------------
> > >>> 2 files changed, 85 insertions(+), 44 deletions(-)
> > >>> 
> > >>> diff --git a/tests/xfs/600 b/tests/xfs/600
> > >>> index 252cdf27..84704646 100755
> > >>> --- a/tests/xfs/600
> > >>> +++ b/tests/xfs/600
> > >>> @@ -16,7 +16,7 @@ _begin_fstest auto quick attr
> > >>> 
> > >>> _cleanup()
> > >>> {
> > >>> -	rm -rf $tmp.* $testdir
> > >>> +	rm -rf $tmp.*
> > >>> 	test -w /sys/fs/xfs/debug/larp && \
> > >>> 		echo 0 > /sys/fs/xfs/debug/larp
> > >> 
> > >> Blergh, this ^^^^^^^^^ is going to need fixing too.

Yes, I did point that out.

> > >> 
> > >> Please save the old value, then write it back in the _cleanup function.
> > > 
> > > Ok, I'm going to do that when I merge it,

No, please don't. I don't want random changes added to the test on
commit right now as I'm still actively working on it. I've said
twice now that this needs fixing (3 if you count this mention) and
that the test coverage also needs improving. If someone is still
working on the tests, then why make more work for everyone by making
unnecessary, unreviewed changes on commit?

> > > if Catherine wouldn't like to do
> > > more changes in a V8 patch. If this case still need more changes, please tell
> > > me in time, and then it might have to wait the fstests release after next, if
> > > too late.
> > > 
> > > Thanks,
> > > Zorro
> > 
> > Based on Dave’s feedback, it looks like the patch will need a few more
> > changes before it’s ready.

That doesn't mean it can't be merged. It is a pain for mulitple
people to collaborate on a test that isn't merged because the test
number is not set in stone and chosing numbers always conflicts with
other tests under development. Getting the test merged early makes
knocking all the problems out of the test (and the code it is
testing!) much, much easier.

> Great, that would be really helpful if you'd like to rebase this patch to fstests
> for-next branch. And how about combine Dave's patch with your patch? I don't think
> it's worth merging two seperated patches for one signle new case. What does Dave think?
> 
> I just merged below two patchset[1] into xfs-5.18-fixes-1, then tried to test this case
> (with you and Dave's patch). But it always failed as [2].

You built a broken kernel as it has none of the dependencies and bug
fixes that had already been committed to for-next for the new
functionality to work correctly. I posted a V3 patchset last night
and a published a git branch with all the kernel changes that you
can use for testing if you want.

> Please make sure it works
> as expected with those kernel patches still reviewing,

I did - the failures you see are expected from what you were
testing. i.e. the test ran just fine, the kernel code you were
running is buggy and you didn't update xfsprogs so logprint
supported the new log types, hence the test (correctly) reported
failures.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx