On Thu, Feb 11, 2021 at 01:35:24PM -0500, Brian Foster wrote:
> On Thu, Feb 11, 2021 at 10:12:34AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 11, 2021 at 08:59:58AM -0500, Brian Foster wrote:
> > > On Tue, Feb 09, 2021 at 06:56:30PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > >
> > > > Capture metadump output when various userspace repair and checker tools
> > > > fail or indicate corruption, to aid in debugging.  We don't bother to
> > > > annotate xfs_check because it's bitrotting.
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > ---
> > > >  README     |    2 ++
> > > >  common/xfs |   26 ++++++++++++++++++++++++++
> > > >  2 files changed, 28 insertions(+)
> > > >
> > > >
> > > > diff --git a/README b/README
> > > > index 43bb0cee..36f72088 100644
> > > > --- a/README
> > > > +++ b/README
> > > > @@ -109,6 +109,8 @@ Preparing system for tests:
> > > >       - Set TEST_FS_MODULE_RELOAD=1 to unload the module and reload
> > > >         it between test invocations.  This assumes that the name of
> > > >         the module is the same as FSTYP.
> > > > +     - Set SNAPSHOT_CORRUPT_XFS=1 to record compressed metadumps of XFS
> > > > +       filesystems if the various stages of _check_xfs_filesystem fail.
> > > >
> > > >     - or add a case to the switch in common/config assigning
> > > >       these variables based on the hostname of your test
> > > > diff --git a/common/xfs b/common/xfs
> > > > index 2156749d..ad1eb6ee 100644
> > > > --- a/common/xfs
> > > > +++ b/common/xfs
> > > > @@ -432,6 +432,21 @@ _supports_xfs_scrub()
> > > >  	return 0
> > > >  }
> > > >
> > > > +# Save a compressed snapshot of a corrupt xfs filesystem for later debugging.
> > > > +_snapshot_xfs() {
> > >
> > > The term snapshot has a well known meaning. Can we just call this
> > > _metadump_xfs()?
> >
> > Ok.
> >
> > >
> > > > +	local metadump="$1"
> > > > +	local device="$2"
> > > > +	local logdev="$3"
> > > > +	local options="-a -o"
> > > > +
> > > > +	if [ "$logdev" != "none" ]; then
> > > > +		options="$options -l $logdev"
> > > > +	fi
> > > > +
> > > > +	$XFS_METADUMP_PROG $options "$device" "$metadump" >> "$seqres.full" 2>&1
> > > > +	gzip -f "$metadump" >> "$seqres.full" 2>&1 &
> > >
> > > Why compress in the background?
> >
> > Sometimes the metadumps can become very large and I don't tend to have a
> > lot of space on the test appliances for storing blobs.
> >
> > Also, I was under the impression that it was customary for people to
> > share compressed metadumps of crashes, so why not save everyone a step?
> >
> > I do this in the background to avoid holding up the next fstest.
> >
> > > I wonder if we should just skip the
> > > compression step since this requires an option to enable in the first
> > > place..
> >
> > Seeing as it's optional, I think that's all the more reason to compress.
> >
>
> That's fair. It was more the background task that I was concerned about.
> If the issue is that the compression takes too long, ISTM there's a
> similar risk of the background compression conflicting with ongoing
> tests. E.g., we have various tests that scale out I/O threads to extreme
> levels and could delay the compression even longer (or vice versa), we
> have no way to prevent multiple background compression tasks from
> starting/competing as tests continue to run, etc.

<nod> Admittedly I chose gzip because it's decent in both the speed and
compression ratio traits; someone else might want xz --extreme.

> What about allowing the user to specify an optional env var in the
> config file to provide a compression command to use? If set, compress
> the file in the foreground. Then the user can determine whether
> compression is necessary at all, and if so, which compression tool might
> provide a suitable enough time/space tradeoff for the test environment
> (i.e., something like lz4 might be faster than gzip or bzip2 at the cost
> of space).

Good idea!  If the user sets SNAPSHOT_XFS_COMPRESSOR to the compressor
program of their choice (e.g. 'gzip -9') then we'll use that to compress
the metadump.

It also occurred to me that I could refactor _scratch_metadump to use
this new helper, so I think I'll implement some means for letting actual
tests disable compression unconditionally.

--D

> Brian
>
> >
> > >
> > > > +}
> > > > +
> > > >  # run xfs_check and friends on a FS.
> > > >  _check_xfs_filesystem()
> > > >  {
> > > ...
> > > > @@ -540,6 +564,8 @@ _check_xfs_filesystem()
> > > >  		cat $tmp.repair >>$seqres.full
> > > >  		echo "*** end xfs_repair output" >>$seqres.full
> > > >
> > > > +		test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
> > > > +			_snapshot_xfs "$seqres.rebuildrepair.md" "$device" "$2"
> > >
> > > Why do we collect so many metadump images? Shouldn't all but the last
> > > TEST_XFS_REPAIR_REBUILD thing not modify the fs? If so, it seems like we
> > > should be able to collect one image (and perhaps just call it
> > > "$seqres.$device.md") if any of the first several checks flag a problem.
> >
> > Yes, the number of metadumps collected can be reduced to two.  One if
> > scrub or logprint or repair -n fail, and a second one if the user set
> > TEST_XFS_REPAIR_REBUILD=1 and either the repair or the repair -n fail.
> >
> > Will change that.
> >
> > --D
> >
> > >
> > > Brian
> > >
> > > >  		ok=0
> > > >  	fi
> > > >  	rm -f $tmp.repair
> > > >
> > >
> >
>
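
For illustration only, here is a rough sketch of how the helper might look
with the review feedback from this thread folded in: the rename to
_metadump_xfs, and foreground compression that only happens when the user
sets SNAPSHOT_XFS_COMPRESSOR. This is an assumption about the eventual
shape, not the next revision of the patch. XFS_METADUMP_PROG and seqres
are expected to be provided by the fstests environment, as in the quoted
diff.

```shell
# Hypothetical sketch, not the posted patch.
# Save a metadump of a corrupt xfs filesystem for later debugging.
_metadump_xfs() {
	local metadump="$1"
	local device="$2"
	local logdev="$3"
	local options="-a -o"

	# An external log device is passed through only when one is in use.
	if [ "$logdev" != "none" ]; then
		options="$options -l $logdev"
	fi

	$XFS_METADUMP_PROG $options "$device" "$metadump" >> "$seqres.full" 2>&1

	# Compress in the foreground, and only if the user asked for it, so
	# that no compression job is left competing with the next test.
	if [ -n "${SNAPSHOT_XFS_COMPRESSOR:-}" ]; then
		$SNAPSHOT_XFS_COMPRESSOR "$metadump" >> "$seqres.full" 2>&1
	fi
}
```

With something like SNAPSHOT_XFS_COMPRESSOR='gzip -9' in the config file
the image ends up as "$metadump.gz"; leaving the variable unset keeps the
raw metadump. Unlike the quoted patch, this sketch does not pass -f to the
compressor, so a user who wants stale images overwritten would have to
include that in the variable.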