Re: [PATCH] generic: skip dm-log-writes tests on XFS v5 superblock filesystems

On Wed, Feb 27, 2019 at 6:19 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> On 2019/2/27 12:06 PM, Amir Goldstein wrote:
> > On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >>
> >> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> >>> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@xxxxxxxxxx> wrote:
> >>>>
> >>>> The dm-log-writes mechanism runs a workload against a filesystem,
> >>>> tracks underlying FUAs and restores the filesystem to various points
> >>>> in time based on FUA marks. This allows fstests to check fs
> >>>> consistency at various points and verify log recovery works as
> >>>> expected.
> >>>>
> >>>
> >>> Inaccurate. generic/482 restores to FUA points.
> >>> generic/45[57] restore to user defined points in time (marks).
> >>> dm-log-writes mechanism is capable of restoring either.
> >>>
> >>>> This mechanism does not play well with LSN based log recovery
> >>>> ordering behavior on XFS v5 superblocks, however. For example,
> >>>> generic/482 can reproduce false positive corruptions based on extent
> >>>> to btree conversion of an inode if the inode and associated btree
> >>>> block are written back after different checkpoints. Even though both
> >>>> items are logged correctly in the extent-to-btree transaction, the
> >>>> btree block can be relogged (multiple times) and only written back
> >>>> once when the filesystem unmounts. If the inode was written back
> >>>> after the initial conversion, recovery points between that mark and
> >>>> when the btree block is ultimately written back will show corruption
> >>>> because log recovery sees that the destination buffer is newer than
> >>>> the recovered buffer and intentionally skips the buffer. This is a
> >>>> false positive because the destination buffer was resiliently
> >>>> written back after being physically relogged one or more times.
> >>>>
> >>>
> >>> This story doesn't add up.
> >>> Either dm-log-writes emulates power failure correctly or it doesn't.
> >>> My understanding is that the issue you are seeing is a result of
> >>> XFS seeing "data from the future" after a restore of a power failure
> >>> snapshot, because the scratch device is not a clean slate.
> >>> If I am right, then the correct solution is to wipe the journal before
> >>> starting to replay restore points.
> >>
> >> If that is the problem, then I think we should be wiping the entire
> >> block device before replaying the recorded logwrite.
> >>
> >
> > Indeed.
>
> May I ask a stupid question?
>
> How does it matter whether the device is clean or not?
> Shouldn't the journal/metadata or whatever be self-contained?
>

Yes and no.

The simplest example (not limited to xfs, and I'm not sure it works exactly
like that in xfs) is how you find the last valid journal commit entry. It
should have a correct CRC and the largest LSN. But if you replay IO on top of
an existing journal without wiping it first, then journal recovery will
continue past the point you meant to replay to, or worse.
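
For example, a minimal sketch of the "wipe before replay" step suggested
above - assuming the usual $SCRATCH_DEV/$LOGWRITES_DEV test variables and
the existing replay-log options, with $mark standing for whichever
checkpoint the test restores to - could look like:

# wipe the replay target so no stale journal/metadata survives the restore
blkdiscard $SCRATCH_DEV   # or: dd if=/dev/zero of=$SCRATCH_DEV bs=1M
# replay the recorded log from the start up to the wanted checkpoint
./src/log-writes/replay-log --log $LOGWRITES_DEV --replay $SCRATCH_DEV \
        --end-mark "$mark"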

The problem that Brian describes is more complicated than that and, IIUC, not
limited to the data in the journal, but I think what I described above may
also plague ext4 and xfs v4.

> >
> >> i.e. this sounds like a "block device we are replaying onto has
> >> stale data in it" problem because we are replaying the same
> >> filesystem over the top of itself.  Hence there are no unique
> >> identifiers in the metadata that can detect stale metadata in
> >> the block device.
> >>
> >> I'm surprised that we haven't tripped over this much earlier than
> >> this...
> >>
> >
> > I remember asking myself the same thing... it's coming back to me
> > now. I really remember having this discussion during test review.
> > generic/482 is an adaptation of Josef's test script [1], which
> > does log recovery onto a snapshot on every FUA checkpoint.
> >
> > [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
> >
> > Setting up snapshots for every checkpoint was found empirically to take
> > more test runtime than replaying the log from the start for each checkpoint.
> > That observation was limited to the systems that Qu and Eryu tested on.
> >
> > IIRC, what usually took care of cleaning the block device is replaying the
> > "discard everything" IO from mkfs time.
>
> This "discard everything" assumption doesn't look right to me.
> Although most mkfs implementations would discard at least part of the
> device, even without discarding, the newly created fs should be
> self-contained, with no wild pointers into garbage.
>

It's true. We shouldn't make this assumption.
That was my explanation for Dave's question of how come we didn't see
this before.

Here is my log-writes info from generic/482:
./src/log-writes/replay-log -vv --find --end-mark mkfs --log $LOGWRITES_DEV \
        | grep DISCARD
seek entry 0@2: 0, size 8388607, flags 0x4(DISCARD)
seek entry 1@3: 8388607, size 8388607, flags 0x4(DISCARD)
seek entry 2@4: 16777214, size 4194306, flags 0x4(DISCARD)
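
As a sanity check (assuming the sizes above are in 512-byte sectors, and
using $SCRATCH_DEV as a stand-in for the replay target), those discard
extents are contiguous and should add up to the full device size:

blockdev --getsz $SCRATCH_DEV
# expect 8388607 + 8388607 + 4194306 = 20971520 sectors for this run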

> I thought all metadata/journal writes should be self-contained, even for
> later fs writes.
>
> Am I missing something? Or do I get too poisoned by btrfs CoW?
>

I'd be very surprised if btrfs cannot be fooled by seeing stale data "from
the future" in the block device. Seems to me like the entire concept of
CoW and metadata checksums is completely subverted by the existence
of correct checksums on "stale metadata from the future".

Thanks,
Amir.



