Re: [PATCH 01/12] generic/757: fix various bugs in this test

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 21 Nov 2024 08:04:15 -0800

On Thu, Nov 21, 2024 at 07:28:09AM -0500, Brian Foster wrote:
> On Thu, Nov 21, 2024 at 11:05:55AM +0100, Christoph Hellwig wrote:
> > On Thu, Nov 21, 2024 at 05:56:24PM +0800, Zorro Lang wrote:
> > > I didn't merge this patch last week because we were still discussing
> > > the "discards" issue:
> > > 
> > > https://lore.kernel.org/fstests/20241115182821.s3pt4wmkueyjggx3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#u
> > > 
> > > Do you think we need to force discards here, or change SCRATCH_DEV
> > > to dm-thin to support discards?
> > 
> > FYI, I'm seeing regular failures with generic/757 when using Darrick's
> > not yet merged RT rmap support, but only with that.
> > 
> > But the whole discard thing leaves me really confused, and the commit
> > log in the patch referenced by the above link doesn't clear that up
> > either.
> > 
> > Why does dmlogwrites require discard for XFS (and apparently XFS only)?
> > Note that discard is not required and often does not zero data.  So
> > if we need data to be zeroed we need to do that explicitly, and
> > preferably in a way that is obvious.
> > 
> 
> IIRC it was to accommodate the test program, which presumably used
> discard for efficiency reasons because it did a lot of context switching
> to different point-in-time variations of the fs. If the discard didn't
> actually zero the range (depending on the underlying test dev), then at
> least on XFS, we'd see odd recovery issues and whatnot from the fs going
> forward/back in time.

Yes, that's my recollection too -- performing a logwrite replay of an
old mark means that you can end up with blocks with the correct fs uuid
but an LSN that's higher than anything in the log.  Recovery will then
skip the block replay, which is not correct.
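
To make that failure mode concrete, here is a toy Python model of the
skip decision.  It is a simplified sketch, not the actual XFS recovery
code: the DiskBlock type, the recovery_replays() helper, the integer
LSNs, and the UUID string are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskBlock:
    uuid: str  # filesystem UUID stamped in the block header
    lsn: int   # LSN of the last log item written into this block


def recovery_replays(on_disk: Optional[DiskBlock], fs_uuid: str, item_lsn: int) -> bool:
    """Return True if this toy recovery would replay the logged change."""
    if on_disk is None or on_disk.uuid != fs_uuid:
        # Zeroed or foreign block: nothing trustworthy on disk, so replay.
        return True
    # A block that already looks as new as the log item is left alone.
    return on_disk.lsn < item_lsn


FS_UUID = "hypothetical-fs-uuid"  # placeholder, not a real UUID

# Block left behind by an earlier replay of a *later* mark, never zeroed.
stale = DiskBlock(uuid=FS_UUID, lsn=900)

print(recovery_replays(stale, FS_UUID, item_lsn=100))  # False: replay skipped
print(recovery_replays(None, FS_UUID, item_lsn=100))   # True: zeroed block is replayed
```

The stale block carries the right uuid and a "newer" LSN, so the logged
change is never applied, even though the block contents belong to a
future that the replayed log knows nothing about.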

I suppose we could fix log recovery to treat incoming block LSNs that
are higher than the log head as if there were no block contents at all.
OTOH going backwards in time isn't usually a concern...right?
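
Roughly, the idea would look like the following toy Python sketch.  This
is only an illustration of the rule, not a proposed patch: the function
name, the integer LSNs, and the use of None for a zeroed block are all
assumptions made for the example.

```python
from typing import Optional


def replay_with_head_check(block_lsn: Optional[int], item_lsn: int, head_lsn: int) -> bool:
    """Return True if the logged change should be replayed onto the block.

    block_lsn is None when the on-disk block is zeroed (no contents).
    """
    if block_lsn is None:
        return True
    if block_lsn > head_lsn:
        # The block claims to be newer than anything the log has ever
        # written, so treat it as if it had no contents at all.
        return True
    # Otherwise keep the usual rule: skip blocks that are already current.
    return block_lsn < item_lsn


# Log head at LSN 200; the stale block says LSN 900.
print(replay_with_head_check(900, item_lsn=100, head_lsn=200))  # True
print(replay_with_head_check(150, item_lsn=100, head_lsn=200))  # False
```

With the extra check, a block from "the future" is overwritten instead
of being trusted, while blocks within the log's range behave as before.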

> Therefore the reason for using dm-thin was that it was an easy way to
> provide predictable behavior to the test program, where discards punch
> out blocks that subsequently return zeroes.

Yep.  The test needs to reset the block device to a zeroed state.
Discards get us there quickly, but only if discard_zeroes_data==1.
Hence bolting dm-thinp (where this is guaranteed) onto the logwrites
tests.
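
As a toy illustration of why the zeroing guarantee matters, here is a
small Python model of replaying to two different marks.  ToyDevice,
replay_to_mark(), and the three-entry log are invented for the example
and have nothing to do with the real dm-log-writes tooling.

```python
ZERO = "0"  # stands in for an all-zeroes block


class ToyDevice:
    def __init__(self, nblocks: int, discard_zeroes: bool):
        self.blocks = [ZERO] * nblocks
        self.discard_zeroes = discard_zeroes

    def write(self, blkno: int, data: str) -> None:
        self.blocks[blkno] = data

    def discard_all(self) -> None:
        if self.discard_zeroes:
            # dm-thin-like behavior: discarded blocks read back as zeroes.
            self.blocks = [ZERO] * len(self.blocks)
        # Otherwise the discard is only a hint and old contents may remain.


def replay_to_mark(dev: ToyDevice, log, mark: int) -> None:
    """Reset the device, then replay the first `mark` log entries."""
    dev.discard_all()
    for blkno, data in log[:mark]:
        dev.write(blkno, data)


log = [(0, "A1"), (1, "B1"), (2, "C1")]
thin = ToyDevice(3, discard_zeroes=True)
plain = ToyDevice(3, discard_zeroes=False)

for dev in (thin, plain):
    replay_to_mark(dev, log, mark=3)  # go forward to the last mark
    replay_to_mark(dev, log, mark=1)  # then back to an older mark

print(thin.blocks)   # ['A1', '0', '0']
print(plain.blocks)  # ['A1', 'B1', 'C1']
```

On the device without the guarantee, blocks 1 and 2 still hold contents
from the later mark after going back in time, which is exactly the
state that confuses recovery.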

> I don't recall all the specifics, but I thought part of the reason for
> using discard over explicit zeroing was the latter made the test
> impractically slow. I could be misremembering, but if you want to change
> it I'd suggest to at least verify runtimes on some of the preexisting
> logwrites tests as well.

Not sure -- I think BLKZEROOUT will cause allocations and real disk
writes if we're not careful.
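
For reference, the two reset options differ only in the ioctl issued.
A minimal Python sketch, assuming Linux, root, and a scratch block
device you are willing to destroy; reset_range() and byte_range() are
hypothetical helpers, and the ioctl numbers are the ones from
<linux/fs.h>:

```python
import fcntl
import os
import struct

BLKDISCARD = 0x1277  # _IO(0x12, 119)
BLKZEROOUT = 0x127F  # _IO(0x12, 127)


def byte_range(start: int, length: int) -> bytes:
    """Pack the uint64_t range[2] = {start, length} argument (in bytes)."""
    return struct.pack("=QQ", start, length)


def reset_range(dev_path: str, start: int, length: int, zeroout: bool) -> None:
    """Discard or zero out [start, start + length) on a block device.

    BLKDISCARD makes no promise about what later reads return.
    BLKZEROOUT guarantees zeroes, but the kernel may fall back to
    writing zeroes, which is slow and allocates space on thin devices.
    """
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        request = BLKZEROOUT if zeroout else BLKDISCARD
        fcntl.ioctl(fd, request, byte_range(start, length))
    finally:
        os.close(fd)
```

Both start and length need to be aligned to the device's logical block
size.  I have not measured either path; comparing runtimes of the two
on a real scratch device would be the way to answer the slowness
question.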

--D

> Brian