Re: [PATCH 1/4] generic: require discard zero behavior for dmlogwrites on XFS

On Sat, Aug 29, 2020 at 07:46:59AM +0100, Christoph Hellwig wrote:
> On Thu, Aug 27, 2020 at 02:35:07PM -0400, Brian Foster wrote:
> > OTOH, perhaps the thinp behavior could be internal, but conditional
> > based on XFS. It's not really clear to me if this problem is more of an
> > XFS phenomenon or just that XFS happens to have some unique recovery
> > checking logic that explicitly detects it. It seems more like the
> > latter, but I don't know enough about ext4 or btrfs to say..
> 
> The way I understand the tests (and Josef's mail seems to confirm that)
> is that they use discards to ensure data disappears.  Unfortunately,
> that's only how discard works some of the time, not all of the time.
> 

I think Amir's followup describes how the infrastructure uses discard
better than I could. I'm not intimately familiar with how it works, so
my goal was to take the same approach as generic/482 and update the
tests to provide the predictable behavior expected by the
infrastructure. If folks want to revisit all of that and improve the
tests so they no longer rely on discard, that seems like a fine
direction, but it also seems like something that can come later as an
improvement to the broader infrastructure.

> > > We have a write zeroes operation in the block layer.  For some devices
> > > this is as efficient as discard, and that should (I think) include dm.
> > > 
> > 
> > Do you mean BLKZEROOUT? I see that it's more efficient than writing
> > zeroes from userspace, but I don't think it's efficient enough to solve
> > this problem. It takes about three minutes to manually zero my 15GB lvm
> > (dm-linear) scratch device on my test VM via dd using sync writes. A
> > 'blkdiscard -z' saves me about half that time, but IIRC this is an
> > operation that would occur every time the logwrites device is replayed
> > to a particular recovery point (which can happen many times per test).
> 
> Are we talking about the block layer interface or the userspace syscall
> one?  I thought it was the former, in which case REQ_OP_WRITE_ZEROES
> is the interface.  The user interface is harder - you need to use
> fallocate on the block device, but the flags are mapped kind of weirdly:
> 
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE guarantees you a
> REQ_OP_WRITE_ZEROES, but there are a bunch of other variants that
> include fallbacks.
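
For reference, the userspace call being described would look something
like the sketch below. It's just an illustration: the device path and
length are placeholders, and per the description above this is the
variant that should guarantee REQ_OP_WRITE_ZEROES rather than one of
the fallback paths.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical scratch device and length */
	const char *dev = "/dev/mapper/test-scratch";
	off_t len = 1ULL << 30;

	int fd = open(dev, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Zero the range; with PUNCH_HOLE|KEEP_SIZE on a block device
	 * this should map to REQ_OP_WRITE_ZEROES per the above.
	 */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, len) < 0)
		perror("fallocate");

	close(fd);
	return 0;
}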
> 

I was using the BLKZEROOUT ioctl in my previous test because
fallocate(PUNCH_HOLE|KEEP_SIZE) (the zeroing offload) isn't supported on
this device. I see results similar to the above with
fallocate(PUNCH_HOLE|KEEP_SIZE) though, which seems to fall back to
__blkdev_issue_zero_pages() in that case.
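
For completeness, the BLKZEROOUT path I was timing amounts to roughly
the following; the device path and range are placeholders, and the
range array is { start, length } in bytes:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical scratch device */
	const char *dev = "/dev/mapper/test-scratch";
	uint64_t range[2] = { 0, 1ULL << 30 };	/* start, length in bytes */

	int fd = open(dev, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* kernel-side zeroing of the range; no zeroing offload required */
	if (ioctl(fd, BLKZEROOUT, range) < 0)
		perror("BLKZEROOUT");

	close(fd);
	return 0;
}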

Brian



