Re: [PATCH] block: reintroduce discard_zeroes_data sysfs file and BLKDISCARDZEROES

On Wed, Aug 16, 2017 at 09:49:22PM -0400, Martin K. Petersen wrote:
> The standards tweaked the definitions a bit so the semantics became
> even more confusing and harder to honor in the drivers.
> 
> As a result, we changed things so that discards are only used to
> de-provision blocks. And the zeroout call/ioctl is used to zero block
> ranges.
> 
> Which ATA/SCSI/NVMe command is issued on the back-end depends on what's
> supported by the device and is hidden from the caller.
> 
> However, zeroout is guaranteed to return a zeroed block range on
> subsequent reads. The blocks may be unmapped, anchored, written
> explicitly, written with write same, or a combination thereof. But you
> are guaranteed predictable results.
> 
> Whereas a discarded region may be sliced and diced and rounded off
> before it hits the device. Which is then free to ignore all or parts of
> the request.
> 
> Consequently, discard_zeroes_data is meaningless. Because there is no
> guarantee that all of the discarded blocks will be acted upon. It
> kinda-sorta sometimes worked (if the device was whitelisted, had a
> reported alignment of 0, a granularity of 512 bytes, stacking didn't get
> in the way, and you were lucky on the device end). But there were always
> conditions.

Thanks for the detailed explanation. That's very useful to know!

> 
> So taking a step back: What information specifically were you trying to
> obtain from querying that flag? And why do you need it?

There are many users that have historically benefited from the
"discard_zeroes_data" semantics. For example mkfs, where it is
beneficial to discard the blocks before creating a file system, and if
we also get deterministic zeroes on read, even better, since we then do
not have to initialize some portions of the file system manually.

Another example might be virtualization, where "Wipe After Delete" and
"Enable Discard" can be supported efficiently when "discard_zeroes_data"
is set. I am sure there are other examples.

So I understand now that Deterministic Read Zero after TRIM is not
reliable, so we do not want to use that flag because we cannot guarantee
it in this case. However, there are other situations where we can, such
as a loop device (which might be especially useful for VMs) where the
backing file system supports hole punching, or even SCSI WRITE SAME with
UNMAP?

Currently user space can call fallocate() with FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_KEEP_SIZE; however, if that succeeds, we are only guaranteed
that the range has been zeroed, not that it has been unmapped/discarded?
(That is not very clear from the comments.) None of the modes seems to
guarantee both zeroing and unmapping on success. And still, there seems
to be no way to tell what is actually supported from user space without
ending up calling fallocate, is there? Whereas before we had
discard_zeroes_data, which people learned to rely on in certain
situations, even though it might have been shaky.

I actually like the rewrite that Christoph did, even though the
documentation seems to be lacking. But I just wonder if it is possible
to bring back the former functionality, at least in some form.

Thanks!
-Lukas


> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering
