Hi Dave,

> My big question here is this:
>
> - is "discard" even relevant for future devices?

It's hard to make predictions. Especially about the future.

But discard is definitely relevant on a bunch of current drives across
the entire spectrum from junk to enterprise. Depending on workload,
over-provisioning, media type, etc.

Plus, as Ric pointed out, thin provisioning is also relevant. Different
use case but exactly the same plumbing.

> IMO, trying to "optimise discard" is completely the wrong direction
> to take. We should be getting rid of "discard" and its interfaces and
> operations - deprecate the ioctls, fix all other kernel callers of
> blkdev_issue_discard() to call blkdev_fallocate()

blkdev_fallocate() is implemented using blkdev_issue_discard().

> and ensure that drive vendors understand that they need to make
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE work, and that
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE is deprecated (like
> discard) and will be going away.

Fast, cheap, easy. Pick any two.

The issue is that -- from the device perspective -- guaranteeing zeroes
requires substantially more effort than deallocating blocks. To the
point where several vendors have given up making it work altogether and
either report no discard support or silently ignore discard requests,
causing you to waste queue slots for no good reason.

So while instant zeroing of a 100TB drive would be nice, I don't think
it's a realistic goal given the architectural limitations of many of
these devices. Conceptually, you'd think it would be as easy as
unlinking an inode. But in practice the devices keep much more (and
different) state around in their FTLs than a filesystem does in its
metadata.

Wrt. device command processing performance:

1. Our expectation is that REQ_DISCARD (FL_PUNCH_HOLE |
   FL_NO_HIDE_STALE), which gets translated into ATA DSM TRIM, NVMe
   DEALLOCATE, or SCSI UNMAP, executes in O(1) regardless of the number
   of blocks operated on.
   Due to the ambiguity of ATA DSM TRIM and early SCSI we ended up in a
   situation where the industry applied additional semantics
   (deterministic zeroing) to that particular operation. And that has
   caused grief because devices often end up in the O(n-or-worse)
   bucket when determinism is a requirement.

2. Our expectation for the allocating REQ_ZEROOUT (FL_ZERO_RANGE),
   which gets translated into NVMe WRITE ZEROES or SCSI WRITE SAME, is
   that the command executes in O(n) but that it is faster -- or at
   least not worse -- than doing a regular WRITE to the same block
   range.

3. Our expectation for the deallocating REQ_ZEROOUT (FL_PUNCH_HOLE),
   which gets translated into ATA DSM TRIM w/ whitelist, NVMe WRITE
   ZEROES w/ DEAC, or SCSI WRITE SAME w/ UNMAP, is that the command
   will execute in O(1) for any portion of the block range described by
   the I/O that is aligned to and a multiple of the internal device
   granularity. With an additional small O(n_head_LBs) + O(n_tail_LBs)
   overhead for zeroing any LBs at the beginning and end of the block
   range described by the I/O that do not comprise a full block wrt.
   the internal device granularity.

Does that description make sense?

The problem is that most vendors implement (3) using (1). But they
can't make it work well because (3) was -- and still is for ATA --
outside the scope of what the protocols can express.

And I agree with you that if (3) were implemented correctly in all
devices, we wouldn't need (1) at all. At least not for devices with an
internal granularity << total capacity.

-- 
Martin K. Petersen	Oracle Linux Engineering