Hi Dave,

> My big question here is this:
>
> - is "discard" even relevant for future devices?

It's hard to make predictions. Especially about the future.

But discard is definitely relevant on a bunch of current drives across
the entire spectrum from junk to enterprise. Depending on workload,
over-provisioning, media type, etc.

Plus, as Ric pointed out, thin provisioning is also relevant. Different
use case but exactly the same plumbing.

> IMO, trying to "optimise discard" is completely the wrong direction
> to take. We should be getting rid of "discard" and its interfaces and
> operations - deprecate the ioctls, fix all other kernel callers of
> blkdev_issue_discard() to call blkdev_fallocate()

blkdev_fallocate() is implemented using blkdev_issue_discard().

> and ensure that drive vendors understand that they need to make
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE work, and that
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE is deprecated (like
> discard) and will be going away.

Fast, cheap, easy. Pick any two.

The issue is that -- from the device perspective -- guaranteeing zeroes
requires substantially more effort than deallocating blocks. To the
point where several vendors have given up making it work altogether and
either report no discard support or silently ignore discard requests,
causing you to waste queue slots for no good reason.

So while instant zeroing of a 100TB drive would be nice, I don't think
it's a realistic goal given the architectural limitations of many of
these devices. Conceptually, you'd think it would be as easy as
unlinking an inode. But in practice the devices keep much more (and
different) state around in their FTLs than a filesystem does in its
metadata.

Wrt. device command processing performance:

1. Our expectation is that REQ_DISCARD (FL_PUNCH_HOLE |
   FL_NO_HIDE_STALE), which gets translated into ATA DSM TRIM, NVMe
   DEALLOCATE, or SCSI UNMAP, executes in O(1) regardless of the number
   of blocks operated on.
   Due to the ambiguity of ATA DSM TRIM and early SCSI we ended up in a
   situation where the industry applied additional semantics
   (deterministic zeroing) to that particular operation. And that has
   caused grief because devices often end up in the O(n-or-worse)
   bucket when determinism is a requirement.

2. Our expectation for the allocating REQ_ZEROOUT (FL_ZERO_RANGE),
   which gets translated into NVMe WRITE ZEROES or SCSI WRITE SAME, is
   that the command executes in O(n) but that it is faster -- or at
   least not worse -- than doing a regular WRITE to the same block
   range.

3. Our expectation for the deallocating REQ_ZEROOUT (FL_PUNCH_HOLE),
   which gets translated into ATA DSM TRIM w/ whitelist, NVMe WRITE
   ZEROES w/ DEAC, or SCSI WRITE SAME w/ UNMAP, is that the command
   will execute in O(1) for any portion of the block range described by
   the I/O that is aligned to and a multiple of the internal device
   granularity. With an additional small O(n_head_LBs) + O(n_tail_LBs)
   overhead for zeroing any LBs at the beginning and end of the block
   range described by the I/O that do not comprise a full block wrt.
   the internal device granularity.

Does that description make sense?

The problem is that most vendors implement (3) using (1). But they
can't make it work well because (3) was -- and still is for ATA --
outside the scope of what the protocols can express.

And I agree with you that if (3) were implemented correctly in all
devices, we wouldn't need (1) at all. At least not for devices with an
internal granularity << total capacity.

-- 
Martin K. Petersen	Oracle Linux Engineering