Re: Testing devices for discard support properly

On 5/8/19 11:55 PM, Martin K. Petersen wrote:

Dave,

Only when told to do PUNCH_HOLE|NO_HIDE_STALE which means "we don't
care what the device does" as this fallocate command provides no
guarantees for the data returned by subsequent reads. It is,
essentially, a get out of gaol free mechanism for indeterminate
device capabilities.

Correct. But the point of discard is to be a lightweight mechanism to
convey to the device that a block range is no longer in use. Nothing
more, nothing less.

Not everybody wants the device to spend resources handling unwritten
extents. I understand the importance of that use case for XFS but other
users really just need deallocate semantics.
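
For concreteness, a minimal userspace sketch of the "we don't care" path
described above, under the assumption that block-device fallocate()
maps PUNCH_HOLE|KEEP_SIZE|NO_HIDE_STALE through to a plain discard. The
program name and arguments are purely illustrative, not from this
thread:

/* Illustrative only: issue a plain discard on a block device via
 * fallocate(), assuming a kernel that accepts this flag combination
 * on block devices. No guarantee is made about what subsequent reads
 * of the range return. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <blockdev> <offset> <length>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t offset = strtoll(argv[2], NULL, 0);
    off_t length = strtoll(argv[3], NULL, 0);

    /* "We don't care what the device does" with the stale data. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE |
                      FALLOC_FL_NO_HIDE_STALE, offset, length) < 0)
        perror("fallocate(PUNCH_HOLE|KEEP_SIZE|NO_HIDE_STALE)");

    close(fd);
    return 0;
}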

People used to make that assertion about filesystems, too. It took
linux filesystem developers years to realise that unwritten extents
are actually very simple and require very little extra code and no
extra space in metadata to implement. If you are already tracking
allocated blocks/space, then you're 99% of the way to efficient
management of logically zeroed disk space.

I don't disagree. But since the "discard performance" checkmark appears
to be absent from every product requirements document known to man, very
little energy has been devoted to ensuring that discard operations can
coexist with read/write I/O without impeding performance.

I'm not saying it's impossible. Just that so far it hasn't been a
priority. Even large volume customers have been unable to compel their
suppliers to produce a device that doesn't suffer one way or the other.

On the SSD device side, vendors typically try to strike a suitable
balance between what's handled by the FTL and what's handled by
over-provisioning.

2. Our expectation for the allocating REQ_ZEROOUT (FL_ZERO_RANGE), which
    gets translated into NVMe WRITE ZEROES, SCSI WRITE SAME, is that the
    command executes in O(n) but that it is faster -- or at least not
    worse -- than doing a regular WRITE to the same block range.

You're missing the important requirement of fallocate(ZERO_RANGE):
that the space is also allocated and ENOSPC will never be returned
for subsequent writes to that range. i.e. it is allocated but
"unwritten" space that contains zeros.

That's what I implied when comparing it to a WRITE.
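
A small illustrative sketch of that allocating zero-out from userspace,
assuming the usual fallocate(FALLOC_FL_ZERO_RANGE) entry point: on a
regular file the range must end up allocated (so later writes cannot hit
ENOSPC), and on a block device the request is expected to reach the
device as WRITE ZEROES / WRITE SAME. The helper name below is made up:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Zero 'len' bytes at 'off', leaving the range allocated and reading
 * back as zeros. */
static int zero_range(int fd, off_t off, off_t len)
{
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0)
        return 0;

    /* Older kernels/filesystems may not support ZERO_RANGE at all. */
    if (errno == EOPNOTSUPP)
        fprintf(stderr, "ZERO_RANGE not supported here\n");
    else
        perror("fallocate(ZERO_RANGE)");
    return -1;
}

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "testfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    int ret = zero_range(fd, 0, 1 << 20); /* zero the first 1 MiB */
    close(fd);
    return ret ? 1 : 0;
}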

3. Our expectation for the deallocating REQ_ZEROOUT (FL_PUNCH_HOLE),
    which gets translated into ATA DSM TRIM w/ whitelist, NVMe WRITE
    ZEROES w/ DEAC, SCSI WRITE SAME w/ UNMAP, is that the command will
    execute in O(1) for any portion of the block range described by the

FL_PUNCH_HOLE has no O(1) requirement - it has an "all possible space
must be freed" requirement. The larger the range, the longer it will
take.

OK, so maybe my O() notation lacked a media access moniker. What I meant
to convey was that no media writes take place for the properly aligned
multiple of the internal granularity. The FTL update takes however long
it takes, but the only potential media accesses would be the head and
tail pieces. For some types of devices, these might be handled in
translation tables. But for others, zeroing blocks on the media is the
only way to do it.

That's expected, and exactly what filesystems do for sub-block punch
and zeroing ranges.

Yep.
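
To make the head/middle/tail split concrete, here is a toy sketch that
partitions a punch request by a hypothetical internal granularity
'gran': only the aligned middle can be deallocated without touching
media, while the unaligned ends need explicit zeroing. All names and
numbers are illustrative:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static void split_range(uint64_t off, uint64_t len, uint64_t gran)
{
    uint64_t end = off + len;
    uint64_t mid_start = (off + gran - 1) / gran * gran; /* round up   */
    uint64_t mid_end = end / gran * gran;                /* round down */

    if (mid_start >= mid_end) {
        /* Smaller than one granule: the whole range is head/tail. */
        printf("zero    [%" PRIu64 ", %" PRIu64 ")\n", off, end);
        return;
    }
    if (off < mid_start)        /* unaligned head: needs zeroing */
        printf("zero    [%" PRIu64 ", %" PRIu64 ")\n", off, mid_start);
    /* aligned middle: can be deallocated with no media writes */
    printf("dealloc [%" PRIu64 ", %" PRIu64 ")\n", mid_start, mid_end);
    if (mid_end < end)          /* unaligned tail: needs zeroing */
        printf("zero    [%" PRIu64 ", %" PRIu64 ")\n", mid_end, end);
}

int main(void)
{
    /* e.g. a 12 KiB punch starting 512 bytes into a 4 KiB granule */
    split_range(4096 + 512, 3 * 4096, 4096);
    return 0;
}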

What I'm saying is that we should be pushing standards to ensure (3)
is correctly standardised, certified and implemented because that is
what the "Linux OS" requires from future hardware.

That's well-defined for both NVMe and SCSI.

However, I do not agree that a deallocate operation has to imply
zeroing. I think there are valid use cases for pure deallocate.

In an ideal world the performance difference between (1) and (3) would
be negligible and make this distinction moot. However, we have to
support devices that have a wide variety of media and hardware
characteristics. So I don't see pure deallocate going away. That doesn't
mean I am not pushing vendors to handle (3), because I think it is very
important. It is also why we defined WRITE ZEROES in the first place.


All of this makes sense to me.

I think that we can get value out of measuring how close various devices come to realizing the above assumptions. Clearly, file systems (as Chris mentioned) do have to adapt to varying device performance, but today the variation can be orders of magnitude for large (whole device) discards, and that is not something that is easy to tolerate....
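
One rough way to put numbers on that is to time a whole-device discard,
assuming the BLKDISCARD ioctl is an acceptable proxy for what the drive
sees. The sketch below is illustrative only and, obviously, destroys all
data on the device:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <blockdev>   (DESTROYS DATA)\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    uint64_t size;
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
        perror("BLKGETSIZE64");
        return 1;
    }

    uint64_t range[2] = { 0, size };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (ioctl(fd, BLKDISCARD, range) < 0) {
        perror("BLKDISCARD");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("discarded %llu bytes in %.3f s\n", (unsigned long long)size, secs);

    close(fd);
    return 0;
}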

ric


