Re: [PATCH 1/2] mkfs: Break block discard into chunks of 2 GB

Eric Sandeen <sandeen@xxxxxxxxxxx> · Tue, 26 Nov 2019 13:40:16 -0600

On 11/22/19 3:30 PM, Eric Sandeen wrote:
> On 11/22/19 3:10 PM, Eric Sandeen wrote:
>> On 11/21/19 5:18 PM, Dave Chinner wrote:
>>> On Thu, Nov 21, 2019 at 10:44:44PM +0100, Pavel Reichl wrote:
>>>> Signed-off-by: Pavel Reichl <preichl@xxxxxxxxxx>
>>>> ---
>>>
>>> This is mixing an explanation about why the change is being made
>>> and what was considered when making decisions about the change.
>>>
>>> e.g. my first questions on looking at the patch were:
>>>
>>> 	- why do we need to break up the discards into 2GB chunks?
>>> 	- why 2GB?
>>> 	- why not use libblkid to query the maximum discard size
>>> 	  and use that as the step size instead?
>>
>> Just wondering, can we trust that to be reasonably performant?
>> (the whole motivation here is for hardware that takes inordinately
>> long to do discard, I wonder if we can count on such hardware to
>> properly fill out this info....)
> 
> Looking at the docs in kernel/Documentation/block/queue-sysfs.rst:
> 
> discard_max_hw_bytes (RO)
> -------------------------
> Devices that support discard functionality may have internal limits on
> the number of bytes that can be trimmed or unmapped in a single operation.
> The discard_max_bytes parameter is set by the device driver to the maximum
> number of bytes that can be discarded in a single operation. Discard
> requests issued to the device must not exceed this limit. A discard_max_bytes
> value of 0 means that the device does not support discard functionality.
> 
> discard_max_bytes (RW)
> ----------------------
> While discard_max_hw_bytes is the hardware limit for the device, this
> setting is the software limit. Some devices exhibit large latencies when
> large discards are issued, setting this value lower will make Linux issue
> smaller discards and potentially help reduce latencies induced by large
> discard operations.
> 
> it seems like a strong suggestion that the discard_max_hw_bytes value may
> still be problematic, and discard_max_bytes can be hand-tuned to something
> smaller if it's a problem.  To me that indicates that discard_max_hw_bytes
> probably can't be trusted to be performant, and presumably discard_max_bytes
> won't be either in that case unless it's been hand-tuned by the admin?

Lukas, Jeff Moyer reminded me that you did a lot of investigation into this
behavior a while back.  Can you shed light on this, particularly how you
chose 2G as the discard granularity for mke2fs?
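
For reference, here is a minimal sketch of the two options under discussion:
a fixed 2 GB step, or a step taken from the queue limit the kernel exports in
sysfs. It is illustrative only and is not the code from the patch. The helper
names (read_discard_max_bytes, discard_chunks) are hypothetical, and whether
discard_max_bytes is a sensible step size is exactly the open question above.

```python
from pathlib import Path

GIB = 1 << 30
DEFAULT_STEP = 2 * GIB  # the 2 GB step from the patch subject


def read_discard_max_bytes(dev_name, sysfs_root="/sys/block"):
    """Return the queue's discard_max_bytes for e.g. 'sda', or None if unreadable."""
    path = Path(sysfs_root) / dev_name / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def discard_chunks(offset, length, step=DEFAULT_STEP):
    """Yield (offset, length) pairs that cover [offset, offset + length) in order."""
    if step <= 0:
        raise ValueError("step must be positive")
    end = offset + length
    while offset < end:
        chunk = min(step, end - offset)
        yield offset, chunk
        offset += chunk


if __name__ == "__main__":
    # Assumed policy for illustration: fall back to 2 GB when the limit is
    # missing or 0 (0 means the device does not support discard).
    step = read_discard_max_bytes("sda") or DEFAULT_STEP
    for off, size in discard_chunks(0, 5 * GIB, step):
        print(off, size)
```

Each (offset, length) pair would then be handed to one discard call, so a slow
device is asked to do a bounded amount of work per request.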

Thanks,
-Eric