On 2020/06/17 15:10, Javier González wrote:
> On 17.06.2020 00:14, Damien Le Moal wrote:
>> On 2020/06/17 0:02, Javier González wrote:
>>> On 16.06.2020 14:42, Damien Le Moal wrote:
>>>> On 2020/06/16 23:16, Javier González wrote:
>>>>> On 16.06.2020 12:35, Damien Le Moal wrote:
>>>>>> On 2020/06/16 21:24, Javier González wrote:
>>>>>>> On 16.06.2020 14:06, Matias Bjørling wrote:
>>>>>>>> On 16/06/2020 14.00, Javier González wrote:
>>>>>>>>> On 16.06.2020 13:18, Matias Bjørling wrote:
>>>>>>>>>> On 16/06/2020 12.41, Javier González wrote:
>>>>>>>>>>> On 16.06.2020 08:34, Keith Busch wrote:
>>>>>>>>>>>> Add support for the NVM Express Zoned Namespaces (ZNS) Command Set defined in NVM Express TP4053. Zoned namespaces are discovered based on their Command Set Identifier reported in the namespace's Namespace Identification Descriptor list. A successfully discovered Zoned Namespace will be registered with the block layer as a host-managed zoned block device with Zone Append command support. A namespace that does not support append is not supported by the driver.
>>>>>>>>>>>
>>>>>>>>>>> Why are we enforcing the append command? Append is optional in the current ZNS specification, so we should not make it mandatory in the implementation. See specifics below.
>>>>>>>>>>
>>>>>>>>>> There is already general support in the kernel for the zone append command. Feel free to submit patches to emulate the support. It is outside the scope of this patchset.
>>>>>>>>>
>>>>>>>>> It is fine that the kernel supports append, but the ZNS specification does not mandate implementing append, so the driver should not do so either.
>>>>>>>>>
>>>>>>>>> ZNS SSDs that choose to leave append as a non-implemented optional command should not have to rely on emulated SW support, especially when traditional writes work just fine for a large part of current ZNS use cases.
>>>>>>>>>
>>>>>>>>> Please, remove this virtual constraint.
>>>>>>>>
>>>>>>>> The Zone Append command is mandatory for zoned block devices. Please see https://lwn.net/Articles/818709/ for the background.
>>>>>>>
>>>>>>> I do not see anywhere in the block layer that append is mandatory for zoned devices. Append is emulated on ZBC, but beyond that there are no mandatory bits. Please explain.
>>>>>>
>>>>>> This is to allow a single write IO path for all types of zoned block devices for the higher layers, e.g. file systems. The ongoing rework of btrfs zone support, for instance, now relies 100% on zone append being supported. That significantly simplifies the file system support and, more importantly, removes the need for locking around block allocation and BIO issuing, preserving a fully asynchronous write path that can include workqueues for efficient CPU usage of things like encryption and compression. Without zone append, a file system would either (1) have to reject the drives that do not support zone append, or (2) implement two different write IO paths (slower regular writes and zone append). Neither of these options is ideal, to say the least.
>>>>>>
>>>>>> So the approach is: mandate zone append support for ZNS devices. To allow other ZNS drives, an emulation similar to SCSI can be implemented, with that emulation ideally working for both types of drives if possible.
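To make the single write IO path argument above more concrete, this is roughly what a write issued through zone append looks like for an in-kernel user such as a file system. This is only a sketch against the bio interface added by the zone append series: the function name is made up, error handling is trimmed, and the caller remains responsible for respecting the queue max_zone_append_sectors limit.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: append one page worth of data to a zone and learn where the
 * device actually wrote it. The caller does not need to track the write
 * pointer or serialize against other writers to the same zone.
 */
static int zone_append_example(struct block_device *bdev, sector_t zone_start,
                               struct page *page, unsigned int len,
                               sector_t *written)
{
        struct bio *bio = bio_alloc(GFP_NOFS, 1);
        int ret;

        bio_set_dev(bio, bdev);
        /* A zone append bio targets the zone start sector, not the write pointer. */
        bio->bi_iter.bi_sector = zone_start;
        bio->bi_opf = REQ_OP_ZONE_APPEND;
        bio_add_page(bio, page, len, 0);

        ret = submit_bio_wait(bio);
        if (!ret)
                /* On completion, the actual write location is returned here. */
                *written = bio->bi_iter.bi_sector;

        bio_put(bio);
        return ret;
}

Compare that with a regular write, where the issuer must know the write pointer in advance and make sure no other write request for the same zone is in flight.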
>>>>>
>>>>> Enforcing QD=1 becomes a problem on devices with large zones. In a ZNS device with smaller zones this should not be a problem.
>>>>
>>>> Let's be precise: this is not running the drive at QD=1, it is "at most one write *request* per zone". If the FS is simultaneously using multiple block groups mapped to different zones, you will get a total write QD > 1, and as many reads as you want.
>>>>
>>>>> Would you agree that it is possible to have a write path that relies on QD=1, where the FS / application has the responsibility for enforcing this? Down the road this QD can be increased if the device is able to buffer the writes.
>>>>
>>>> Doing QD=1 per zone for writes at the FS layer, that is, at the BIO layer, does not work. This is because BIOs can be as large as the FS wants them to be. Such a large BIO will be split into multiple requests in the block layer, resulting in more than one write per zone. That is why the zone write locking is at the scheduler level, between BIO split and request dispatch. That prevents the multiple request fragments of a large BIO from being reordered and failing. This is mandatory, as the block layer itself can occasionally reorder requests, and lower levels such as AHCI HW are notoriously good at reversing sequential requests. For NVMe with multi-queue, the IO issuing process getting rescheduled on a different CPU can result in sequential IOs ending up in different queues, with the likely result of out-of-order execution. All of these cases are avoided with zone write locking and at most one write request dispatched per zone, as recommended by the ZNS specification (the ZBC and ZAC standards for SMR HDDs are silent on this).
>>>>
>>> I understand. I agree that the current FSs supporting ZNS follow this approach and it makes sense that there is a common interface that simplifies the FS implementation. See the comment below on the part where I believe we see things differently.
>>>
>>>>> I would be OK with some FS implementations relying on append and imposing the constraint that append has to be supported (and it would be our job to change that), but I would like to avoid the driver refusing to initialize the device because current FS implementations have implemented this logic.
>>>>
>>>> What is the difference between the driver rejecting drives and the FS rejecting the same drives? That has the same end result to me: an entire class of devices cannot be used as desired by the user. Implementing zone append emulation avoids the rejection entirely while still allowing the FS to have a single write IO path, thus simplifying the code.
>>>
>>> The difference is that users of a raw ZNS device submitting I/O through the kernel would still be able to use these devices. The result would be that the ZNS SSD is recognized and initialized, but the FS format fails.
>>
>> I understand your point of view. Raw ZNS block device access by an application is of course a fine use case. SMR also has plenty of these.
>>
>> My point is that enabling this regular write/raw device use case should not prevent using btrfs or other kernel components that require zone append. Implementing zone append emulation in the NVMe/ZNS driver for devices without native support for the command enables *all* use cases without impacting the use case you are interested in.
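To be concrete about that raw use case: once the namespace is registered as a zoned block device, an application can drive it from user space with regular writes, as long as it tracks the write pointer and keeps at most one write in flight per zone. A rough sketch using the generic zoned block device ioctl follows; the helper name is made up, error handling is omitted, and the device is assumed to be opened with O_DIRECT so the page cache cannot reorder writes.

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

/*
 * Sketch only: write one buffer at the current write pointer of a sequential
 * zone. The application is responsible for keeping at most one write in
 * flight per zone so that the write pointer does not move under it.
 */
int write_at_wp(int fd, __u64 zone_start_sector, const void *buf, size_t len)
{
        struct {
                struct blk_zone_report hdr;
                struct blk_zone zone;
        } rep;
        ssize_t ret;

        memset(&rep, 0, sizeof(rep));
        rep.hdr.sector = zone_start_sector;
        rep.hdr.nr_zones = 1;
        if (ioctl(fd, BLKREPORTZONE, &rep) < 0 || rep.hdr.nr_zones != 1)
                return -1;

        /* Zone report sectors are in 512 B units. */
        ret = pwrite(fd, buf, len, rep.zone.wp << 9);
        return ret == (ssize_t)len ? 0 : -1;
}

This works the same way on SMR disks today and does not depend on zone append at all, which is why adding append emulation for the FS path takes nothing away from it.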
>>
>> This approach is, in my opinion, far better. No one is left out and the user gains a flexible system with different setup capabilities. The user wins here.
>
> So, do you see a path where we enable the following:
>
>    1. We add the emulation layer to the NVMe driver for enabling FSs that currently support zoned devices
>    2. We add a path from user-space (e.g., uring) to enable passthru commands to the NVMe driver to enable a raw ZNS path from the application. This path does not require the device to support append. An initial limitation is that I/Os must be < 127 bio segments (same as append) to avoid bio splits
>    3. As per above, the NVMe driver allows ZNS drives without append support to be initialized and the check moves to the FS formatting.
>
> 2 and 3 are something we have on our end. We need to rebase on top of the patches you guys submitted. 1 is something we can help with after that.
>
> Does the above make sense to you?

Doing (1) first will give you a regular nvme namespace block device that you can use to send passthrough commands with ioctl(). So (1) gives you (2). However, I do not understand what io-uring has to do with passthrough. io-uring being a block layer functionality, I do not think you can use it to send passthrough commands to the driver. I may be wrong, but my understanding is that for NVMe, passthrough is either ioctl() to the device file or the entire driver in user space with SPDK.

As for (3), I do not understand your point. If you have (1), then an FS requiring zone append will work.
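For reference, this is roughly what the ioctl() passthrough path looks like from user space once the namespace block device exists. It is only a sketch: the helper name is made up, error handling is omitted, and a ZNS-specific command instead of a plain write would only change the opcode and command dwords.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

/*
 * Sketch only: send a regular NVMe Write through the passthrough ioctl on the
 * namespace block device (e.g. /dev/nvme0n1), bypassing the block layer IO
 * path. The caller must point slba at the zone write pointer and keep writes
 * sequential per zone. Requires CAP_SYS_ADMIN.
 */
int passthru_write(int fd, __u64 slba, __u16 nlb_0based, void *buf, __u32 buf_len)
{
        struct nvme_passthru_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x01;                      /* NVM command set: Write */
        cmd.nsid = ioctl(fd, NVME_IOCTL_ID);    /* namespace ID of this device node */
        cmd.addr = (__u64)(uintptr_t)buf;
        cmd.data_len = buf_len;
        cmd.cdw10 = (__u32)slba;                /* SLBA, lower 32 bits */
        cmd.cdw11 = (__u32)(slba >> 32);        /* SLBA, upper 32 bits */
        cmd.cdw12 = nlb_0based;                 /* number of logical blocks, 0's based */

        return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}

-- 
Damien Le Moal
Western Digital Research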