On 2020/06/17 15:10, Javier González wrote:
> On 17.06.2020 00:14, Damien Le Moal wrote:
>> On 2020/06/17 0:02, Javier González wrote:
>>> On 16.06.2020 14:42, Damien Le Moal wrote:
>>>> On 2020/06/16 23:16, Javier González wrote:
>>>>> On 16.06.2020 12:35, Damien Le Moal wrote:
>>>>>> On 2020/06/16 21:24, Javier González wrote:
>>>>>>> On 16.06.2020 14:06, Matias Bjørling wrote:
>>>>>>>> On 16/06/2020 14.00, Javier González wrote:
>>>>>>>>> On 16.06.2020 13:18, Matias Bjørling wrote:
>>>>>>>>>> On 16/06/2020 12.41, Javier González wrote:
>>>>>>>>>>> On 16.06.2020 08:34, Keith Busch wrote:
>>>>>>>>>>>> Add support for the NVM Express Zoned Namespaces (ZNS) Command Set defined in NVM Express TP4053. Zoned namespaces are discovered based on their Command Set Identifier reported in the namespace's Namespace Identification Descriptor list. A successfully discovered Zoned Namespace will be registered with the block layer as a host-managed zoned block device with Zone Append command support. A namespace that does not support append is not supported by the driver.
>>>>>>>>>>>
>>>>>>>>>>> Why are we enforcing the append command? Append is optional in the current ZNS specification, so we should not make it mandatory in the implementation. See specifics below.
>>>>>>>>>>
>>>>>>>>>> There is already general support in the kernel for the zone append command. Feel free to submit patches to emulate the support. It is outside the scope of this patchset.
>>>>>>>>>
>>>>>>>>> It is fine that the kernel supports append, but the ZNS specification does not mandate implementing append, so the driver should not do so either.
>>>>>>>>>
>>>>>>>>> ZNS SSDs that choose to leave append as a non-implemented optional command should not have to rely on emulated SW support, especially when traditional writes work just fine for a large part of current ZNS use cases.
>>>>>>>>>
>>>>>>>>> Please, remove this virtual constraint.
>>>>>>>>
>>>>>>>> The Zone Append command is mandatory for zoned block devices. Please see https://lwn.net/Articles/818709/ for the background.
>>>>>>>
>>>>>>> I do not see anywhere in the block layer that append is mandatory for zoned devices. Append is emulated on ZBC, but beyond that there are no mandatory bits. Please explain.
>>>>>>
>>>>>> This is to allow a single write IO path for all types of zoned block devices for the higher layers, e.g. file systems. The ongoing rework of btrfs zone support, for instance, now relies 100% on zone append being supported. That significantly simplifies the file system support and, more importantly, removes the need for locking around block allocation and BIO issuing, preserving a fully asynchronous write path that can include workqueues for efficient CPU usage of things like encryption and compression. Without zone append, a file system would either (1) have to reject the drives that do not support zone append, or (2) implement two different write IO paths (slower regular writes and zone append). Neither of these options is ideal, to say the least.
>>>>>>
>>>>>> So the approach is: mandate zone append support for ZNS devices. To allow other ZNS drives, an emulation similar to SCSI can be implemented, with that emulation ideally working for both types of drives if possible.
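To make the single write IO path argument above more concrete, this is roughly what a write issued through zone append looks like for an in-kernel user such as a file system. This is only a sketch against the bio interface added by the zone append series: the function name is made up, error handling is trimmed, and the caller remains responsible for respecting the queue max_zone_append_sectors limit.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: append one page worth of data to a zone and learn where the
 * device actually wrote it. The caller does not need to track the write
 * pointer or serialize against other writers to the same zone.
 */
static int zone_append_example(struct block_device *bdev, sector_t zone_start,
                               struct page *page, unsigned int len,
                               sector_t *written)
{
        struct bio *bio = bio_alloc(GFP_NOFS, 1);
        int ret;

        bio_set_dev(bio, bdev);
        /* A zone append bio targets the zone start sector, not the write pointer. */
        bio->bi_iter.bi_sector = zone_start;
        bio->bi_opf = REQ_OP_ZONE_APPEND;
        bio_add_page(bio, page, len, 0);

        ret = submit_bio_wait(bio);
        if (!ret)
                /* On completion, the actual write location is returned here. */
                *written = bio->bi_iter.bi_sector;

        bio_put(bio);
        return ret;
}

Compare that with a regular write, where the issuer must know the write pointer in advance and make sure no other write request for the same zone is in flight.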
>>>>>
>>>>> Enforcing QD=1 becomes a problem on devices with large zones. In a ZNS device with smaller zones this should not be a problem.
>>>>
>>>> Let's be precise: this is not running the drive at QD=1, it is "at most one write *request* per zone". If the FS is simultaneously using multiple block groups mapped to different zones, you will get a total write QD > 1, and as many reads as you want.
>>>>
>>>>> Would you agree that it is possible to have a write path that relies on QD=1, where the FS / application has the responsibility for enforcing this? Down the road this QD can be increased if the device is able to buffer the writes.
>>>>
>>>> Doing QD=1 per zone for writes at the FS layer, that is, at the BIO layer, does not work. This is because BIOs can be as large as the FS wants them to be. Such a large BIO will be split into multiple requests in the block layer, resulting in more than one write per zone. That is why the zone write locking is at the scheduler level, between BIO split and request dispatch. That prevents the multiple request fragments of a large BIO from being reordered and failing. This is mandatory, as the block layer itself can occasionally reorder requests, and lower levels such as AHCI HW are notoriously good at reversing sequential requests. For NVMe with multi-queue, the IO issuing process getting rescheduled on a different CPU can result in sequential IOs ending up in different queues, with the likely result of out-of-order execution. All of these cases are avoided with zone write locking and at most one write request dispatched per zone, as recommended by the ZNS specification (the ZBC and ZAC standards for SMR HDDs are silent on this).
>>>>
>>> I understand. I agree that the current FSs supporting ZNS follow this approach and it makes sense that there is a common interface that simplifies the FS implementation. See the comment below on the part where I believe we see things differently.
>>>
>>>>> I would be OK with some FS implementations relying on append and imposing the constraint that append has to be supported (and it would be our job to change that), but I would like to avoid the driver refusing to initialize the device because current FS implementations have implemented this logic.
>>>>
>>>> What is the difference between the driver rejecting drives and the FS rejecting the same drives? That has the same end result to me: an entire class of devices cannot be used as desired by the user. Implementing zone append emulation avoids the rejection entirely while still allowing the FS to have a single write IO path, thus simplifying the code.
>>>
>>> The difference is that users of a raw ZNS device submitting I/O through the kernel would still be able to use these devices. The result would be that the ZNS SSD is recognized and initialized, but the FS format fails.
>>
>> I understand your point of view. Raw ZNS block device access by an application is of course a fine use case. SMR also has plenty of these.
>>
>> My point is that enabling this regular write/raw device use case should not prevent using btrfs or other kernel components that require zone append. Implementing zone append emulation in the NVMe/ZNS driver for devices without native support for the command enables *all* use cases without impacting the use case you are interested in.
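To be concrete about that raw use case: once the namespace is registered as a zoned block device, an application can drive it from user space with regular writes, as long as it tracks the write pointer and keeps at most one write in flight per zone. A rough sketch using the generic zoned block device ioctl follows; the helper name is made up, error handling is omitted, and the device is assumed to be opened with O_DIRECT so the page cache cannot reorder writes.

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

/*
 * Sketch only: write one buffer at the current write pointer of a sequential
 * zone. The application is responsible for keeping at most one write in
 * flight per zone so that the write pointer does not move under it.
 */
int write_at_wp(int fd, __u64 zone_start_sector, const void *buf, size_t len)
{
        struct {
                struct blk_zone_report hdr;
                struct blk_zone zone;
        } rep;
        ssize_t ret;

        memset(&rep, 0, sizeof(rep));
        rep.hdr.sector = zone_start_sector;
        rep.hdr.nr_zones = 1;
        if (ioctl(fd, BLKREPORTZONE, &rep) < 0 || rep.hdr.nr_zones != 1)
                return -1;

        /* Zone report sectors are in 512 B units. */
        ret = pwrite(fd, buf, len, rep.zone.wp << 9);
        return ret == (ssize_t)len ? 0 : -1;
}

This works the same way on SMR disks today and does not depend on zone append at all, which is why adding append emulation for the FS path takes nothing away from it.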
>>
>> This approach is, in my opinion, far better. No one is left out and the user gains a flexible system with different setup capabilities. The user wins here.
>
> So, do you see a path where we enable the following:
>
>    1. We add the emulation layer to the NVMe driver for enabling FSs that currently support zoned devices
>    2. We add a path from user-space (e.g., uring) to enable passthru commands to the NVMe driver to enable a raw ZNS path from the application. This path does not require the device to support append. An initial limitation is that I/Os must be < 127 bio segments (same as append) to avoid bio splits
>    3. As per above, the NVMe driver allows ZNS drives without append support to be initialized and the check moves to the FS formatting.
>
> 2 and 3 are something we have on our end. We need to rebase on top of the patches you guys submitted. 1 is something we can help with after that.
>
> Does the above make sense to you?

Doing (1) first will give you a regular nvme namespace block device that you can use to send passthrough commands with ioctl(). So (1) gives you (2). However, I do not understand what io-uring has to do with passthrough. io-uring being a block layer functionality, I do not think you can use it to send passthrough commands to the driver. I may be wrong, but my understanding is that for NVMe, passthrough is either ioctl() to the device file or the entire driver in user space with SPDK.

As for (3), I do not understand your point. If you have (1), then an FS requiring zone append will work.
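For reference, this is roughly what the ioctl() passthrough path looks like from user space once the namespace block device exists. It is only a sketch: the helper name is made up, error handling is omitted, and a ZNS-specific command instead of a plain write would only change the opcode and command dwords.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

/*
 * Sketch only: send a regular NVMe Write through the passthrough ioctl on the
 * namespace block device (e.g. /dev/nvme0n1), bypassing the block layer IO
 * path. The caller must point slba at the zone write pointer and keep writes
 * sequential per zone. Requires CAP_SYS_ADMIN.
 */
int passthru_write(int fd, __u64 slba, __u16 nlb_0based, void *buf, __u32 buf_len)
{
        struct nvme_passthru_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x01;                      /* NVM command set: Write */
        cmd.nsid = ioctl(fd, NVME_IOCTL_ID);    /* namespace ID of this device node */
        cmd.addr = (__u64)(uintptr_t)buf;
        cmd.data_len = buf_len;
        cmd.cdw10 = (__u32)slba;                /* SLBA, lower 32 bits */
        cmd.cdw11 = (__u32)(slba >> 32);        /* SLBA, upper 32 bits */
        cmd.cdw12 = nlb_0based;                 /* number of logical blocks, 0's based */

        return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}

-- 
Damien Le Moal
Western Digital Research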