On 16.06.2020 17:20, Matias Bjørling wrote:
On 16/06/2020 17.02, Javier González wrote:
On 16.06.2020 14:42, Damien Le Moal wrote:
On 2020/06/16 23:16, Javier González wrote:
On 16.06.2020 12:35, Damien Le Moal wrote:
On 2020/06/16 21:24, Javier González wrote:
On 16.06.2020 14:06, Matias Bjørling wrote:
On 16/06/2020 14.00, Javier González wrote:
On 16.06.2020 13:18, Matias Bjørling wrote:
On 16/06/2020 12.41, Javier González wrote:
On 16.06.2020 08:34, Keith Busch wrote:
Add support for the NVM Express Zoned Namespaces (ZNS) Command Set defined in NVM Express TP4053. Zoned namespaces are discovered based on the Command Set Identifier reported in the namespace's Namespace Identification Descriptor list. A successfully discovered zoned namespace is registered with the block layer as a host-managed zoned block device with Zone Append command support. A namespace that does not support Zone Append is not supported by the driver.
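
In rough outline, the behaviour described above amounts to something like the following (a sketch with made-up identifiers, not the actual driver code):

        /* Illustrative sketch only; the identifiers are hypothetical. */
        static int zns_sketch_init_ns(struct sketch_ns *ns)
        {
                /* CSI comes from the Namespace Identification Descriptor list */
                if (ns->csi != SKETCH_CSI_ZNS)
                        return 0;               /* not a zoned namespace */

                /* the driver requires Zone Append and otherwise rejects the ns */
                if (!ns->zone_append_supported)
                        return -EINVAL;

                /* register with the block layer as host-managed zoned */
                return sketch_register_zoned(ns);
        }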
Why are we enforcing the append command? Append is optional in the current ZNS specification, so we should not make it mandatory in the implementation. See specifics below.
There is already general support in the kernel for the zone append command. Feel free to submit patches to emulate the support. It is outside the scope of this patchset.
It is fine that the kernel supports append, but the ZNS specification does not require append to be implemented, so the driver should not require it either.

ZNS SSDs that choose to leave append as a non-implemented optional command should not have to rely on emulated SW support, especially when traditional writes work perfectly well for a large part of current ZNS use cases.

Please remove this artificial constraint.
The Zone Append command is mandatory for zoned block devices. Please
see https://lwn.net/Articles/818709/ for the background.
I do not see anywhere in the block layer that append is mandatory for zoned devices. Append is emulated on ZBC, but beyond that there are no mandatory bits. Please explain.
This is to allow a single write IO path for all types of zoned block device for higher layers, e.g. file systems. The on-going rework of btrfs zone support, for instance, now relies 100% on zone append being supported. That significantly simplifies the file system support and, more importantly, removes the need for locking around block allocation and BIO issuing, allowing a fully asynchronous write path to be preserved that can include workqueues for efficient CPU usage of things like encryption and compression. Without zone append, a file system would either (1) have to reject these drives that do not support zone append, or (2) implement two different write IO paths (slower regular writes and zone append). Neither of these options is ideal, to say the least.

So the approach is: mandate zone append support for ZNS devices. To allow other ZNS drives, an emulation similar to SCSI can be implemented, with that emulation ideally combined to work for both types of drives if possible.
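
To make the "single write path" point concrete, this is roughly what the FS side looks like with zone append (a hedged kernel-side sketch assuming the current block layer interfaces; sketch_record_extent and the context pointer are placeholders):

        #include <linux/bio.h>

        /* Completion handler: with zone append, the device chooses the write
         * location and reports it back in bi_iter.bi_sector. */
        static void sketch_append_end_io(struct bio *bio)
        {
                sector_t written = bio->bi_iter.bi_sector;

                sketch_record_extent(bio->bi_private, written); /* placeholder */
                bio_put(bio);
        }

        static void sketch_append_page(struct block_device *bdev, sector_t zone_start,
                                       struct page *page, unsigned int len, void *ctx)
        {
                struct bio *bio = bio_alloc(GFP_NOFS, 1);

                bio_set_dev(bio, bdev);
                bio->bi_iter.bi_sector = zone_start;    /* zone start, not the write pointer */
                bio->bi_opf = REQ_OP_ZONE_APPEND;
                bio->bi_end_io = sketch_append_end_io;
                bio->bi_private = ctx;
                bio_add_page(bio, page, len, 0);        /* real code must respect the
                                                           device's append size limit */
                submit_bio(bio);
        }

Because the written location only comes back on completion, the FS never has to serialize writes to a zone itself; multiple appends to the same zone can be in flight at once, which is what keeps the write path fully asynchronous.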
Enforcing QD=1 becomes a problem on devices with large zones. In a ZNS device that has smaller zones this should not be a problem.
Let's be precise: this is not running the drive at QD=1, it is "at most one write *request* per zone". If the FS is simultaneously using multiple block groups mapped to different zones, you will get a total write QD > 1, and as many reads as you want.
Would you agree that it is possible to have a write path that relies on QD=1, where the FS / application has the responsibility for enforcing this? Down the road this QD can be increased if the device is able to buffer the writes.
Doing QD=1 per zone for writes at the FS layer, that is, at the BIO layer, does not work. This is because BIOs can be as large as the FS wants them to be. Such a large BIO will be split into multiple requests in the block layer, resulting in more than one write per zone. That is why the zone write locking is at the scheduler level, between BIO split and request dispatch. That prevents the multiple request fragments of a large BIO from being reordered and failing. This is mandatory, as the block layer itself can occasionally reorder requests, and lower levels such as AHCI hardware are also notoriously good at reversing sequential requests. For NVMe with multi-queue, the IO issuing process getting rescheduled on a different CPU can result in sequential IOs ending up in different queues, with the likely result of out-of-order execution. All of these cases are avoided with zone write locking and at most one write request dispatched per zone, as recommended by the ZNS specification (the ZBC and ZAC standards for SMR HDDs are silent on this).
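
Conceptually, what the scheduler-level zone write locking does at dispatch time is along these lines (an illustrative sketch with made-up names, not the actual mq-deadline code):

        /* Illustrative only: pick the next write whose target zone has no write
         * already in flight; the per-zone lock is released on request completion,
         * so at most one write request per zone is ever dispatched. */
        static struct request *sketch_pick_write(struct sketch_sched *sched)
        {
                struct request *rq;

                list_for_each_entry(rq, &sched->write_fifo, queuelist) {
                        if (sketch_zone_is_write_locked(sched, rq))
                                continue;       /* zone busy, try the next request */

                        sketch_zone_write_lock(sched, rq);
                        return rq;
                }
                return NULL;    /* nothing dispatchable right now */
        }

Reads are not held back by this; only writes targeting the same zone are serialized.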
I understand. I agree that the current FSs supporting ZNS follow this approach, and it makes sense that there is a common interface that simplifies the FS implementation. See the comment below on the part where I believe we see things differently.

I would be OK with some FS implementations relying on append and imposing the constraint that append has to be supported (and it would be our job to change that), but I would like to avoid the driver refusing to initialize the device because current FS implementations have implemented this logic.
What is the difference between the driver rejecting drives and the FS rejecting the same drives? That has the same end result to me: an entire class of devices cannot be used as desired by the user. Implementing zone append emulation avoids the rejection entirely while still allowing the FS to have a single write IO path, thus simplifying the code.
The difference is that users who submit I/O to a raw ZNS device through the kernel would still be able to use these devices. The result would be that the ZNS SSD is recognized and initialized, but the FS format fails.

We can agree that a number of initial customers will use these devices raw, through the in-kernel I/O path, but without a FS on top.
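
As an illustration of that raw use: the application can discover the write pointer with the existing zoned block device ioctls and issue plain sequential writes at it, no append required. A minimal user-space sketch (error handling trimmed, the device path is only an example):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/blkzoned.h>

        int main(void)
        {
                int fd = open("/dev/nvme0n2", O_RDWR | O_DIRECT);  /* example path */
                if (fd < 0)
                        return 1;

                /* report the first zone to find its write pointer */
                struct blk_zone_report *rep =
                        calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
                rep->sector = 0;
                rep->nr_zones = 1;
                if (ioctl(fd, BLKREPORTZONE, rep) < 0)
                        return 1;

                struct blk_zone *z = &rep->zones[0];
                printf("zone 0: start %llu wp %llu cond %u\n",
                       (unsigned long long)z->start,
                       (unsigned long long)z->wp, z->cond);

                /* a regular write, placed at the write pointer (512B sector units) */
                void *buf;
                if (posix_memalign(&buf, 4096, 4096))
                        return 1;
                memset(buf, 0xab, 4096);
                if (pwrite(fd, buf, 4096, (off_t)(z->wp << 9)) != 4096)
                        return 1;

                close(fd);
                return 0;
        }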
Thoughts?
And note that this emulation would require the drive to be operated with mq-deadline to enable zone write locking for preserving write command order. While on an HDD the performance penalty is minimal, it will likely be significant on an SSD.
Exactly my concern. I do not want ZNS SSDs to be impacted by this type of design decision at the driver level.
But your proposed FS-level approach would end up doing the exact same thing, with the same limitation and so the same potential performance impact. The block layer generic approach has the advantage that we do not bother the higher levels with the implementation of in-order request dispatch guarantees. File systems are complex enough. The less complexity required for zone support, the better.
This depends very much on how the FS / application is managing striping. At the moment our main use case is enabling user-space applications that submit I/Os to raw ZNS devices through the kernel. Can we enable this use case to start with?
Anyone is free to load kernel modules into the kernel. Those modules may not have the appropriate checks or may rely on the zone append functionality. Having per-use-case limits is a no-go and at best a game of whack-a-mole.
Let's focus on mainline support. We are leaving append disabled based on customer requests for some ZNS products and would like these devices to be supported. This is not at all a corner use case but a very general one.
You already agreed to create a set of patches to add the appropriate support for emulating zone append. As these would fix your specific issue, please go ahead and submit those.
I agreed to solve the use case that some of our customers are enabling, and that is what I am doing.

Again, to start with I would like to have a path where ZNS namespaces are identified independently of append support. Then specific users can require append if they choose to do so. We will of course take care of sending patches for this.
Thanks,
Javier