Re: [PATCH 5/5] nvme: support for zoned namespaces

On 16.06.2020 12:35, Damien Le Moal wrote:
On 2020/06/16 21:24, Javier González wrote:
On 16.06.2020 14:06, Matias Bjørling wrote:
On 16/06/2020 14.00, Javier González wrote:
On 16.06.2020 13:18, Matias Bjørling wrote:
On 16/06/2020 12.41, Javier González wrote:
On 16.06.2020 08:34, Keith Busch wrote:
Add support for the NVM Express Zoned Namespaces (ZNS) Command Set defined
in NVM Express TP4053. Zoned namespaces are discovered based on their
Command Set Identifier reported in the namespace's Namespace
Identification Descriptor list. A successfully discovered Zoned
Namespace will be registered with the block layer as a host managed
zoned block device with Zone Append command support. A namespace that
does not support append is not supported by the driver.
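For readers following along, here is a rough Python model of the discovery decision described in the patch. It assumes the spec values for the descriptor type (NIDT 0x04 = Command Set Identifier) and the CSI codes (0x00 = NVM, 0x02 = Zoned Namespace). The `NamespaceInfo` structure and its `supports_zone_append` flag are hypothetical stand-ins for what the driver reads from the device; this is not the driver's code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Namespace Identification Descriptor type carrying the Command Set Identifier.
NIDT_CSI = 0x04
# Command Set Identifier values: NVM command set and Zoned Namespace command set.
CSI_NVM = 0x00
CSI_ZNS = 0x02


@dataclass
class NamespaceInfo:
    # Hypothetical: (descriptor type, value) pairs from the descriptor list.
    descriptors: List[Tuple[int, int]]
    # Hypothetical: whether the device reports Zone Append as supported.
    supports_zone_append: bool


def find_csi(descriptors: List[Tuple[int, int]]) -> Optional[int]:
    """Return the CSI value if a CSI descriptor is present, else None."""
    for nidt, value in descriptors:
        if nidt == NIDT_CSI:
            return value
    return None


def classify_namespace(ns: NamespaceInfo) -> str:
    csi = find_csi(ns.descriptors)
    if csi is None or csi == CSI_NVM:
        return "nvm"  # no CSI descriptor: treat as a regular NVM namespace
    if csi == CSI_ZNS:
        # The policy under discussion: zoned only if Zone Append is supported.
        return "zoned" if ns.supports_zone_append else "rejected"
    return "unsupported"  # some other command set
```

The `"rejected"` branch is exactly the behavior being debated in this thread.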

Why are we enforcing the append command? Append is optional in the
current ZNS specification, so we should not make it mandatory in the
implementation. See specifics below.


There is already general support in the kernel for the zone append
command. Feel free to submit patches to emulate the support. It is
outside the scope of this patchset.


It is fine that the kernel supports append, but the ZNS specification
does not mandate implementing append, so the driver should not mandate
it either.

ZNS SSDs that choose to leave append as a non-implemented optional
command should not have to rely on emulated SW support, especially when
traditional writes work well for a large part of current ZNS use
cases.

Please remove this artificial constraint.

The Zone Append command is mandatory for zoned block devices. Please
see https://lwn.net/Articles/818709/ for the background.

I do not see anywhere in the block layer that append is mandatory for
zoned devices. Append is emulated on ZBC, but beyond that there are no
mandatory bits. Please explain.

This is to allow a single write IO path for all types of zoned block devices for
higher layers, e.g. file systems. The on-going rework of btrfs zone support, for
instance, now relies 100% on zone append being supported. That significantly
simplifies the file system support and, more importantly, removes the need for
locking around block allocation and BIO issuing, preserving a fully
asynchronous write path that can include workqueues for efficient CPU usage of
things like encryption and compression. Without zone append, file systems would
either (1) have to reject drives that do not support zone append, or (2)
implement two different write IO paths (slower regular write and zone append). Neither
of these options is ideal, to say the least.
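A toy model of a single zone illustrates the difference. A regular write must land exactly at the zone's write pointer, so writes that get reordered in flight fail; a zone append names only the zone, and the device reports where the data landed, so the host needs no ordering or locking. The `Zone` class and its methods are hypothetical, not a kernel or NVMe API.

```python
class Zone:
    """Toy model of one sequential-write-required zone (hypothetical)."""

    def __init__(self, start_lba: int, capacity: int):
        self.start = start_lba
        self.capacity = capacity
        self.wp = start_lba  # write pointer

    def write(self, lba: int, nblocks: int) -> None:
        # A regular write must target the current write pointer exactly.
        if lba != self.wp:
            raise ValueError(f"unaligned write at {lba}, write pointer is {self.wp}")
        self._advance(nblocks)

    def append(self, nblocks: int) -> int:
        # Zone append targets the zone, not an LBA: the device chooses the
        # location and returns it with the completion.
        lba = self.wp
        self._advance(nblocks)
        return lba

    def _advance(self, nblocks: int) -> None:
        if self.wp + nblocks > self.start + self.capacity:
            raise ValueError("zone full")
        self.wp += nblocks


zone = Zone(start_lba=0, capacity=1024)
# Writes prepared for LBAs 0, 8, 16, but the one for LBA 8 arrives first.
try:
    zone.write(8, 8)
except ValueError as err:
    print("regular write rejected:", err)
# Appends need no host-side ordering; each completion reports its placement.
placements = [zone.append(8) for _ in range(3)]
print(placements)  # [0, 8, 16]
```

The rejected write leaves the write pointer untouched, which is why the appends are placed at 0, 8 and 16.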

So the approach is: mandate zone append support for ZNS devices. To support other
ZNS drives, an emulation similar to SCSI can be implemented, ideally with that emulation
combined to work for both types of drives if possible.

Enforcing QD=1 becomes a problem on devices with large zones. On
a ZNS device with smaller zones, this should not be a problem.

Would you agree that it is possible to have a write path that relies on
QD=1, where the FS / application has the responsibility for enforcing
this? Down the road this QD can be increased if the device is able to
buffer the writes.
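A minimal sketch of a host-enforced QD=1 write path, assuming the FS or application keeps at most one regular write outstanding per zone while different zones proceed in parallel. `PerZoneQD1` and `qd1_bandwidth_bound` are hypothetical names. The helper applies Little's law to show, presumably, why zone size matters here: with one write in flight per zone, the bandwidth bound scales with the number of zones written concurrently, and large zones tend to mean fewer of them.

```python
from collections import defaultdict, deque


class PerZoneQD1:
    """Hypothetical host-side dispatcher: at most one write in flight per zone."""

    def __init__(self):
        self.pending = defaultdict(deque)  # zone id -> queued (lba, nblocks)
        self.inflight = set()              # zones with a write outstanding

    def submit(self, zone: int, lba: int, nblocks: int):
        """Queue a write; return the write to issue now, or None."""
        self.pending[zone].append((lba, nblocks))
        return self._dispatch(zone)

    def complete(self, zone: int):
        """Mark the zone's outstanding write done; return the next one, or None."""
        self.inflight.discard(zone)
        return self._dispatch(zone)

    def _dispatch(self, zone: int):
        # Issue the next queued write only if this zone has none outstanding.
        if zone in self.inflight or not self.pending[zone]:
            return None
        self.inflight.add(zone)
        return (zone, *self.pending[zone].popleft())


def qd1_bandwidth_bound(active_zones: int, io_bytes: int, latency_us: int) -> int:
    """Little's law upper bound in bytes/s with one write in flight per zone."""
    return active_zones * io_bytes * 1_000_000 // latency_us
```

Writes to the same zone are issued strictly one after another, while a write to another zone can be issued immediately.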

I would be OK with some FS implementations relying on append and imposing
the constraint that append has to be supported (and it would be our job
to change that), but I would like to avoid the driver refusing to
initialize the device just because current FS implementations have
implemented this logic.

We can agree that a number of initial customers will use these devices
raw, using the in-kernel I/O path, but without a FS on top.

Thoughts?

and note that
this emulation would require the drive to be operated with mq-deadline to enable
zone write locking for preserving write command order. While on an HDD the
performance penalty is minimal, it will likely be significant on an SSD.
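The emulation discussed here can be sketched roughly as follows. It assumes a host-side cached write pointer and a per-zone lock held for the duration of each write; the `threading.Lock` stands in for the zone write locking that mq-deadline provides in the kernel. The class and the `device_write` callback are hypothetical and only loosely follow the idea of the SCSI emulation, not its code.

```python
import threading


class EmulatedAppendZone:
    """Hypothetical zone append emulation on a device that accepts only
    regular writes at the write pointer."""

    def __init__(self, start_lba, capacity, device_write):
        self.start = start_lba
        self.capacity = capacity
        self.wp = start_lba               # host-side cached write pointer
        self.lock = threading.Lock()      # stand-in for a zone write lock
        self.device_write = device_write  # callable(lba, nblocks): regular write

    def append(self, nblocks):
        # Holding the lock across the device write enforces QD=1 per zone:
        # it keeps the regular writes in order, and it is also the cost.
        with self.lock:
            if self.wp + nblocks > self.start + self.capacity:
                raise ValueError("zone full")
            lba = self.wp
            self.device_write(lba, nblocks)
            self.wp += nblocks
            return lba


issued = []
zone = EmulatedAppendZone(0, 1024, lambda lba, n: issued.append((lba, n)))
threads = [threading.Thread(target=zone.append, args=(8,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(issued)  # [(0, 8), (8, 8), (16, 8), (24, 8)]
```

However the threads are scheduled, the device only ever sees sequential writes, because every append waits for the previous write to the same zone to finish. On an SSD, that serialization point is the cost being raised here.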

Exactly my concern. I do not want ZNS SSDs to be impacted by this type
of design decision at the driver level.

Thanks,
Javier
