Re: [PATCH 5/5] nvme: support for zoned namespaces

On 2020/06/16 22:08, Judy Brock wrote:
> 
> "The on-going re-work of btrfs zone support for instance now relies 100% on
> zone append being supported.... So the approach is: mandate zone append
> support for ZNS devices.... To allow other ZNS drives, an emulation similar
> to SCSI can be implemented, ...  While on a HDD the  performance penalty is
> minimal, it will likely be *significant* on a SSD."
> 
> Wow. Well as I said, I don't know much about Linux but it sounds like the
> ongoing re-work of btrfs zone support mandating zone append should be
> revisited.
> 
> The reality is there will be flavors of ZNS drives in the market that do not
> support Append.  As many of you know, the ZRWA technical proposal is well
> under-way in NVMe ZNS WG.
> 
> Ensuring that the entire Linux zone support ecosystem deliberately locks
> these devices out / or at best consigns them to a severely
> performance-penalized path, especially given the MULTIPLE statements that
> have been made in the NVMe ZNS WG by multiple companies regarding the use
> cases for which Zone Append is an absolute disaster (not my words), seems
> pretty darn inappropriate.

The software design decision is not about locking out one class of devices. It
is about how to deliver high-performance file system implementations for drives
that can actually provide that performance, e.g. SSDs. As I said, mandating that
zone append is always supported by the storage device, either natively or
through emulation, allows an efficient and simple implementation of zone support
at the higher levels, that is, in the device mapper and file system layers.

Without this, the file system has to serialize write commands *and* protect
itself against write command reordering by the block IO stack, as that layer of
the kernel is fully asynchronous and gives no guarantee of any particular
command execution order. This complicates the file system implementation
significantly and so is not acceptable.

For zoned devices, the block layer can provide *write* command execution order
guarantees, similar to what the file system would otherwise need to do. That is
the mq-deadline and zone write locking I was referring to. That is acceptable
for SMR HDDs, but will likely have an impact on SSD write performance (this
needs to be checked).
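
To make the ordering problem concrete, here is a toy sketch in Python. This is
not kernel code: the Zone and ZoneWriteLockScheduler names and their methods are
made up for illustration. It only models the rule that a regular write must land
exactly at the zone write pointer, and a scheduler that keeps at most one write
in flight per zone, which is roughly what zone write locking does.

```python
from collections import deque


class Zone:
    """Toy sequential-write-required zone: a write must start at the write pointer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.wp = 0  # write pointer, in blocks

    def write(self, offset, nblocks):
        # A regular write that does not start at the write pointer is rejected.
        if offset != self.wp or self.wp + nblocks > self.capacity:
            raise IOError(f"unaligned write: offset={offset}, wp={self.wp}")
        self.wp += nblocks


class ZoneWriteLockScheduler:
    """Hypothetical scheduler: one write in flight per zone, in submission order."""

    def __init__(self):
        self.pending = {}    # zone id -> deque of (offset, nblocks)
        self.locked = set()  # zones that currently have a write in flight

    def submit(self, zone_id, offset, nblocks):
        self.pending.setdefault(zone_id, deque()).append((offset, nblocks))

    def dispatch(self):
        """Return the writes that may be issued now: at most one per unlocked zone."""
        issued = []
        for zone_id, queue in self.pending.items():
            if queue and zone_id not in self.locked:
                self.locked.add(zone_id)
                issued.append((zone_id, *queue.popleft()))
        return issued

    def complete(self, zone_id):
        # Completion of the in-flight write unlocks the zone for the next one.
        self.locked.discard(zone_id)
```

In this model, two writes to the same zone never overlap in time, so they cannot
be reordered, but the second write also cannot be issued until the first one
completes. That wait is the per-zone throttling cost.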

Summary: what needs to be done to process sequential write commands correctly
in Linux is the same no matter which layer implements it: writes must be
throttled to at most one write in flight per zone. This can be done by a file
system or by the block layer. Native zone append support by a drive removes this
constraint, simplifies the code and enables high performance. Zone append
emulation in the driver gives the same code simplification overall, but *may*
suffer from the zone write locking penalty.
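
As a rough illustration of why native zone append removes the ordering
constraint, here is a small Python model. It is a sketch under assumptions, not
driver or device code: the Zone class, its append() method and the
append_in_any_order() helper are hypothetical names. The point it shows is that
with append the device chooses the write location and reports it back, so the
issuing order does not matter.

```python
import random


class Zone:
    """Toy zone supporting append only: the device picks where the data lands."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.wp = 0  # write pointer, in blocks

    def append(self, nblocks):
        # The data is placed at the current write pointer, whatever it is,
        # and the start location is returned to the issuer on completion.
        if self.wp + nblocks > self.capacity:
            raise IOError("zone full")
        start = self.wp
        self.wp += nblocks
        return start


def append_in_any_order(sizes, seed=0):
    """Issue one append per size in a shuffled order; return zone and placements."""
    zone = Zone(capacity=sum(sizes))
    order = list(range(len(sizes)))
    random.Random(seed).shuffle(order)  # stand-in for block layer reordering
    placement = {}
    for i in order:
        placement[i] = zone.append(sizes[i])  # request i learns its location here
    return zone, placement
```

Whatever order the appends are processed in, every request succeeds and the
zone is filled sequentially. The issuer only has to record the returned
location, which is why no locking around block allocation and BIO issuing is
needed in this model.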

Overall, we get code simplification at the file system layer, with only a single
area where performance may not be optimal. Any other design choice would result
in a much worse situation:
1) Complex code everywhere, as file systems would have to support both the
regular write path and the zone append write path to cover all classes of
devices.
2) File systems implementing only the zone append write path would end up
rejecting drives that do not have native zone append support.
3) File systems supporting only regular writes would end up with complex code
and potentially degraded write performance for *all* devices.



> 
> 
> 
> 
> 
> -----Original Message-----
> From: linux-nvme [mailto:linux-nvme-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Damien Le Moal
> Sent: Tuesday, June 16, 2020 5:36 AM
> To: Javier González; Matias Bjørling
> Cc: Jens Axboe; Niklas Cassel; Ajay Joshi; Sagi Grimberg; Keith Busch; Dmitry
> Fomichev; Aravind Ramesh; linux-nvme@xxxxxxxxxxxxxxxxxxx;
> linux-block@xxxxxxxxxxxxxxx; Hans Holmberg; Christoph Hellwig; Matias
> Bjorling
> Subject: Re: [PATCH 5/5] nvme: support for zoned namespaces
> 
> On 2020/06/16 21:24, Javier González wrote:
>> On 16.06.2020 14:06, Matias Bjørling wrote:
>>> On 16/06/2020 14.00, Javier González wrote:
>>>> On 16.06.2020 13:18, Matias Bjørling wrote:
>>>>> On 16/06/2020 12.41, Javier González wrote:
>>>>>> On 16.06.2020 08:34, Keith Busch wrote:
>>>>>>> Add support for NVM Express Zoned Namespaces (ZNS) Command Set
>>>>>>> defined in NVM Express TP4053. Zoned namespaces are discovered
>>>>>>> based on their Command Set Identifier reported in the namespaces
>>>>>>> Namespace Identification Descriptor list. A successfully
>>>>>>> discovered Zoned Namespace will be registered with the block
>>>>>>> layer as a host managed zoned block device with Zone Append
>>>>>>> command support. A namespace that does not support append is not
>>>>>>> supported by the driver.
>>>>>> 
>>>>>> Why are we enforcing the append command? Append is optional on the 
>>>>>> current ZNS specification, so we should not make this mandatory in
>>>>>> the implementation. See specifics below.
>>>> 
>>>>> 
>>>>> There is already general support in the kernel for the zone append 
>>>>> command. Feel free to submit patches to emulate the support. It is 
>>>>> outside the scope of this patchset.
>>>>> 
>>>> 
>>>> It is fine that the kernel supports append, but the ZNS specification 
>>>> does not impose the implementation for append, so the driver should
>>>> not do that either.
>>>> 
>>>> ZNS SSDs that choose to leave append as a non-implemented optional 
>>>> command should not rely on emulated SW support, specially when 
>>>> traditional writes work very fine for a large part of current ZNS use 
>>>> cases.
>>>> 
>>>> Please, remove this virtual constraint.
>>> 
>>> The Zone Append command is mandatory for zoned block devices. Please see
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_818709_&d=DwIFAw&c=JfeWlBa6VbDyTXraMENjy_b_0yKWuqQ4qY-FPhxK4x8w-TfgRBDyeV4hVQQBEgL2&r=YJM_QPk2w1CRIo5NNBXnCXGzNnmIIfG_iTRs6chBf6s&m=-fIHWuFYU2GHiTJ2FuhTBgrypPIJW0FjLUWTaK4cH9c&s=kkJ8bJpiTjKS9PoobDPHTf11agXKNUpcw5AsIEyewZk&e=
>>> for the background.
>> 
>> I do not see anywhere in the block layer that append is mandatory for zoned
>> devices. Append is emulated on ZBC, but beyond that there is no mandatory
>> bits. Please explain.
> 
> This is to allow a single write IO path for all types of zoned block device
> for higher layers, e.g file systems. The on-going re-work of btrfs zone
> support for instance now relies 100% on zone append being supported. That
> significantly simplifies the file system support and more importantly remove
> the need for locking around block allocation and BIO issuing, allowing to
> preserve a fully asynchronous write path that can include workqueues for
> efficient CPU usage of things like encryption and compression. Without zone
> append, file system would either (1) have to reject these drives that do not
> support zone append, or (2) implement 2 different write IO path (slower
> regular write and zone append). None of these options are ideal, to say the
> least.
> 
> So the approach is: mandate zone append support for ZNS devices. To allow
> other ZNS drives, an emulation similar to SCSI can be implemented, with that
> emulation ideally combined to work for both types of drives if possible. And
> note that this emulation would require the drive to be operated with
> mq-deadline to enable zone write locking for preserving write command order.
> While on a HDD the performance penalty is minimal, it will likely be
> significant on a SSD.
> 
>> 
>>> Please submit patches if you want to have support for ZNS devices that 
>>> does not implement the Zone Append command. It is outside the scope of
>>> this patchset.
>> 
>> That we will.
>> 
>> 
>> 
>> 
> 
> 


-- 
Damien Le Moal
Western Digital Research



