On 2020/06/19 7:05, Heiner Litz wrote:
> Matias, Keith,
> thanks, this all sounds good and it makes total sense to hide striping
> from the user.
>
> In the end, the real problem really seems to be that ZNS effectively
> requires in-order IO delivery which the kernel cannot guarantee. I
> think fixing this problem in the ZNS specification instead of in the
> communication substrate (kernel) is problematic, especially as
> out-of-order delivery absolutely has no benefit in the case of ZNS.
> But I guess this has been discussed before..

From the device interface perspective, that is, from the ZNS specification's
point of view, only regular writes require in-order dispatching by the host.
Zone append write commands can be issued in any order and will succeed as
long as there are enough unwritten blocks in the target zone to fit the
append request. Zone append command processing can also happen in any order
the drive sees fit, so there is indeed no guarantee back to the host that
zone append commands will be executed in the same order they were issued.
That is from the interface perspective, for the protocol.

Now the question that I think you are after seems to be "does this work for
the user?". The answer is a simple "it depends on the use case". The device
user is free to choose between issuing regular writes or zone append writes.
This choice heavily depends on the answer to the question: "can I tolerate
out-of-order writes?".

For a file system, the answer is yes, since metadata is used to indicate the
mapping of file offsets to on-disk locations. It does not matter,
functionally speaking, if the file data blocks for increasing file offsets
end up out of order on disk. That can happen today with any file system on
any regular disk due to block allocation/fragmentation.

For an application using raw block device accesses without a file system,
the usability of zone append will heavily depend on the structure/format of
the data being written. A simple logging application where every write to
the device stores a single independent "record" will likely be fine with
zone append. If the application is writing something like a B-tree with
dependencies between data blocks pointing to each other, zone append may
not be the best choice, as the final location on disk of a write is only
approximately known (i.e., one can only guarantee that it will land
"somewhere" in the target zone). That, however, depends on how the
application issues IO requests.

Zone append is not a magic command solving all problems. But it certainly
does simplify a lot of things in the kernel IO stack (no need for strong
ordering) and can also simplify file system implementations (no need to
control write issuing order).
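To make the simple logging case a bit more concrete, below is a rough
sketch of what such a raw block device user could look like on Linux,
issuing zone append through the NVMe passthrough interface. The device
path /dev/nvme0n1, namespace ID 1, the 4096 B block size and the use of
the first zone are illustrative assumptions, the command layout (opcode
0x7d, zone start LBA in cdw10/11, block count in cdw12) follows the ZNS
specification, and error handling is kept minimal. The point is only
that the host never chooses the write location: the drive returns the
LBA it actually used, and the application records it in its own index.

/* zns_append_log.c - sketch of a record logger built on zone append */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

#define ZNS_OPC_ZONE_APPEND 0x7d   /* zone append opcode (ZNS command set) */
#define LBA_SIZE            4096   /* assumed logical block size */

/*
 * Issue one zone append. The host only names the zone (its start LBA);
 * the drive picks the actual write location and returns it in the
 * completion, which the 64-bit passthrough ioctl exposes in cmd.result.
 */
static int zone_append(int fd, uint32_t nsid, uint64_t zslba,
                       void *buf, uint32_t len, uint64_t *written_lba)
{
    struct nvme_passthru_cmd64 cmd;

    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = ZNS_OPC_ZONE_APPEND;
    cmd.nsid     = nsid;
    cmd.addr     = (uint64_t)(uintptr_t)buf;
    cmd.data_len = len;
    cmd.cdw10    = (uint32_t)zslba;          /* zone start LBA, low 32 bits */
    cmd.cdw11    = (uint32_t)(zslba >> 32);  /* zone start LBA, high 32 bits */
    cmd.cdw12    = len / LBA_SIZE - 1;       /* number of blocks, 0-based */

    if (ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd))
        return -1;
    *written_lba = cmd.result;               /* where the record landed */
    return 0;
}

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDWR);   /* assumed ZNS namespace */
    uint64_t zslba = 0;                      /* first zone, for illustration */
    uint64_t index[4];                       /* record number -> on-disk LBA */
    void *rec;

    if (fd < 0 || posix_memalign(&rec, LBA_SIZE, LBA_SIZE))
        return 1;

    for (int i = 0; i < 4; i++) {
        snprintf((char *)rec, LBA_SIZE, "record %d", i);
        if (zone_append(fd, 1, zslba, rec, LBA_SIZE, &index[i]))
            return 1;
        /*
         * Each record is self-contained, so it does not matter in which
         * order the drive executed the appends; the index captures the
         * final location of every record.
         */
        printf("record %d written at LBA %llu\n", i,
               (unsigned long long)index[i]);
    }
    free(rec);
    close(fd);
    return 0;
}

A B-tree style writer runs into exactly the problem described above: a
parent node cannot embed the on-disk location of a child before the
child's append has completed.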
>
> On Thu, Jun 18, 2020 at 2:19 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
>>
>> On Thu, Jun 18, 2020 at 01:47:20PM -0700, Heiner Litz wrote:
>>> the striping explanation makes sense. In this case will rephrase to: It
>>> is sufficient to support large enough un-splittable writes to achieve
>>> full per-zone bandwidth with a single writer/single QD.
>>
>> This is subject to the capabilities of the device and software's memory
>> constraints. The maximum DMA size for a single request that an nvme
>> device can handle often ranges anywhere from 64k to 4MB. The pci nvme
>> driver maxes out at 4MB anyway because that's the most we can guarantee
>> forward progress for right now, otherwise the scatter lists become too
>> big to ensure we'll be able to allocate one to dispatch a write command.
>>
>> We do report the size and the alignment constraints so that it won't get
>> split, but we still have to work with applications that don't abide by
>> those constraints.
>>
>>> My main point is: There is no fundamental reason for splitting up
>>> requests intermittently just to re-assemble them in the same form
>>> later.
>

--
Damien Le Moal
Western Digital Research
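As a side note on the limits Keith mentions, the kernel exposes them
through the block queue sysfs attributes, so an application that wants
to issue un-splittable writes can read them directly. A minimal sketch,
assuming the namespace shows up as nvme0n1 and a kernel recent enough to
report the zone append limit:

/* queue_limits.c - sketch: read the per-request size limits the kernel reports */
#include <stdio.h>
#include <string.h>

/* Print one sysfs queue attribute of the (assumed) namespace nvme0n1. */
static void show(const char *attr)
{
    char path[256], val[64] = "n/a";
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/nvme0n1/queue/%s", attr);
    f = fopen(path, "r");
    if (f) {
        if (fgets(val, sizeof(val), f))
            val[strcspn(val, "\n")] = '\0';
        fclose(f);
    }
    printf("%-24s %s\n", attr, val);
}

int main(void)
{
    show("max_hw_sectors_kb");     /* device/driver DMA limit per request */
    show("max_sectors_kb");        /* current block layer split threshold */
    show("zone_append_max_bytes"); /* largest single zone append payload */
    show("chunk_sectors");         /* zone size; requests must not cross it */
    return 0;
}

Keeping writes within these limits, and aligned so they do not cross a
zone boundary, is essentially what avoids the splitting discussed above.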