On 2020/06/19 7:05, Heiner Litz wrote:
> Matias, Keith,
> thanks, this all sounds good and it makes total sense to hide striping
> from the user.
>
> In the end, the real problem really seems to be that ZNS effectively
> requires in-order IO delivery which the kernel cannot guarantee. I
> think fixing this problem in the ZNS specification instead of in the
> communication substrate (kernel) is problematic, especially as
> out-of-order delivery absolutely has no benefit in the case of ZNS.
> But I guess this has been discussed before..

From the device interface perspective, that is, from the ZNS specification's
point of view, only regular writes require in-order dispatching by the host.
Zone append write commands can be issued in any order and will succeed as
long as there are enough unwritten blocks in the target zone to fit the
append request. Zone append command processing can also happen in any order
the drive sees fit, so there is indeed no guarantee back to the host that
zone append commands will be executed in the same order they were issued.
That is from the interface perspective, for the protocol.

Now the question that I think you are after seems to be "does this work for
the user?". The answer is a simple "it depends on the use case". The device
user is free to choose between issuing regular writes or zone append writes.
This choice heavily depends on the answer to the question: "can I tolerate
out-of-order writes?".

For a file system, the answer is yes, since metadata is used to indicate the
mapping of file offsets to on-disk locations. It does not matter,
functionally speaking, if the file data blocks for increasing file offsets
end up out of order on disk. That can happen today with any file system on
any regular disk due to block allocation/fragmentation.

For an application using raw block device accesses without a file system,
the usability of zone append will heavily depend on the structure/format of
the data being written. A simple logging application where every write to
the device stores a single independent "record" will likely be fine with
zone append. If the application is writing something like a B-tree with
dependencies between data blocks pointing to each other, zone append may
not be the best choice, as the final location on disk of a write is only
approximately known (i.e., one can only guarantee that it will land
"somewhere" in the target zone). That, however, depends on how the
application issues IO requests.

Zone append is not a magic command solving all problems. But it certainly
does simplify a lot of things in the kernel IO stack (no need for strong
ordering) and can also simplify file system implementations (no need to
control write issuing order).
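To make the simple logging case a bit more concrete, below is a rough
sketch of what such a raw block device user could look like on Linux,
issuing zone append through the NVMe passthrough interface. The device
path /dev/nvme0n1, namespace ID 1, the 4096 B block size and the use of
the first zone are illustrative assumptions, the command layout (opcode
0x7d, zone start LBA in cdw10/11, block count in cdw12) follows the ZNS
specification, and error handling is kept minimal. The point is only
that the host never chooses the write location: the drive returns the
LBA it actually used, and the application records it in its own index.

/* zns_append_log.c - sketch of a record logger built on zone append */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

#define ZNS_OPC_ZONE_APPEND 0x7d   /* zone append opcode (ZNS command set) */
#define LBA_SIZE            4096   /* assumed logical block size */

/*
 * Issue one zone append. The host only names the zone (its start LBA);
 * the drive picks the actual write location and returns it in the
 * completion, which the 64-bit passthrough ioctl exposes in cmd.result.
 */
static int zone_append(int fd, uint32_t nsid, uint64_t zslba,
                       void *buf, uint32_t len, uint64_t *written_lba)
{
    struct nvme_passthru_cmd64 cmd;

    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = ZNS_OPC_ZONE_APPEND;
    cmd.nsid     = nsid;
    cmd.addr     = (uint64_t)(uintptr_t)buf;
    cmd.data_len = len;
    cmd.cdw10    = (uint32_t)zslba;          /* zone start LBA, low 32 bits */
    cmd.cdw11    = (uint32_t)(zslba >> 32);  /* zone start LBA, high 32 bits */
    cmd.cdw12    = len / LBA_SIZE - 1;       /* number of blocks, 0-based */

    if (ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd))
        return -1;
    *written_lba = cmd.result;               /* where the record landed */
    return 0;
}

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDWR);   /* assumed ZNS namespace */
    uint64_t zslba = 0;                      /* first zone, for illustration */
    uint64_t index[4];                       /* record number -> on-disk LBA */
    void *rec;

    if (fd < 0 || posix_memalign(&rec, LBA_SIZE, LBA_SIZE))
        return 1;

    for (int i = 0; i < 4; i++) {
        snprintf((char *)rec, LBA_SIZE, "record %d", i);
        if (zone_append(fd, 1, zslba, rec, LBA_SIZE, &index[i]))
            return 1;
        /*
         * Each record is self-contained, so it does not matter in which
         * order the drive executed the appends; the index captures the
         * final location of every record.
         */
        printf("record %d written at LBA %llu\n", i,
               (unsigned long long)index[i]);
    }
    free(rec);
    close(fd);
    return 0;
}

A B-tree style writer runs into exactly the problem described above: a
parent node cannot embed the on-disk location of a child before the
child's append has completed.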
>
> On Thu, Jun 18, 2020 at 2:19 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
>>
>> On Thu, Jun 18, 2020 at 01:47:20PM -0700, Heiner Litz wrote:
>>> the striping explanation makes sense. In this case will rephrase to: It
>>> is sufficient to support large enough un-splittable writes to achieve
>>> full per-zone bandwidth with a single writer/single QD.
>>
>> This is subject to the capabilities of the device and software's memory
>> constraints. The maximum DMA size for a single request that an nvme
>> device can handle often ranges anywhere from 64k to 4MB. The pci nvme
>> driver maxes out at 4MB anyway because that's the most we can guarantee
>> forward progress for right now, otherwise the scatter lists become too
>> big to ensure we'll be able to allocate one to dispatch a write command.
>>
>> We do report the size and the alignment constraints so that it won't get
>> split, but we still have to work with applications that don't abide by
>> those constraints.
>>
>>> My main point is: There is no fundamental reason for splitting up
>>> requests intermittently just to re-assemble them in the same form
>>> later.
>

--
Damien Le Moal
Western Digital Research
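As a side note on the limits Keith mentions, the kernel exposes them
through the block queue sysfs attributes, so an application that wants
to issue un-splittable writes can read them directly. A minimal sketch,
assuming the namespace shows up as nvme0n1 and a kernel recent enough to
report the zone append limit:

/* queue_limits.c - sketch: read the per-request size limits the kernel reports */
#include <stdio.h>
#include <string.h>

/* Print one sysfs queue attribute of the (assumed) namespace nvme0n1. */
static void show(const char *attr)
{
    char path[256], val[64] = "n/a";
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/nvme0n1/queue/%s", attr);
    f = fopen(path, "r");
    if (f) {
        if (fgets(val, sizeof(val), f))
            val[strcspn(val, "\n")] = '\0';
        fclose(f);
    }
    printf("%-24s %s\n", attr, val);
}

int main(void)
{
    show("max_hw_sectors_kb");     /* device/driver DMA limit per request */
    show("max_sectors_kb");        /* current block layer split threshold */
    show("zone_append_max_bytes"); /* largest single zone append payload */
    show("chunk_sectors");         /* zone size; requests must not cross it */
    return 0;
}

Keeping writes within these limits, and aligned so they do not cross a
zone boundary, is essentially what avoids the splitting discussed above.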