Re: [PATCH 5/5] nvme: support for zoned namespaces

Thanks for the interesting discussion, but it made me wonder about the
usefulness of 4K writes in the first place. Append seems to be a
workaround for a problem (the single-writer, queue-depth-1 constraint
per zone) that shouldn't exist to begin with. If writes need to be
sequential, what is the purpose of allowing 4K writes at all (they
provide no placement flexibility)? Mandating zone-sized writes would
address all of these problems with ease and reduce the request rate and
per-request overhead in the kernel. I don't see why we would
disassemble a zone-sized block into smaller writes just to re-assemble
them again on the device. A promise of ZNS is to move the translation
overhead from the device into the FS layer, so why re-introduce that
complexity in the bio layer? Managing zone-sized blocks at the
application/FS layer is also much more convenient than receiving back
device-chosen 4K addresses from append commands and having to track
them.
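
To make that bookkeeping concrete, here is a minimal sketch, not part
of the patch set, of a single Zone Append issued through the NVMe
64-bit passthrough ioctl (assuming the driver exposes the namespace at
all). The device path, namespace ID and 4K logical block size are
illustrative assumptions; the point is only that the host submits
against the zone start LBA and learns the actual placement from the
completion, which it then has to record:

/*
 * Sketch only: one ZNS Zone Append (opcode 0x7d) via NVME_IOCTL_IO64_CMD.
 * Device path, nsid and LBA size below are assumptions for illustration.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	int fd = open("/dev/nvme0n2", O_RDWR);	/* example ZNS namespace */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0x5a, 4096);

	uint64_t zslba = 0;			/* start LBA of the target zone */
	struct nvme_passthru_cmd64 cmd = {
		.opcode   = 0x7d,		/* Zone Append */
		.nsid     = 2,			/* example namespace id */
		.addr     = (uintptr_t)buf,
		.data_len = 4096,
		.cdw10    = (uint32_t)zslba,	/* ZSLBA, low dword */
		.cdw11    = (uint32_t)(zslba >> 32),
		.cdw12    = 0,			/* NLB is zero-based: one block */
	};

	if (ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd)) {
		perror("NVME_IOCTL_IO64_CMD");
		return 1;
	}

	/* The device, not the host, decided where the 4K block landed. */
	printf("data landed at LBA %llu\n", (unsigned long long)cmd.result);

	free(buf);
	close(fd);
	return 0;
}

Every one of those returned LBAs has to be remembered by the FS or
application before the data can be read back, which is exactly the
translation work I would rather keep out of the write path.
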
Finally, note that splitting zone-sized bios in the kernel serves no
purpose, since interleaving/scheduling within a zone isn't possible
anyway. If we want to interleave accesses to multiple open zones, that
should be done at the device level by exposing one or more queues per
zone. For applications that write large, consecutive blocks (e.g.,
RocksDB), the best implementation seems to be a kernel path that
guarantees non-splittable, zone-sized writes.
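
For comparison, here is a sketch of the zone-sized path I am arguing
for, using only the existing zoned block ioctls from userspace. The
device path is an assumption, the whole zone is written with a single
pwrite() at the zone start, and two caveats apply: the block layer may
still split the bio internally (which is my point above), and a real
ZNS zone may have a writable capacity smaller than the zone size, which
this sketch ignores:

/*
 * Sketch only: reset a zone and write it with one zone-sized submission.
 * Assumes a zoned block device, zone capacity == zone size, and enough
 * memory for a full zone buffer (a single pwrite() is also capped at
 * about 2 GiB, so a huge zone would need a loop). The kernel may still
 * split the resulting bio.
 */
#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n2";
	uint32_t zone_sectors = 0;	/* zone size in 512B sectors */

	int fd = open(dev, O_WRONLY | O_DIRECT);
	if (fd < 0 || ioctl(fd, BLKGETZONESZ, &zone_sectors) || !zone_sectors) {
		perror("open/BLKGETZONESZ");
		return 1;
	}

	size_t zone_bytes = (size_t)zone_sectors << 9;
	struct blk_zone_range range = {
		.sector = 0,			/* first zone of the device */
		.nr_sectors = zone_sectors,
	};

	/* Rewind the write pointer to the zone start. */
	if (ioctl(fd, BLKRESETZONE, &range)) {
		perror("BLKRESETZONE");
		return 1;
	}

	/* One zone-sized buffer, one submission: no 4K slicing by the host. */
	void *buf;
	if (posix_memalign(&buf, 4096, zone_bytes))
		return 1;
	memset(buf, 0xab, zone_bytes);

	ssize_t ret = pwrite(fd, buf, zone_bytes, (off_t)range.sector << 9);
	if (ret != (ssize_t)zone_bytes)
		fprintf(stderr, "short or failed write: %zd\n", ret);
	else
		printf("wrote one full zone (%zu bytes)\n", zone_bytes);

	free(buf);
	close(fd);
	return ret == (ssize_t)zone_bytes ? 0 : 1;
}

If the kernel could guarantee that such a submission reaches the device
unsplit (or the device exposed a queue per zone), the append-style
per-4K tracking above simply would not be needed for this class of
workloads.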

On Wed, Jun 17, 2020 at 12:40 PM Javier González <javier@xxxxxxxxxxx> wrote:
>
> On 17.06.2020 21:23, Matias Bjørling wrote:
> >On 17/06/2020 21.09, Javier González wrote:
> >>On 17.06.2020 18:55, Matias Bjorling wrote:
> >>>>-----Original Message-----
> >>>>From: Javier González <javier@xxxxxxxxxxx>
> >>>>Sent: Wednesday, 17 June 2020 20.29
> >>>>To: Matias Bjørling <mb@xxxxxxxxxxx>
> >>>>Cc: Christoph Hellwig <hch@xxxxxx>; Keith Busch <Keith.Busch@xxxxxxx>;
> >>>>linux-nvme@xxxxxxxxxxxxxxxxxxx; linux-block@xxxxxxxxxxxxxxx;
> >>>>Damien Le Moal
> >>>><Damien.LeMoal@xxxxxxx>; Matias Bjorling <Matias.Bjorling@xxxxxxx>;
> >>>>Sagi Grimberg <sagi@xxxxxxxxxxx>; Jens Axboe <axboe@xxxxxxxxx>; Hans
> >>>>Holmberg <Hans.Holmberg@xxxxxxx>; Dmitry Fomichev
> >>>><Dmitry.Fomichev@xxxxxxx>; Ajay Joshi <Ajay.Joshi@xxxxxxx>; Aravind
> >>>>Ramesh <Aravind.Ramesh@xxxxxxx>; Niklas Cassel
> >>>><Niklas.Cassel@xxxxxxx>; Judy Brock <judy.brock@xxxxxxxxxxx>
> >>>>Subject: Re: [PATCH 5/5] nvme: support for zoned namespaces
> >>>>
> >>>>On 17.06.2020 19:57, Matias Bjørling wrote:
> >>>>>On 17/06/2020 16.42, Javier González wrote:
> >>>>>>On 17.06.2020 09:43, Christoph Hellwig wrote:
> >>>>>>>On Tue, Jun 16, 2020 at 12:41:42PM +0200, Javier González wrote:
> >>>>>>>>On 16.06.2020 08:34, Keith Busch wrote:
> >>>>>>>>Add support for NVM Express Zoned Namespaces (ZNS) Command Set
> >>>>>>>>defined in NVM Express TP4053. Zoned namespaces are discovered
> >>>>>>>>based on their Command Set Identifier reported in the namespace's
> >>>>>>>>Namespace Identification Descriptor list. A successfully
> >>>>>>>>discovered Zoned Namespace will be registered with the block layer
> >>>>>>>>as a host managed zoned block device with Zone Append command
> >>>>>>>>support. A namespace that does not support append is not supported
> >>>>>>>>by the driver.
> >>>>>>>>
> >>>>>>>>Why are we enforcing the append command? Append is optional in
> >>>>>>>>the current ZNS specification, so we should not make it mandatory
> >>>>>>>>in the implementation. See specifics below.
> >>>>>>>
> >>>>>>>Because Append is the way to go and we've moved the Linux zoned
> >>>>>>>block I/O stack to require it, as should have been obvious to
> >>>>>>>anyone following linux-block in the last few months.  I also have
> >>>>>>>to say I'm really tired of the stupid politics that your company
> >>>>>>>started in the NVMe working group, and will say that these do not
> >>>>>>>matter for Linux development at all.  If you think it is worthwhile
> >>>>>>>to support devices without Zone Append you can contribute support
> >>>>>>>for them on top of this series by porting the SCSI Zone Append
> >>>>>>>Emulation code to NVMe.
> >>>>>>>
> >>>>>>>And I'm not even going to read the rest of this thread as I'm on a
> >>>>>>>vacation that I badly needed because of the Samsung TWG bullshit.
> >>>>>>
> >>>>>>My intention is to support some Samsung ZNS devices that will not
> >>>>>>enable append. I do not think this is an unreasonable thing to do.
> >>>>>>How / why append ended up being an optional feature in the ZNS TP is
> >>>>>>orthogonal to this conversation. Bullshit or not, it ends up on
> >>>>>>devices that we would like to support one way or another.
> >>>>>
> >>>>>I do not believe any of us have said that it is unreasonable to
> >>>>>support. We've only asked that you make the patches for it.
> >>>>>
> >>>>>All of us have communicated why Zone Append is a great addition to
> >>>>>the Linux kernel. Also, as Christoph points out, this has not been a
> >>>>>secret for the past couple of months, and as Martin pointed out, it
> >>>>>has been a wanted feature in the Linux community for the past decade.
> >>>>
> >>>>>
> >>>>>I do want to politely point out that you've got a very clear signal
> >>>>>from the key storage maintainers. Each of them is among the planet's
> >>>>>best and most well-respected software developers, who have literally
> >>>>>built the storage stack that most of the world depends on, the same
> >>>>>stack that recently sent manned rockets into space. They each
> >>>>>unanimously said that the Zone Append command is the right approach
> >>>>>for the Linux kernel to reduce the overhead of I/O tracking for zoned
> >>>>>block devices. It may be worth bringing this information to your
> >>>>>engineering organization, and also potentially considering Zone
> >>>>>Append support for devices that you intend to use with the Linux
> >>>>>kernel storage stack.
> >>>>
> >>>>I understand and I have never said the opposite.
> >>>>
> >>>>Append is a great addition that
> >>>
> >>>One may have interpreted your SDC EMEA talk the opposite way. It was
> >>>not very neutral towards Zone Append, but that is of course the least
> >>>of its problems. But I am happy to hear that you've changed your
> >>>opinion.
> >>
> >>As you are well aware, there are some cases where append introduces
> >>challenges. This is well documented in the literature around nameless
> >>writes.
> >
> >The nameless writes idea is vastly different from Zone Append, which
> >has few of the drawbacks of nameless writes, so that well-documented
> >literature does not apply.
>
> You can have that conversation with your customer base.
>
> >
> >>Part of the talk was on presenting an alternative for these
> >>particular use cases.
> >>
> >>This said, I am not afraid of changing my point of view when I am proven
> >>wrong.
> >>
> >>>
> >>>>we also have been working on for several months (see the patch
> >>>>additions from today). We just have a couple of use cases where
> >>>>append is not required, and I would like to make sure that they are
> >>>>supported.
> >>>>
> >>>>At the end of the day, the only thing I have disagreed on is that
> >>>>the NVMe driver rejects ZNS SSDs that do not support append, as
> >>>>opposed to doing this only when an in-kernel user wants to utilize
> >>>>the drive (e.g., formatting a FS with zoned support). This would
> >>>>allow ioctl() passthru to work _today_ for normal writes.
> >>>>
> >>>>I still believe the above would be a more inclusive solution given
> >>>>the current ZNS specification, but I can see that the general
> >>>>consensus is different.
> >>>
> >>>The comment from the community, including me, is that there is a
> >>>general requirement for the Zone Append command when utilizing zoned
> >>>storage devices. This is similar to implementing an API that one wants
> >>>to support. It is not a general consensus or opinion; it is a hard
> >>>fact of how the Linux kernel source code is implemented at this point.
> >>>One must implement support for ZNS SSDs that do not expose the Zone
> >>>Append command natively. Period.
> >>
> >>Again, I am not saying the opposite. Read the 2 lines below...
> >
> >My point with the above paragraph was to clarify that we are not
> >trying to be difficult or opinionated, but to point out that the reason
> >we give you this specific feedback is that this is how the kernel works
> >today.
>
> Again, yes, we will apply the feedback and come back with an approach
> that fits, so that we can enable the raw ZNS block access that we are
> after.
>
> >
> >>
> >>>>
> >>>>So we will go back, apply the feedback that we got and return with an
> >>>>approach that better fits the ecosystem.
> >>>>
> >>>>>
> >>>>>Another approach is to use SPDK and bypass the Linux kernel. This
> >>>>>might even be an advantage: your customers do not have to wait for a
> >>>>>Linux distribution to ship a long-term release before they can even
> >>>>>get started and deploy in volume. I.e., they will actually get to
> >>>>>market faster, and your company will be able to sell more drives.
> >>>>
> >>>>I think I will refrain from discussing our business strategy on
> >>>>an open mailing
> >>>>list. Appreciate the feedback though. Very insightful.
> >>>
> >>>I am not asking you to discuss your business strategy on the mailing
> >>>list. My comment was to give you genuine advice that may save a lot of
> >>>work, and might even get better results.
> >>>
> >>>>
> >>>>Thanks,
> >>>>Javier
> >>



