RE: [PATCH 5/5] nvme: support for zoned namespaces

> -----Original Message-----
> From: Heiner Litz <hlitz@xxxxxxxx>
> Sent: Thursday, 18 June 2020 22.47
> To: Damien Le Moal <Damien.LeMoal@xxxxxxx>
> Cc: Keith Busch <kbusch@xxxxxxxxxx>; Javier González <javier@xxxxxxxxxxx>;
> Matias Bjørling <mb@xxxxxxxxxxx>; Matias Bjorling
> <Matias.Bjorling@xxxxxxx>; Christoph Hellwig <hch@xxxxxx>; Keith Busch
> <Keith.Busch@xxxxxxx>; linux-nvme@xxxxxxxxxxxxxxxxxxx; linux-
> block@xxxxxxxxxxxxxxx; Sagi Grimberg <sagi@xxxxxxxxxxx>; Jens Axboe
> <axboe@xxxxxxxxx>; Hans Holmberg <Hans.Holmberg@xxxxxxx>; Dmitry
> Fomichev <Dmitry.Fomichev@xxxxxxx>; Ajay Joshi <Ajay.Joshi@xxxxxxx>;
> Aravind Ramesh <Aravind.Ramesh@xxxxxxx>; Niklas Cassel
> <Niklas.Cassel@xxxxxxx>; Judy Brock <judy.brock@xxxxxxxxxxx>
> Subject: Re: [PATCH 5/5] nvme: support for zoned namespaces
> 
> Thanks Damien,
> the striping explanation makes sense. In this case I will rephrase to: It is sufficient
> to support large enough un-splittable writes to achieve full per-zone bandwidth
> with a single writer/single QD.

Hi Heiner,

For ZNS in general, there is no performance information for a zone other than what is communicated per namespace. I.e., in a well-designed ZNS drive, the host should not have to stripe zones to get the full performance of the drive. This is important and was one of the lessons learned from OCSSD. We saw that the main bottlenecks in OCSSD host software implementations were striping, host buffering, and vendor-specific hacks. For ZNS, I wanted to make sure that we did not repeat the same mistake, and as such, that complexity should be managed solely within the ZNS SSD.

If one does want to expose this kind of architecture, for whatever reason, one can make use of Endurance Groups in NVMe to expose groups that are physically separated in the drive. The host can then stripe zones across separate endurance groups to get the necessary performance.
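For illustration, a host can discover which endurance group a namespace belongs to from the ENDGID field (bytes 103:102 of the Identify Namespace data structure, NVMe 1.4). A minimal sketch below uses the admin passthrough ioctl; the controller device path and namespace ID are assumptions on my part, and error handling is kept to a minimum:

/*
 * Minimal sketch (illustrative, not part of this patch set): read the
 * Endurance Group ID (ENDGID) of a namespace via the admin passthrough
 * ioctl. The controller device path and namespace ID are assumptions.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	uint8_t id_ns[4096];
	struct nvme_admin_cmd cmd;
	uint16_t endgid;
	int fd = open("/dev/nvme0", O_RDONLY);	/* controller char device (assumed) */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x06;			/* Identify */
	cmd.nsid = 1;				/* assumed namespace ID */
	cmd.cdw10 = 0x00;			/* CNS 0: Identify Namespace */
	cmd.addr = (uint64_t)(uintptr_t)id_ns;
	cmd.data_len = sizeof(id_ns);

	if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
		perror("NVME_IOCTL_ADMIN_CMD");
		return 1;
	}

	endgid = id_ns[102] | (id_ns[103] << 8);	/* bytes 103:102, little endian */
	printf("nsid 1 endurance group: %u\n", endgid);
	close(fd);
	return 0;
}

A host that insists on striping could then open zones only on namespaces that report different ENDGID values.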

That being said, some vendors have implemented ZNS SSDs as if they were OCSSDs, such that one has to stripe zones together to get the expected performance. For Linux, that is not something that will be supported (unless a device does it the appropriate way by using the standardized endurance groups). Adopters that run custom storage stacks can still make use of it, at the cost of having to manage the same challenges that OCSSD had, i.e., manually managing striping, host buffering, and even vendor-specific hacks.

> 
> My main point is: There is no fundamental reason for splitting up requests
> intermittently just to re-assemble them in the same form later.
> 
> On Wed, Jun 17, 2020 at 10:15 PM Damien Le Moal
> <Damien.LeMoal@xxxxxxx> wrote:
> >
> > On 2020/06/18 13:24, Heiner Litz wrote:
> > > What is the purpose of making zones larger than the erase block size
> > > of flash? And why are large writes fundamentally unreasonable?
> >
> > It is up to the drive vendor to decide how zones are mapped onto flash media.
> > Different mappings give different properties for different use cases.
> > Zones, in many cases, will be much larger than an erase block due to
> > striping across many dies, for example. And erase block size also has
> > a tendency to grow over time with new media generations.
> > The block layer management of zoned block devices also applies to SMR
> > HDDs, which can have any zone size they want. This is not all about flash.
> >
> > As for large writes, they may not be possible due to memory
> > fragmentation and/or limited SGL size of the drive interface. E.g.
> > AHCI maxes out at 168 segments, most HBAs are at best 256, etc.
> >
> > > I don't see why it should be a fundamental problem for e.g. RocksDB
> > > to issue single zone-sized writes (whatever the zone size is because
> > > RocksDB needs to cope with it). The write buffer exists as a level
> > > in DRAM anyways and increasing write latency will not matter either.
> >
> > RocksDB is an application, so of course it is free to issue a single
> > write() call with a buffer size equal to the zone size. But due to the
> > buffer mapping limitations stated above, there is a very high
> > probability that this single zone-sized large write operation will
> > end up being split into multiple write commands in the kernel.
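
[To illustrate the point above: a rough sketch, written by me and not taken from the patches, that compares a zoned namespace's zone size against the largest request the block layer can keep in one piece. The device name and the worst-case assumption of one page per segment are my own assumptions.]

/*
 * Rough sketch: read queue limits and the zone size from sysfs and check
 * whether a zone-sized write can possibly stay un-split.
 */
#include <stdio.h>
#include <unistd.h>

static unsigned long read_ul(const char *path)
{
	FILE *f = fopen(path, "r");
	unsigned long v = 0;

	if (!f || fscanf(f, "%lu", &v) != 1)
		fprintf(stderr, "cannot read %s\n", path);
	if (f)
		fclose(f);
	return v;
}

int main(void)
{
	/* Assumed device; any zoned block device exposes these attributes. */
	unsigned long max_kb   = read_ul("/sys/block/nvme0n1/queue/max_sectors_kb");
	unsigned long segs     = read_ul("/sys/block/nvme0n1/queue/max_segments");
	unsigned long zone_sec = read_ul("/sys/block/nvme0n1/queue/chunk_sectors");
	unsigned long page_sz  = sysconf(_SC_PAGESIZE);

	unsigned long hard_cap = max_kb * 1024;   /* per-request byte limit */
	unsigned long frag_cap = segs * page_sz;  /* worst case: 1 page per segment */
	unsigned long zone_b   = zone_sec * 512;  /* chunk_sectors is in 512B units */

	printf("zone size:                %lu bytes\n", zone_b);
	printf("max request (size cap):   %lu bytes\n", hard_cap);
	printf("max request (fragmented): %lu bytes\n", frag_cap);
	if (zone_b > hard_cap || zone_b > frag_cap)
		printf("-> a zone-sized write gets split into multiple commands\n");
	return 0;
}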
> >
> > >
> > > On Wed, Jun 17, 2020 at 6:55 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
> > >>
> > >> On Wed, Jun 17, 2020 at 04:44:23PM -0700, Heiner Litz wrote:
> > >>> Mandating zone-sized writes would address all problems with ease
> > >>> and reduce request rate and overheads in the kernel.
> > >>
> > >> Yikes, no. Typical zone sizes are much too large for that to be
> > >> reasonable.
> > >
> >
> >
> > --
> > Damien Le Moal
> > Western Digital Research
