Thanks Damien, the striping explanation makes sense. In that case I will
rephrase to: It is sufficient to support large enough un-splittable writes
to achieve full per-zone bandwidth with a single writer/single QD.

My main point is: There is no fundamental reason for splitting up requests
intermittently just to re-assemble them in the same form later.

On Wed, Jun 17, 2020 at 10:15 PM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>
> On 2020/06/18 13:24, Heiner Litz wrote:
> > What is the purpose of making zones larger than the erase block size
> > of flash? And why are large writes fundamentally unreasonable?
>
> It is up to the drive vendor to decide how zones are mapped onto flash media.
> Different mappings give different properties for different use cases. Zones, in
> many cases, will be much larger than an erase block due to striping across many
> dies for example. And erase block size also has a tendency to grow over time
> with new media generations.
> The block layer management of zoned block devices also applies to SMR HDDs,
> which can have any zone size they want. This is not all about flash.
>
> As for large writes, they may not be possible due to memory fragmentation and/or
> limited SGL size of the drive interface. E.g. AHCI maxes out at 168 segments, most
> HBAs are at best 256, etc.
>
> > I don't see why it should be a fundamental problem for e.g. RocksDB to
> > issue single zone-sized writes (whatever the zone size is because
> > RocksDB needs to cope with it). The write buffer exists as a level in
> > DRAM anyways and increasing write latency will not matter either.
>
> RocksDB is an application, so of course it is free to issue a single write()
> call with a buffer size equal to the zone size. But due to the buffer mapping
> limitations stated above, there is a very high probability that this single
> zone-sized large write operation will end up being split into multiple write
> commands in the kernel.
>
> >
> > On Wed, Jun 17, 2020 at 6:55 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Jun 17, 2020 at 04:44:23PM -0700, Heiner Litz wrote:
> >>> Mandating zone-sized writes would address all problems with ease and
> >>> reduce request rate and overheads in the kernel.
> >>
> >> Yikes, no. Typical zone sizes are much too large for that to be
> >> reasonable.
> >
>
>
> --
> Damien Le Moal
> Western Digital Research
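
For reference, a rough back-of-the-envelope sketch of the splitting Damien
describes above. It only uses the segment limits quoted in the thread (168 for
AHCI, 256 for a typical HBA). Everything else is an assumption made for
illustration: a worst case where memory is fragmented so that each SGL segment
covers a single 4 KiB page, and a hypothetical 256 MiB zone. Other limits in
the stack may split a write further, so treat the numbers as indicative only.

```python
def commands_for_write(write_bytes, max_segments, segment_bytes):
    """Number of commands needed for one write when each command can carry
    at most max_segments segments of segment_bytes each."""
    max_bytes_per_command = max_segments * segment_bytes
    # Integer ceiling division, avoids floating point rounding.
    return (write_bytes + max_bytes_per_command - 1) // max_bytes_per_command


PAGE = 4096               # assumption: one 4 KiB page per segment (fragmented memory)
ZONE = 256 * 1024 * 1024  # assumption: hypothetical 256 MiB zone

# Segment limits as quoted in the thread.
for name, max_segments in (("AHCI", 168), ("HBA", 256)):
    print(name, commands_for_write(ZONE, max_segments, PAGE))
```

Under these assumptions a single zone-sized write() cannot map to a single
command on either interface, so how large an "un-splittable" write can be is
bounded by the segment limit rather than by the zone size.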