On 2020/06/18 13:24, Heiner Litz wrote:
> What is the purpose of making zones larger than the erase block size
> of flash? And why are large writes fundamentally unreasonable?

It is up to the drive vendor to decide how zones are mapped onto flash
media. Different mappings give different properties for different use
cases. Zones will in many cases be much larger than an erase block,
for example due to striping across many dies. Erase block sizes also
tend to grow with new media generations. And the block layer
management of zoned block devices also applies to SMR HDDs, which can
have any zone size they want. This is not all about flash.

As for large writes, they may not be possible due to memory
fragmentation and/or the limited SGL size of the drive interface. E.g.
AHCI maxes out at 168 segments, most HBAs are at best 256, etc. (a
back-of-the-envelope sketch of this arithmetic is appended at the end
of this mail).

> I don't see why it should be a fundamental problem for e.g. RocksDB to
> issue single zone-sized writes (whatever the zone size is because
> RocksDB needs to cope with it). The write buffer exists as a level in
> DRAM anyways and increasing write latency will not matter either.

RocksDB is an application, so of course it is free to issue a single
write() call with a buffer size equal to the zone size. But due to the
buffer mapping limitations stated above, there is a very high
probability that this single zone-sized write operation will end up
being split into multiple write commands in the kernel (the second
sketch appended below shows the queue limits that determine the
split).

>
> On Wed, Jun 17, 2020 at 6:55 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
>>
>> On Wed, Jun 17, 2020 at 04:44:23PM -0700, Heiner Litz wrote:
>>> Mandating zone-sized writes would address all problems with ease and
>>> reduce request rate and overheads in the kernel.
>>
>> Yikes, no. Typical zone sizes are much too large for that to be
>> reasonable.


--
Damien Le Moal
Western Digital Research
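
Appended sketch 1: the back-of-the-envelope arithmetic referred to
above. In the worst case a fragmented user buffer maps one page per
SGL segment, so a single command can carry at most max_segments pages.
The 4 KiB page size and the 256 MiB zone size are example values
picked for illustration only, not anything mandated by a spec.

/*
 * Worst case: a fragmented user buffer maps one 4 KiB page per SGL
 * segment, so a single command carries at most max_segments pages.
 * The 256 MiB zone size below is a made-up example value.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;           /* 4 KiB pages (example) */
	const unsigned long zone_size = 256UL << 20;    /* hypothetical 256 MiB zone */
	const unsigned long ahci_segs = 168;            /* AHCI segment limit */
	const unsigned long hba_segs = 256;             /* typical HBA limit */

	unsigned long ahci_max = ahci_segs * page_size; /* 672 KiB per command */
	unsigned long hba_max = hba_segs * page_size;   /* 1 MiB per command */

	printf("AHCI worst-case command size: %lu KiB\n", ahci_max >> 10);
	printf("HBA worst-case command size:  %lu KiB\n", hba_max >> 10);
	printf("Commands per zone write (AHCI): %lu\n",
	       (zone_size + ahci_max - 1) / ahci_max);
	printf("Commands per zone write (HBA):  %lu\n",
	       (zone_size + hba_max - 1) / hba_max);
	return 0;
}

So even with this optimistic accounting, one zone-sized write turns
into hundreds of commands on AHCI.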
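
Appended sketch 2: where an application can see the limits that bound
each command, namely the request queue attributes in sysfs. The device
name "sdb" is a placeholder; max_sectors_kb, max_segments and
chunk_sectors (the zone size, in 512 B sectors, on zoned devices) are
the relevant attributes.

/*
 * Read the request queue limits that bound a single write command.
 * "sdb" is a placeholder device name. On zoned block devices,
 * chunk_sectors reports the zone size in 512-byte sectors.
 */
#include <stdio.h>

static unsigned long read_attr(const char *attr)
{
	char path[256];
	unsigned long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/sdb/queue/%s", attr);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%lu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long max_kb = read_attr("max_sectors_kb");
	unsigned long segs = read_attr("max_segments");
	unsigned long zone_kb = read_attr("chunk_sectors") / 2;

	printf("max request: %lu KiB, max segments: %lu\n", max_kb, segs);
	if (zone_kb && max_kb)
		printf("zone size %lu KiB -> at least %lu requests per zone write\n",
		       zone_kb, (zone_kb + max_kb - 1) / max_kb);
	return 0;
}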