On 3/5/22 05:12, Luis Chamberlain wrote:
> On Thu, Mar 03, 2022 at 09:33:06PM +0000, Matias Bjørling wrote:
>>> -----Original Message-----
>>> From: Adam Manzanares <a.manzanares@xxxxxxxxxxx>
>>>
>>> However, an end-user application should not (in my opinion) have to
>>> deal with this. It should use helper functions from a library that
>>> provides the appropriate abstraction to the application, such that
>>> applications don't have to care about either the specific zone
>>> capacity/size or multiple resets. This is similar to how file systems
>>> work with file system semantics. For example, a file can span
>>> multiple extents on disk, but all an application sees is the file
>>> semantics.
>>>
>>> I don't want to go so far as to say what the end user application
>>> should and should not do.
>>
>> Consider it a best practice example. Another typical example is that
>> one should avoid extensive flushes to disk if the application doesn't
>> need persistence for each I/O it issues.
>
> Although I was sad to see there was no raw access to a zoned block
> storage device, the above makes me kind of happy that this is the case
> today. Why? Because there is an implicit requirement on the management
> of data on zoned storage devices beyond that of regular SSDs, and if
> it is not considered and *very well documented*, in agreement with us
> all, we can end up with folks slightly surprised by these requirements.
>
> An application today can't directly manage these objects so that's not
> even possible today. And in fact it's not even clear if / how we'll get
> there. See include/uapi/linux/blkzoned.h.

I really do not understand what you are talking about.

And yes, there is not much in terms of documentation under
Documentation/. Patches welcome. We do have things documented here
though:

https://zonedstorage.io/docs/linux/zbd-api

> So in the meantime, if an application wants anything as close as
> possible to the block layer, the only way to access zones directly is
> through the VFS, through zonefs. I can hear people cringing even if
> you are miles away. If we want an improvement upon this, whatever API
> we come up with we *must* clearly embrace and document the
> requirements / responsibilities above.
>
> From what I read, the unmapped LBA problem can be observed as a
> non-problem *iff* users are willing to deal with the above. We seem to
> have a disagreement on the expectation from users.

Again, how one can implement an application doing raw zoned block
device accesses without managing zones correctly is unknown to me. It
seems to me that you are thinking of an application design model that I
do not see/understand. Care to elaborate? (A quick sketch of what I
mean by zone management is further down.)

> Anyway, there are two aspects to what Javier was mentioning and I
> think it is *critical* to separate them:
>
> a) emulation should be possible given the nature of NAND

The need for emulation has nothing to do with the media type.
Specifications *never* talk about a specific media type. ZBC/ZAC,
similarly to ZNS, do not mandate any requirement on zone size.

> b) The PO2 requirement exists, is / should it exist forever?

Not necessarily. But since it is the case right now, any change must
ensure that existing user space does not break nor regress (in
performance).
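
To be concrete about what "managing zones" through the raw block device
interface means, here is a minimal, untested sketch that reports the
first few zones of a device using the include/uapi/linux/blkzoned.h
UAPI mentioned above. The device path is only an example.

/*
 * Minimal sketch: report the first few zones of a zoned block device
 * using BLKREPORTZONE from include/uapi/linux/blkzoned.h.
 * Error handling is reduced to the bare minimum.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n2"; /* example path */
	unsigned int nr = 8;
	struct blk_zone_report *rep;
	int fd, i;

	fd = open(dev, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
	if (!rep) {
		close(fd);
		return 1;
	}
	rep->sector = 0;	/* start reporting from the first zone */
	rep->nr_zones = nr;

	if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
		perror("BLKREPORTZONE");
		free(rep);
		close(fd);
		return 1;
	}

	for (i = 0; i < (int)rep->nr_zones; i++) {
		struct blk_zone *z = &rep->zones[i];

		/*
		 * start/len/wp are in 512B sectors; cond tells whether the
		 * zone is usable (read-only or offline zones must be skipped).
		 */
		printf("zone %d: start %llu len %llu wp %llu cond 0x%x\n",
		       i, (unsigned long long)z->start,
		       (unsigned long long)z->len,
		       (unsigned long long)z->wp, z->cond);
	}

	free(rep);
	close(fd);
	return 0;
}

Note that nothing here depends on the zone size being a power of 2: the
application simply works with whatever start/len the report gives it.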
>
> The discussion around these two drew in a third aspect:
>
> c) Applications which want to deal with LBAs directly on NVMe ZNS
> drives must be aware of the ZNS design and deal with it directly or
> indirectly in light of the unmapped LBAs which are caused by the
> differences between zone size and zone capacity, how objects can span
> multiple zones, zone resets, etc.

That is not really special to ZNS. ZBC/ZAC SMR HDDs also need that
management since zones can go offline or read-only (in ZNS too). That
is actually the main reason why applications *must* manage accesses per
zone. Otherwise, correct IO error recovery is impossible.

> I think a) is easier to swallow and accept provided there is no impact
> on existing users. b) and c) are things which I think could be
> elaborated a bit more at LSFMM through community dialog.
>
> Luis

-- 
Damien Le Moal
Western Digital Research
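
PS: to illustrate the per-zone management mentioned for c), below is a
minimal, untested sketch of the checks an application has to make
before writing: skip offline/read-only zones, and never write past
start + capacity, since the LBAs between capacity and size are
unmapped. It assumes a kernel recent enough to report zone capacity
(the capacity field / BLK_ZONE_REP_CAPACITY flag in blkzoned.h).

/*
 * Sketch only: decide whether a zone can take a sequential write and
 * how many sectors remain before the unmapped LBAs at the end of it.
 * Conventional zones (cond == BLK_ZONE_COND_NOT_WP) have no write
 * pointer and would need separate handling.
 */
#include <stdbool.h>
#include <linux/blkzoned.h>

static bool zone_is_writable(const struct blk_zone *z)
{
	/*
	 * Offline and read-only zones must simply be skipped: this is
	 * the per-zone management without which correct IO error
	 * recovery is impossible.
	 */
	if (z->cond == BLK_ZONE_COND_OFFLINE ||
	    z->cond == BLK_ZONE_COND_READONLY ||
	    z->cond == BLK_ZONE_COND_FULL)
		return false;
	return true;
}

static __u64 zone_sectors_left(const struct blk_zone *z)
{
	/*
	 * Writes go at the write pointer and must stop at
	 * start + capacity, not start + len: the LBAs in between are
	 * unmapped.
	 */
	__u64 wr_end = z->start + z->capacity;

	if (!zone_is_writable(z) || z->wp >= wr_end)
		return 0;
	return wr_end - z->wp;
}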