Re: [LSF/MM/BPF BoF] BoF for Zoned Storage

Himanshu Madhani <himanshu.madhani@xxxxxxxxxx> · Thu, 3 Mar 2022 16:12:58 +0000

> On Mar 2, 2022, at 10:29 PM, Javier González <javier@xxxxxxxxxxx> wrote:
> 
> On 03.03.2022 06:32, Javier González wrote:
>> 
>>> On 3 Mar 2022, at 04.24, Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
>>> 
>>> Thinking proactively about LSFMM, regarding just Zone storage..
>>> 
>>> I'd like to propose a BoF for Zoned Storage. The point of it is
>>> to address the existing pain points we have and take advantage of
>>> having folks in the room, so we can likely settle things faster
>>> that would otherwise take years.
>>> 
>>> I'll throw at least one topic out:
>>> 
>>> * Raw access for zone append for microbenchmarks:
>>>   - are we really happy with the status quo?
>>>   - if not, what outlets do we have?
>>> 
>>> I think the nvme passthrough stuff deserves its own shared
>>> discussion though and should not be made part of the BoF.
>>> 
>>> Luis
>> 
>> Thanks for proposing this, Luis.
>> 
>> I’d like to join this discussion too.
>> 
>> Thanks,
>> Javier
> 
> Let me expand a bit on this. There is one topic that I would like to
> cover in this session:
> 
>  - PO2 zone sizes
>      In the past weeks we have been talking to Damien and Matias around
>      the constraint that we currently have for PO2 zone sizes. While
>      this has not been an issue for SMR HDDs, the gap that ZNS
>      introduces between zone capacity and zone size causes holes in the
>      address space. This unmapped LBA space has been the topic of
>      discussion with several ZNS adopters.
> 
>      One of the things to note here is that even if the zone size is a
>      PO2, the zone capacity is typically not. This means that even when
>      we can use shifts to move around zones, the actual data placement
>      algorithms need to deal with arbitrary sizes. So, at the end of
>      the day, applications that assume a contiguous address space, as
>      in a conventional block device, will have to deal with this.
> 
>      Since chunk_sectors is no longer required to be a PO2, we have
>      started the work on removing this constraint. We are working in
>      two phases:
> 
>        1. Add an emulation layer in the NVMe driver to simulate PO2
>           devices when the HW presents a zone_capacity = zone_size.
>           This is a product of one of Damien's early concerns about
>           supporting existing applications and FSs that work under
>           the PO2 assumption. We will post these patches in the next
>           few days.
> 
>        2. Remove the PO2 constraint from the block layer and add
>           support for arbitrary zone sizes in btrfs. This will allow
>           the raw block device to be presented with arbitrary zone
>           sizes (and capacities), and btrfs will be able to use it
>           natively.
> 
>           For completeness, F2FS works natively with PO2 zone sizes,
>           so we will not do work here for now, as the changes will
>           not bring any benefit. For F2FS, the emulation layer will
>           help use devices that do not have PO2 zone sizes.
> 
>     We are working towards having at least an RFC of (2) before LSF/MM.
>     Since this is a topic that involves several parties across the
>     stack, I believe that a F2F conversation will help lay the path
>     forward.
> 
> Thanks,
> Javier
> 
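
For readers following the PO2 discussion quoted above, here is a minimal
sketch of the arithmetic involved. It is only an illustration, not code
from the kernel or from the patches mentioned: the function names are
hypothetical, sizes are in arbitrary LBA units, and __builtin_ctzll is a
GCC/Clang builtin. It shows that a PO2 zone size lets the zone index be
computed with a shift, that a non-PO2 size needs a division, and that in
either case data placement has to skip the unmapped range between zone
capacity and zone size.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when v is a non-zero power of two. */
static bool is_po2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Zone index of an LBA: a shift for PO2 zone sizes, a division otherwise. */
static uint64_t zone_index(uint64_t lba, uint64_t zone_size)
{
	if (is_po2(zone_size))
		return lba >> __builtin_ctzll(zone_size);
	return lba / zone_size;
}

/* Offset of an LBA inside its zone: a mask for PO2 sizes, a modulo otherwise. */
static uint64_t zone_offset(uint64_t lba, uint64_t zone_size)
{
	if (is_po2(zone_size))
		return lba & (zone_size - 1);
	return lba % zone_size;
}

/* True when the LBA lies in the unmapped gap between capacity and size. */
static bool lba_in_hole(uint64_t lba, uint64_t zone_size, uint64_t zone_cap)
{
	return zone_offset(lba, zone_size) >= zone_cap;
}

/*
 * Map the n-th usable block (holes skipped) to a device LBA. The capacity
 * is usually not a PO2, so this needs a division even when zone_size is.
 */
static uint64_t usable_to_lba(uint64_t n, uint64_t zone_size, uint64_t zone_cap)
{
	return (n / zone_cap) * zone_size + n % zone_cap;
}
```

With an illustrative zone size of 4096 and capacity of 3000, usable block
3000 maps to LBA 4096, the start of the second zone, and LBAs 3000..4095
of each zone are never addressed.
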

I have been working on zoned storage for some time as well, and I would like to be part of this discussion.

Thanks! 

--
Himanshu Madhani	 Oracle Linux Engineering