Re: [LSF/MM/BPF BoF] BoF for Zoned Storage

On Thu, Mar 03, 2022 at 03:22:52PM +0000, Damien Le Moal wrote:
> On 2022/03/03 16:55, Adam Manzanares wrote:
> > On Thu, Mar 03, 2022 at 09:49:06AM +0000, Damien Le Moal wrote:
> >> On 2022/03/03 8:29, Javier González wrote:
> >>> On 03.03.2022 06:32, Javier González wrote:
> >>>>
> >>>>> On 3 Mar 2022, at 04.24, Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> Thinking proactively about LSFMM, regarding just zoned storage...
> >>>>>
> >>>>> I'd like to propose a BoF for Zoned Storage. The point of it is
> >>>>> to address the existing pain points we have and take advantage of
> >>>>> having folks in the room, where we can likely settle things that
> >>>>> would otherwise take years.
> >>>>>
> >>>>> I'll throw at least one topic out:
> >>>>>
> >>>>>  * Raw access for zone append for microbenchmarks:
> >>>>>      - are we really happy with the status quo?
> >>>>>      - if not, what outlets do we have?
> >>>>>
> >>>>> I think the nvme passthrough stuff deserves its own shared
> >>>>> discussion though, and should not be made part of the BoF.
> >>>>>
> >>>>>  Luis
> >>>>
> >>>> Thanks for proposing this, Luis.
> >>>>
> >>>> I’d like to join this discussion too.
> >>>>
> >>>> Thanks,
> >>>> Javier
> >>>
> >>> Let me expand a bit on this. There is one topic that I would like to
> >>> cover in this session:
> >>>
> >>>    - PO2 zone sizes
> >>>        In the past weeks we have been talking to Damien and Matias about
> >>>        the constraint that we currently have for PO2 zone sizes. While
> >>>        this has not been an issue for SMR HDDs, the gap that ZNS
> >>>        introduces between zone capacity and zone size causes holes in the
> >>>        address space. This unmapped LBA space has been the topic of
> >>>        discussion with several ZNS adopters.
> >>>
> >>>        One of the things to note here is that even if the zone size is a
> >>>        PO2, the zone capacity is typically not. This means that even when
> >>>        we can use shifts to move around zones, the actual data placement
> >>>        algorithms need to deal with arbitrary sizes. So at the end of the
> >>>        day, applications that use a contiguous address space - like on a
> >>>        conventional block device - will have to deal with this.
> >>
> >> "the actual data placement algorithms need to deal with arbitrary sizes"
> >>
> >> ???
> >>
> >> No it does not. With zone cap < zone size, the number of sectors that can be
> >> used within a zone may be smaller than the zone size, but:
> >> 1) Writes must still be issued at the WP location, so choosing a zone for
> >> writing data has the same constraint regardless of the zone capacity: do I have
> >> enough usable sectors left in the zone?
> > 
> > Are you saying holes are irrelevant because an application has to know the
> > status of a zone by querying the device before using it, and at that point it
> > should know the start LBA? I see your point here, but we have to assume things
> > to arrive at this conclusion.
> 
> Of course holes are relevant. But their presence does not complicate anything
> because the basic management of zones already has to deal with 2 sector ranges
> in any zone: sectors that have already been written and the ones that have not.
> The "hole" for zone capacity != zone size case is simply another range to be
> ignored.
> 
> And the only thing I am assuming here is that the software has a decent design,
> that is, it is indeed zone aware and manages zones (their state and wp
> position). That does not mean that one needs to do a report zones before every
> IO (well, dumb applications can do that if they want). Zone management is
> initialized using the information from a report zones command, but can then be
> cached in host DRAM in any form suitable for the application.
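
For what it's worth, this is roughly the per-zone bookkeeping I understand you
to be describing - only a sketch, with names of my own choosing rather than
anything from existing code:

/* Hypothetical host-side cache of one zone's state, built once from a
 * report zones pass and updated as writes are issued. */
#include <stdbool.h>
#include <stdint.h>

struct zone_state {
        uint64_t start; /* first LBA of the zone             */
        uint64_t cap;   /* usable sectors (zone capacity)    */
        uint64_t wp;    /* host-managed "soft" write pointer */
};

/* Can nr_sectors still be written to this zone? */
static bool zone_has_room(const struct zone_state *z, uint64_t nr_sectors)
{
        return z->wp + nr_sectors <= z->start + z->cap;
}

/* Advance the soft write pointer once a write has been issued. */
static void zone_advance_wp(struct zone_state *z, uint64_t nr_sectors)
{
        z->wp += nr_sectors;
}
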
> 
> > 
> > Let's think of another scenario where the drive is managed by a user space 
> > application that knows the status of zones and picks a zone because it knows 
> > it is free. To calculate the write offset in terms of LBAs, the application
> > has to account for the difference between zone_size and zone_cap.
> 
> What? This does not make sense. The application simply needs to know the
> current "soft" wp position, issue writes at that position, and increment it
> right away with the number of sectors written. Once that position reaches zone
> cap, the zone is full. The hole behind that can be ignored. What is difficult
> about this? This is zoned storage use 101.

Sounds like you volunteered to teach zoned storage use 101. Can you teach me how
to calculate an LBA offset given a zone number when zone capacity is not equal
to zone size?

The second thing I would like to know is what happens when an application wants
to map an object that spans multiple consecutive zones. Does the application 
have to be aware of the difference between zone capacity and zone size?
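
To make my question concrete, this is the arithmetic I have in mind - a sketch
under my own assumptions (the object is laid out densely over only the usable
sectors of consecutive zones; all names here are made up, not existing code):

/* zone_size is a power of 2, zone_cap in general is not, and the
 * zone_size - zone_cap hole at the end of each zone must be skipped. */
#include <stdint.h>

struct dev_geom {
        uint64_t zone_size;      /* sectors per zone, power of 2      */
        uint64_t zone_cap;       /* usable sectors, usually not a PO2 */
        unsigned int zone_shift; /* log2(zone_size)                   */
};

/* LBA holding sector 'off' of an object that starts at zone 'first_zone'. */
static uint64_t object_sector_to_lba(const struct dev_geom *g,
                                     uint64_t first_zone, uint64_t off)
{
        uint64_t zone = first_zone + off / g->zone_cap; /* a division, not a shift */
        uint64_t in_zone = off % g->zone_cap;           /* a modulo, not a mask    */

        return (zone << g->zone_shift) + in_zone;
}

The zone start is indeed a shift, but the per-object math still divides by
zone_cap, which is the part I am asking about.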

> 
> > My argument is that zone_size is a construct conceived to make a ZNS zone
> > a power of 2, which creates a hole in the LBA space. Applications don't want
> > to deal with the power of 2 constraint and neither do devices. It seems like
> > the existing zoned kernel infrastructure, which made sense for SMR, pushed 
> > this constraint onto devices and onto users. Arguments can be made for where 
> > complexity should lie, but I don't think this decision made things easier for
> > someone to use a ZNS SSD as a block device.
> 
> "Applications don't want to deal with the power of 2 constraint"
> 
> Well, we definitely are not talking to the same users then. Because I heard the
> contrary from users who have actually deployed zoned storage at scale. And there
> is nothing to "deal with" about power of 2. This is not a constraint in itself. A
> particular zone size is the constraint, and on that front, users are indeed never
> satisfied (some want smaller zones, others want bigger zones). So far, a power of 2
> size has been mostly irrelevant or actually required, because everybody understands
> the CPU load benefits of bit shift arithmetic as opposed to CPU-cycle-hungry
> multiplications and divisions.

You are thinking from a kernel perspective; you are potentially pushing
additional multiplications onto users. This should become clear if we learn more
about zoned storage 101 in this thread.
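
To spell out the kernel side of that (toy helpers only, not actual kernel
code):

/* What a PO2 zone size buys in the kernel: sector-to-zone-number becomes a
 * shift instead of a 64-bit division. */
#include <stdint.h>

static inline uint64_t sector_to_zone_po2(uint64_t sector,
                                          unsigned int zone_shift)
{
        return sector >> zone_shift; /* one shift */
}

static inline uint64_t sector_to_zone_any(uint64_t sector, uint64_t zone_size)
{
        return sector / zone_size;   /* one 64-bit division */
}

That saving is real, but it lives in the kernel; an application skipping the
zone_cap/zone_size hole, as in my earlier sketch, pays the division either way.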

> 
> > 
> >> 2) Reading after the WP is not useful (if not outright stupid), regardless of
> >> where the last usable sector in the zone is (at zone start + zone size or at
> >> zone start + zone cap).
> > 
> > Of course, but with PO2 you force useless LBA space even if you fill a zone.
> 
> And my point is: so what? I do not see this as a problem, given that accesses
> must be zone based anyway.
> 
> >> And talking about "use a contiguous address space" is in my opinion nonsense in
> >> the context of zoned storage since by definition, everything has to be managed
> >> using zones as units. The only sensible range for a "contiguous address space"
> >> is "zone start + min(zone cap, zone size)".
> > 
> > Definitely disagree with this given previous arguments. This is a construct 
> > forced upon us because of zoned storage legacy.
> 
> What construct? The zone is the unit. No matter its size, it *must* remain the
> access management unit for the zoned software to be correct. Thinking that one
> can correctly implement a zone compliant application, or any piece of software,
> without managing zones and using them as the storage unit is in my opinion a bad
> design bound to fail.
> 

Forcing a zone to be a power-of-2 size. For NAND, that is something it naturally
is not. Capacity vs. size doesn't solve any real problem other than making ZNS
fit the zoned model that was conceived for HDDs.

> I may be wrong, of course, but I have yet to be proven so by an actual use case.
> 
> > 
> >>
> >>>        Since chunk_sectors is no longer required to be a PO2, we have
> >>>        started the work on removing this constraint. We are working in 2
> >>>        phases:
> >>>
> >>>          1. Add an emulation layer in the NVMe driver to simulate PO2
> >>>             devices when the HW presents zone_capacity = zone_size. This
> >>>             is a product of one of Damien's early concerns about
> >>>             supporting existing applications and FSs that work under the
> >>>             PO2 assumption. We will post these patches in the next few
> >>>             days.
> >>>
> >>>          2. Remove the PO2 constraint from the block layer and add
> >>>             support for arbitrary zone sizes in btrfs. This will allow
> >>>             the raw block device to be presented with arbitrary zone
> >>>             sizes (and capacities), and btrfs will be able to use it
> >>>             natively.
> >>
> >> Zone sizes cannot be arbitrary in btrfs since block groups must be a multiple of
> >> 64K. So constraints remain and should be enforced, at least by btrfs.
> > 
> > I don't think we should base a lot of decisions on the work that has gone into 
> > btrfs. I think it is very promising, but I don't think it is settled that it 
> > is the only way people will consume ZNS SSDs.
> 
> Of course it is not. But not satisfying this constraint essentially disables
> btrfs support. Ever heard of a regular block device that you cannot format with
> ext4 or xfs? It is the same here.
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research



