Re: [LSF/MM/BPF BoF] BoF for Zoned Storage

Luis Chamberlain <mcgrof@xxxxxxxxxx> · Fri, 4 Mar 2022 14:55:38 -0800

On Sat, Mar 05, 2022 at 09:42:57AM +1100, Dave Chinner wrote:
> On Fri, Mar 04, 2022 at 02:10:08PM -0800, Luis Chamberlain wrote:
> > On Fri, Mar 04, 2022 at 11:10:22AM +1100, Dave Chinner wrote:
> > > On Wed, Mar 02, 2022 at 04:56:54PM -0800, Luis Chamberlain wrote:
> > > > Thinking proactively about LSFMM, regarding just Zone storage..
> > > > 
> > > > I'd like to propose a BoF for Zoned Storage. The point of it is
> > > > to address the existing pain points we have and take advantage of
> > > > having folks in the room so we can likely settle on things faster which
> > > > otherwise would take years.
> > > > 
> > > > I'll throw at least one topic out:
> > > > 
> > > >   * Raw access for zone append for microbenchmarks:
> > > >   	- are we really happy with the status quo?
> > > > 	- if not what outlets do we have?
> > > > 
> > > > I think the nvme passthrough stuff deserves its own shared
> > > > discussion though and should not be part of the BoF.
> > > 
> > > Reading through the discussion on this thread, perhaps this session
> > > should be used to educate application developers about how to use
> > > ZoneFS so they never need to manage low level details of zone
> > > storage such as enumerating zones, controlling write pointers
> > > safely for concurrent IO, performing zone resets, etc.
> > 
> > I'm not even sure users are really aware that zone capacity can
> > differ from zone size. btrfs uses zone size to compute size, so the
> > size it reports is a flat out lie.
> 
> Sorry, I don't get how what btrfs does with zone management has
> anything to do with using Zonefs to get direct, raw IO access to
> individual zones.

You are right for direct raw access. My point was that even for
filesystem design I don't think expectations are clearly
communicated. Filesystems have to manage a similar computation,
for instance.

> Direct IO on open zone fds is likely more efficient than
> doing IO through the standard LBA based block device because ZoneFS
> uses iomap_dio_rw() so it only needs to do one mapping operation per
> IO instead of one per page in the IO. Nor does it have to manage
> buffer heads or other "generic blockdev" functionality that direct
> IO access to zoned storage doesn't require.
>
> So whatever you're complaining about that btrfs lies about, does or
> doesn't do is irrelevant - Zonefs was written with the express
> purpose of getting user applications away from needing to directly
> manage zone storage.

I think it ended that way, I can't say it was the goal from the start.
Seems the raw block patches had some support and in the end zonefs
was presented as a possible outlet.

> So if you have special zone IO management
> requirements, work out how they can be supported by zonefs - we
> don't need yet another special purpose direct hardware access API
> for zone storage when we already have a solid solution to the
> problem.

If this is fairly decided, then that's that.

Calling zonefs solid though is a stretch.

> > modprobe null_blk nr_devices=0
> > mkdir /sys/kernel/config/nullb/nullb0
> > echo 0 > /sys/kernel/config/nullb/nullb0/completion_nsec
> > echo 0 > /sys/kernel/config/nullb/nullb0/irqmode
> > echo 2 > /sys/kernel/config/nullb/nullb0/queue_mode
> > echo 1024 > /sys/kernel/config/nullb/nullb0/hw_queue_depth
> > echo 1 > /sys/kernel/config/nullb/nullb0/memory_backed
> > echo 1 > /sys/kernel/config/nullb/nullb0/zoned
> > 
> > echo 128 > /sys/kernel/config/nullb/nullb0/zone_size
> > # 6 zones are implied, we are saying 768 for the full storage size..
> > # but...
> > echo 768 > /sys/kernel/config/nullb/nullb0/size
> > 
> > # If we force capacity to be way less than the zone sizes, btrfs still
> > # uses the zone size to do its data / metadata size computation...
> > echo 32 > /sys/kernel/config/nullb/nullb0/zone_capacity
> 
> Then that's just a btrfs zone support bug where it's used the
> wrong information to size its zones. Why not just send a patch to
> fix it?

This can change the format of already-created filesystems. And so
if this change is welcomed I think we would need to be explicit
about its support.
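
To put rough numbers on the gap, here is a minimal sketch of the
arithmetic using the null_blk values quoted above. It assumes size,
zone_size and zone_capacity are all in MB, as the null_blk configfs
attributes are documented, and the variable names are only for
illustration. It touches no device and says nothing about how btrfs
does its accounting internally.

```shell
#!/bin/sh
# Values copied from the null_blk setup quoted above (assumed to be in MB).
dev_size_mb=768
zone_size_mb=128
zone_capacity_mb=32

# Number of zones implied by the device size and the zone size.
nr_zones=$((dev_size_mb / zone_size_mb))

# Space that can actually be written: each zone only holds its capacity.
usable_mb=$((nr_zones * zone_capacity_mb))

# Size you get if zone size is used for the computation instead.
by_zone_size_mb=$((nr_zones * zone_size_mb))

echo "zones=${nr_zones} by_zone_size=${by_zone_size_mb}MB usable=${usable_mb}MB"
```

So with these settings a computation based on zone size comes out four
times larger than what can be written.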

  Luis