> > I understand that you point to ZoneFS for this. It is true that it
> > was presented at the moment as the way to do raw zone access from
> > user-space.
> >
> > However, there are no users of ZoneFS for ZNS devices that I am
> > aware of (maybe for SMR this is a different story). The main
> > open-source implementations out there for RocksDB that are being
> > used in production (ZenFS and xZTL) rely on either raw zone block
> > access or the generic char device in NVMe (/dev/ngXnY).
>
> That's exactly the situation we want to avoid.
>
> You're talking about accessing zoned storage by knowing directly how
> the hardware works and interfacing directly with hardware-specific
> device commands.
>
> This is exactly what is wrong with this whole conversation - direct
> access to hardware is fragile and very limiting, and the whole
> purpose of having an operating system is to abstract the hardware
> functionality into a generally usable API. That way, when something
> new gets added to the hardware or something gets removed, the
> applications don't break, because they weren't written with that
> sort of hardware functionality extension in mind.
>
> I understand that RocksDB probably went direct to the hardware
> because, at the time, it was the only choice the developers had to
> make use of ZNS-based storage. I understand that.
>
> However, I also understand that there are *better options now* that
> allow applications to target zoned storage in a way that doesn't
> expose them to the foibles of hardware support and storage protocol
> specifications and characteristics.
>
> The generic interface that the kernel provides for zoned storage is
> called ZoneFS. Forget about the fact that it is a filesystem; all it
> does is provide userspace with a named zone abstraction for a zoned
> device: every zone is an append-only file.
>
> That's what I'm trying to get across here - this whole discussion
> about zone capacity not matching zone size is a
> hardware/specification detail that applications *do not need to know
> about* to use zoned storage. That's something that zonefs can/does
> hide from applications completely - the zone files behave exactly
> the same from the user perspective regardless of whether the
> hardware zone capacity is the same as or less than the zone size.
>
> Expanding access to the hardware and/or raw block devices so that
> userspace applications can directly manage zone write pointers, zone
> capacity/space limits, etc. is the wrong architectural direction to
> be taking. The sort of *hardware quirks* being discussed in this
> thread need to be managed by the kernel and hidden from userspace;
> userspace shouldn't need to care about such weird and esoteric
> hardware and storage protocol/specification/implementation
> differences.
>
> IMO, while RocksDB is the technology leader for ZNS, it is not the
> model that new applications should be trying to emulate. They should
> be designed from the ground up to use ZoneFS instead of directly
> accessing nvme devices or trying to use the raw block devices for
> zoned storage. Use the generic kernel abstraction for the hardware,
> like applications do for all other things!
>
> > This is because having the capability to do zone management from
> > applications that already work with objects fits much better.
>
> ZoneFS doesn't absolve applications from having to perform zone
> management to pack their objects and garbage collect stale storage
> space. ZoneFS merely provides a generic, file-based,
> hardware-independent API for performing these zone management tasks.
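To make this concrete, here is roughly what application code looks
like against zonefs (the mount point, zone file path, and 4 KiB block
size are made up for the example; zonefs requires direct I/O for
sequential zone writes, hence the aligned buffer):

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLKSZ 4096

int main(void)
{
	struct stat st;
	void *buf;

	/* Zone 0 of the device, exposed as a plain file. */
	int fd = open("/mnt/zonefs/seq/0", O_WRONLY | O_APPEND | O_DIRECT);
	if (fd < 0)
		return 1;

	if (posix_memalign(&buf, BLKSZ, BLKSZ))
		return 1;
	memset(buf, 'x', BLKSZ);

	/* Appends land at the zone write pointer - no write pointer
	 * tracking and no zone capacity vs zone size arithmetic in
	 * the application. */
	if (write(fd, buf, BLKSZ) != BLKSZ)
		return 1;

	/* The file size *is* the write pointer position. */
	fstat(fd, &st);

	/* Resetting the zone is just truncating the file to zero. */
	ftruncate(fd, 0);

	free(buf);
	close(fd);
	return 0;
}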
> > My point is that there is space for both ZoneFS and the raw zoned
> > block device. And regarding !PO2 zone sizes, my point is that this
> > can be leveraged both by btrfs and this raw zone block device.
>
> On that I disagree - any argument that starts with "we need raw
> zoned block device access to ...." is starting from an invalid
> premise. We should be hiding hardware quirks from userspace, not
> exposing them further.
>
> IMO, we want writing zoned-storage-native applications to be simple
> and approachable by anyone who knows how to write to append-only
> files. We do not want such applications to be limited to people who
> have deep and rare expertise in the dark details of, say, largely
> undocumented niche NVMe ZNS specification and protocol quirks.
>
> ZoneFS provides us with a path to the former; what you are
> advocating is the latter....

+ Hans (zenfs/rocksdb author)

Dave, thank you for your great insight. It is a great argument for
why zonefs makes sense. I must admit that Damien has been telling me
this multiple times, but I didn't fully grok the benefits until
seeing it in the light of this thread.

Wrt RocksDB support using ZenFS - while raw block access was the
initial approach, it is very easy to change to use the zonefs API.
Hans has already whipped up a plan for how to do it.
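The translation is roughly the following (illustrative only, not the
actual ZenFS changes):

  raw zoned block device               zonefs
  ----------------------------------------------------------------
  ioctl(fd, BLKRESETZONE, &range)   -> ftruncate(zfd, 0)
  ioctl(fd, BLKFINISHZONE, &range)  -> ftruncate(zfd, max file size)
  write at the zone write pointer   -> write(zfd, ...) with O_APPEND
  ioctl(fd, BLKREPORTZONE, &report) -> fstat(zfd, &st), st_size == wp

Since the zone state can be recovered from plain stat() information,
the zone bookkeeping can map more or less one-to-one onto file
operations.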