On 07.03.2022 10:27, Matias Bjørling wrote:
> I understand that you point to ZoneFS for this. It is true that it
> was presented at the moment as the way to do raw zone access from
> user-space.
>
> However, there is no users of ZoneFS for ZNS devices that I am aware
> of (maybe for SMR this is a different story). The main open-source
> implementations out there for RocksDB that are being used in
> production (ZenFS and xZTL) rely on either raw zone block access or
> the generic char device in NVMe (/dev/ngXnY).

That's exactly the situation we want to avoid. You're talking about accessing zoned storage by knowing directly how the hardware works and interfacing directly with hardware-specific device commands.

This is exactly what is wrong with this whole conversation - direct access to hardware is fragile and very limiting, and the whole purpose of having an operating system is to abstract the hardware functionality into a generally usable API. That way, when something new gets added to the hardware or something gets removed, the applications don't break, even though they weren't written with that sort of hardware functionality change in mind.

I understand that RocksDB probably went direct to the hardware because, at the time, it was the only choice the developers had to make use of ZNS based storage. I understand that. However, I also understand that there are *better options now* that allow applications to target zoned storage in a way that doesn't expose them to the foibles of hardware support and storage protocol specifications and characteristics.

The generic interface that the kernel provides for zoned storage is called ZoneFS. Forget about the fact that it is a filesystem; all it does is provide userspace with a named zone abstraction for a zoned device: every zone is an append-only file.

That's what I'm trying to get across here - this whole discussion about zone capacity not matching zone size is a hardware/specification detail that applications *do not need to know about* to use zoned storage. That's something that ZoneFS can/does hide from applications completely - the zone files behave exactly the same from the user perspective regardless of whether the hardware zone capacity is the same as or less than the zone size.

Expanding access to the hardware and/or raw block devices so that userspace applications can directly manage zone write pointers, zone capacity/space limits, etc. is the wrong architectural direction to be taking. The sort of *hardware quirks* being discussed in this thread need to be managed by the kernel and hidden from userspace; userspace shouldn't need to care about such weird and esoteric hardware and storage protocol/specification/implementation differences.

IMO, while RocksDB is the technology leader for ZNS, it is not the model that new applications should be trying to emulate. They should be designed from the ground up to use ZoneFS instead of directly accessing NVMe devices or trying to use the raw block devices for zoned storage. Use the generic kernel abstraction for the hardware, like applications do for everything else!

> This is because having the capability to do zone management from
> applications that already work with objects fits much better.

ZoneFS doesn't absolve applications from having to perform zone management to pack their objects and garbage collect stale storage space. ZoneFS merely provides a generic, file-based, hardware-independent API for performing these zone management tasks.

> My point is that there is space for both ZoneFS and raw zoned block
> device. And regarding !PO2 zone sizes, my point is that this can be
> leveraged both by btrfs and this raw zone block device.

On that I disagree - any argument that starts with "we need raw zoned block device access to ...." is starting from an invalid premise. We should be hiding hardware quirks from userspace, not exposing them further.

IMO, we want writing zone-storage-native applications to be simple and approachable by anyone who knows how to write to append-only files. We do not want such applications to be limited to people who have deep and rare expertise in the dark details of, say, largely undocumented niche NVMe ZNS specification and protocol quirks.

ZoneFS provides us with a path to the former; what you are advocating is the latter....
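To make the append-only-file model concrete, here is a minimal sketch (not taken from ZenFS or any other existing code) of what an application's write path could look like on top of zonefs. It assumes a zonefs mount at /mnt/zonefs with sequential zone files under seq/, and follows the zonefs documentation's rules as I understand them: writes to a sequential zone file are O_DIRECT appends, the file size tracks the zone write pointer, and truncating the file to zero resets the zone.

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *zone = "/mnt/zonefs/seq/0";	/* hypothetical example path */
	struct stat st;
	void *buf;
	int fd;

	/* O_DIRECT writes must be aligned; 4096 covers common LBA sizes. */
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xab, 4096);

	fd = open(zone, O_WRONLY | O_APPEND | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Append one block; the kernel advances the zone write pointer. */
	if (write(fd, buf, 4096) != 4096)
		perror("write");

	/* The file size reflects the write pointer position. */
	if (!fstat(fd, &st))
		printf("write pointer at %lld bytes\n", (long long)st.st_size);

	/* Reclaim the zone: truncating to 0 resets the write pointer. */
	if (ftruncate(fd, 0))
		perror("ftruncate");

	close(fd);
	free(buf);
	return 0;
}

Nothing in there knows about the zone size, PO2 or otherwise; the same code works whether or not the zone capacity equals the zone size.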
I agree with all you say. I can see ZoneFS becoming a generic zone API, but we are not there yet. Rather than advocating for using raw devices, I am describing how zoned devices are being consumed today. So to me there are two things we need to consider: supporting current customers, and improving the way future customers consume these devices.

Coming back to the original topic of the LSF/MM discussion, what I would like to propose is that we support existing, deployed devices that are running in Linux and do not have PO2 zone sizes. These can then be consumed by btrfs or presented to applications through ZoneFS. And for existing customers, this will mean fewer headaches.

Note here that if we use ZoneFS and all we care about is zone capacities, then the whole PO2 argument for making applications more efficient does not apply anymore, as applications would be using the real capacity of the zone.

I very much like this approach.
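As a rough illustration of that last point (again, not from any existing code): if, as the zonefs documentation describes, a sequential zone file's size tracks the write pointer and its maximum size is the usable zone capacity (assumed below to be reported via st_blocks), then an application can size its data purely from the zone file and never sees a power-of-2 zone size at all.

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st)) {
		fprintf(stderr, "usage: %s <zonefs sequential zone file>\n",
			argv[0]);
		return 1;
	}

	/* Assumption: st_blocks reflects the maximum file size, i.e. the
	 * usable zone capacity, while st_size is the write pointer. */
	long long capacity = (long long)st.st_blocks * 512;
	long long written = (long long)st.st_size;

	printf("zone capacity: %lld bytes, written: %lld, remaining: %lld\n",
	       capacity, written, capacity - written);
	return 0;
}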
+ Hans (zenfs/rocksdb author)

Dave, thank you for your great insight. It is a great argument for why zonefs makes sense. I must admit that Damien has been telling me this multiple times, but I didn't fully grok the benefits until seeing them in the light of this thread.

Wrt RocksDB support using ZenFS - while raw block access was the initial approach, it is very easy to change it to use the zonefs API. Hans has already whipped up a plan for how to do it.
This is great. We have been thinking for some time about aligning with ZenFS for the in-kernel path. This might be the right time to take action on this.