> > I understand that you point to ZoneFS for this. It is true that it
> > was presented at the moment as the way to do raw zone access from
> > user-space.
> >
> > However, there are no users of ZoneFS for ZNS devices that I am
> > aware of (maybe for SMR this is a different story). The main
> > open-source implementations out there for RocksDB that are being
> > used in production (ZenFS and xZTL) rely on either raw zone block
> > access or the generic char device in NVMe (/dev/ngXnY).
>
> That's exactly the situation we want to avoid.
>
> You're talking about accessing zoned storage by knowing directly how
> the hardware works and interfacing directly with hardware-specific
> device commands.
>
> This is exactly what is wrong with this whole conversation - direct
> access to hardware is fragile and very limiting, and the whole
> purpose of having an operating system is to abstract the hardware
> functionality into a generally usable API. That way, when something
> new gets added to the hardware or something gets removed, the
> applications don't break, because they weren't written with that
> sort of hardware functionality extension in mind.
>
> I understand that RocksDB probably went direct to the hardware
> because, at the time, it was the only choice the developers had to
> make use of ZNS-based storage. I understand that.
>
> However, I also understand that there are *better options now* that
> allow applications to target zoned storage in a way that doesn't
> expose them to the foibles of hardware support and storage protocol
> specifications and characteristics.
>
> The generic interface that the kernel provides for zoned storage is
> called ZoneFS. Forget about the fact that it is a filesystem; all it
> does is provide userspace with a named zone abstraction for a zoned
> device: every zone is an append-only file.
>
> That's what I'm trying to get across here - this whole discussion
> about zone capacity not matching zone size is a
> hardware/specification detail that applications *do not need to know
> about* to use zoned storage. That's something that zonefs can/does
> hide from applications completely - the zone files behave exactly
> the same from the user perspective regardless of whether the
> hardware zone capacity is the same as or less than the zone size.
>
> Expanding access to the hardware and/or raw block devices so that
> userspace applications can directly manage zone write pointers, zone
> capacity/space limits, etc. is the wrong architectural direction to
> be taking. The sort of *hardware quirks* being discussed in this
> thread need to be managed by the kernel and hidden from userspace;
> userspace shouldn't need to care about such weird and esoteric
> hardware and storage protocol/specification/implementation
> differences.
>
> IMO, while RocksDB is the technology leader for ZNS, it is not the
> model that new applications should be trying to emulate. They should
> be designed from the ground up to use ZoneFS instead of directly
> accessing nvme devices or trying to use the raw block devices for
> zoned storage. Use the generic kernel abstraction for the hardware,
> like applications do for all other things!
>
> > This is because having the capability to do zone management from
> > applications that already work with objects fits much better.
>
> ZoneFS doesn't absolve applications from having to perform zone
> management to pack their objects and garbage collect stale storage
> space. ZoneFS merely provides a generic, file-based,
> hardware-independent API for performing these zone management tasks.
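To make this concrete, here is roughly what application code looks
like against zonefs (the mount point, zone file path, and 4 KiB block
size are made up for the example; zonefs requires direct I/O for
sequential zone writes, hence the aligned buffer):

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLKSZ 4096

int main(void)
{
	struct stat st;
	void *buf;

	/* Zone 0 of the device, exposed as a plain file. */
	int fd = open("/mnt/zonefs/seq/0", O_WRONLY | O_APPEND | O_DIRECT);
	if (fd < 0)
		return 1;

	if (posix_memalign(&buf, BLKSZ, BLKSZ))
		return 1;
	memset(buf, 'x', BLKSZ);

	/* Appends land at the zone write pointer - no write pointer
	 * tracking and no zone capacity vs zone size arithmetic in
	 * the application. */
	if (write(fd, buf, BLKSZ) != BLKSZ)
		return 1;

	/* The file size *is* the write pointer position. */
	fstat(fd, &st);

	/* Resetting the zone is just truncating the file to zero. */
	ftruncate(fd, 0);

	free(buf);
	close(fd);
	return 0;
}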
> > My point is that there is space for both ZoneFS and the raw zoned
> > block device. And regarding !PO2 zone sizes, my point is that this
> > can be leveraged both by btrfs and this raw zone block device.
>
> On that I disagree - any argument that starts with "we need raw
> zoned block device access to ...." is starting from an invalid
> premise. We should be hiding hardware quirks from userspace, not
> exposing them further.
>
> IMO, we want writing zoned-storage-native applications to be simple
> and approachable by anyone who knows how to write to append-only
> files. We do not want such applications to be limited to people who
> have deep and rare expertise in the dark details of, say, largely
> undocumented niche NVMe ZNS specification and protocol quirks.
>
> ZoneFS provides us with a path to the former; what you are
> advocating is the latter....

+ Hans (zenfs/rocksdb author)

Dave, thank you for your great insight. It is a great argument for
why zonefs makes sense. I must admit that Damien has been telling me
this multiple times, but I didn't fully grok the benefits until
seeing it in the light of this thread.

Wrt RocksDB support using ZenFS - while raw block access was the
initial approach, it is very easy to change to use the zonefs API.
Hans has already whipped up a plan for how to do it.
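The translation is roughly the following (illustrative only, not the
actual ZenFS changes):

  raw zoned block device               zonefs
  ----------------------------------------------------------------
  ioctl(fd, BLKRESETZONE, &range)   -> ftruncate(zfd, 0)
  ioctl(fd, BLKFINISHZONE, &range)  -> ftruncate(zfd, max file size)
  write at the zone write pointer   -> write(zfd, ...) with O_APPEND
  ioctl(fd, BLKREPORTZONE, &report) -> fstat(zfd, &st), st_size == wp

Since the zone state can be recovered from plain stat() information,
the zone bookkeeping can map more or less one-to-one onto file
operations.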