On 07.03.2022 10:27, Matias Bjørling wrote:
> I understand that you point to ZoneFS for this. It is true that it
> was presented at the moment as the way to do raw zone access from
> user-space.
>
> However, there is no users of ZoneFS for ZNS devices that I am aware
> of (maybe for SMR this is a different story). The main open-source
> implementations out there for RocksDB that are being used in
> production (ZenFS and xZTL) rely on either raw zone block access or
> the generic char device in NVMe (/dev/ngXnY).

That's exactly the situation we want to avoid. You're talking about accessing zoned storage by knowing directly how the hardware works and interfacing directly with hardware-specific device commands.

This is exactly what is wrong with this whole conversation - direct access to hardware is fragile and very limiting, and the whole purpose of having an operating system is to abstract the hardware functionality into a generally usable API. That way, when something new gets added to the hardware or something gets removed, the applications don't break, even though they weren't written with that sort of hardware functionality change in mind.

I understand that RocksDB probably went direct to the hardware because, at the time, it was the only choice the developers had to make use of ZNS based storage. I understand that. However, I also understand that there are *better options now* that allow applications to target zoned storage in a way that doesn't expose them to the foibles of hardware support and storage protocol specifications and characteristics.

The generic interface that the kernel provides for zoned storage is called ZoneFS. Forget about the fact that it is a filesystem; all it does is provide userspace with a named zone abstraction for a zoned device: every zone is an append-only file.

That's what I'm trying to get across here - this whole discussion about zone capacity not matching zone size is a hardware/specification detail that applications *do not need to know about* to use zoned storage. That's something that ZoneFS can/does hide from applications completely - the zone files behave exactly the same from the user perspective regardless of whether the hardware zone capacity is the same as or less than the zone size.

Expanding access to the hardware and/or raw block devices so that userspace applications can directly manage zone write pointers, zone capacity/space limits, etc. is the wrong architectural direction to be taking. The sort of *hardware quirks* being discussed in this thread need to be managed by the kernel and hidden from userspace; userspace shouldn't need to care about such weird and esoteric hardware and storage protocol/specification/implementation differences.

IMO, while RocksDB is the technology leader for ZNS, it is not the model that new applications should be trying to emulate. They should be designed from the ground up to use ZoneFS instead of directly accessing NVMe devices or trying to use the raw block devices for zoned storage. Use the generic kernel abstraction for the hardware, like applications do for everything else!

> This is because having the capability to do zone management from
> applications that already work with objects fits much better.

ZoneFS doesn't absolve applications from having to perform zone management to pack their objects and garbage collect stale storage space. ZoneFS merely provides a generic, file-based, hardware-independent API for performing these zone management tasks.

> My point is that there is space for both ZoneFS and raw zoned block
> device. And regarding !PO2 zone sizes, my point is that this can be
> leveraged both by btrfs and this raw zone block device.

On that I disagree - any argument that starts with "we need raw zoned block device access to ...." is starting from an invalid premise. We should be hiding hardware quirks from userspace, not exposing them further.

IMO, we want writing zone-storage-native applications to be simple and approachable by anyone who knows how to write to append-only files. We do not want such applications to be limited to people who have deep and rare expertise in the dark details of, say, largely undocumented niche NVMe ZNS specification and protocol quirks.

ZoneFS provides us with a path to the former; what you are advocating is the latter....
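To make the append-only-file model concrete, here is a minimal sketch (not taken from ZenFS or any other existing code) of what an application's write path could look like on top of zonefs. It assumes a zonefs mount at /mnt/zonefs with sequential zone files under seq/, and follows the zonefs documentation's rules as I understand them: writes to a sequential zone file are O_DIRECT appends, the file size tracks the zone write pointer, and truncating the file to zero resets the zone.

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *zone = "/mnt/zonefs/seq/0";	/* hypothetical example path */
	struct stat st;
	void *buf;
	int fd;

	/* O_DIRECT writes must be aligned; 4096 covers common LBA sizes. */
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xab, 4096);

	fd = open(zone, O_WRONLY | O_APPEND | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Append one block; the kernel advances the zone write pointer. */
	if (write(fd, buf, 4096) != 4096)
		perror("write");

	/* The file size reflects the write pointer position. */
	if (!fstat(fd, &st))
		printf("write pointer at %lld bytes\n", (long long)st.st_size);

	/* Reclaim the zone: truncating to 0 resets the write pointer. */
	if (ftruncate(fd, 0))
		perror("ftruncate");

	close(fd);
	free(buf);
	return 0;
}

Nothing in there knows about the zone size, PO2 or otherwise; the same code works whether or not the zone capacity equals the zone size.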
I agree with all you say. I can see ZoneFS becoming a generic zone API, but we are not there yet. Rather than advocating for using raw devices, I am describing how zoned devices are being consumed today. So to me there are two things we need to consider: supporting current customers, and improving the way future customers consume these devices.

Coming back to the original topic of the LSF/MM discussion, what I would like to propose is that we support existing, deployed devices that are running in Linux and do not have PO2 zone sizes. These can then be consumed by btrfs or presented to applications through ZoneFS. And for existing customers, this will mean fewer headaches.

Note here that if we use ZoneFS and all we care about is zone capacities, then the whole PO2 argument for making applications more efficient does not apply anymore, as applications would be using the real capacity of the zone.

I very much like this approach.
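As a rough illustration of that last point (again, not from any existing code): if, as the zonefs documentation describes, a sequential zone file's size tracks the write pointer and its maximum size is the usable zone capacity (assumed below to be reported via st_blocks), then an application can size its data purely from the zone file and never sees a power-of-2 zone size at all.

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st)) {
		fprintf(stderr, "usage: %s <zonefs sequential zone file>\n",
			argv[0]);
		return 1;
	}

	/* Assumption: st_blocks reflects the maximum file size, i.e. the
	 * usable zone capacity, while st_size is the write pointer. */
	long long capacity = (long long)st.st_blocks * 512;
	long long written = (long long)st.st_size;

	printf("zone capacity: %lld bytes, written: %lld, remaining: %lld\n",
	       capacity, written, capacity - written);
	return 0;
}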
+ Hans (zenfs/rocksdb author)

Dave, thank you for your great insight. It is a great argument for why zonefs makes sense. I must admit that Damien has been telling me this multiple times, but I didn't fully grok the benefits until seeing them in the light of this thread.

Wrt RocksDB support using ZenFS - while raw block access was the initial approach, it is very easy to change it to use the zonefs API. Hans has already whipped up a plan for how to do it.
This is great. We have been thinking for some time about aligning with ZenFS for the in-kernel path. This might be the right time to take action on this.