On Sat, Mar 05, 2022 at 09:42:57AM +1100, Dave Chinner wrote:
> On Fri, Mar 04, 2022 at 02:10:08PM -0800, Luis Chamberlain wrote:
> > On Fri, Mar 04, 2022 at 11:10:22AM +1100, Dave Chinner wrote:
> > > On Wed, Mar 02, 2022 at 04:56:54PM -0800, Luis Chamberlain wrote:
> > > > Thinking proactively about LSFMM, regarding just Zone storage..
> > > >
> > > > I'd like to propose a BoF for Zoned Storage. The point of it is
> > > > to address the existing pain points we have and take advantage of
> > > > having folks in the room we can likely settle on things faster which
> > > > otherwise would take years.
> > > >
> > > > I'll throw at least one topic out:
> > > >
> > > >   * Raw access for zone append for microbenchmarks:
> > > >     - are we really happy with the status quo?
> > > >     - if not what outlets do we have?
> > > >
> > > > I think the nvme passthrough stuff deserves its own shared
> > > > discussion though and should not make it part of the BoF.
> > >
> > > Reading through the discussion on this thread, perhaps this session
> > > should be used to educate application developers about how to use
> > > ZoneFS so they never need to manage low level details of zone
> > > storage such as enumerating zones, controlling write pointers
> > > safely for concurrent IO, performing zone resets, etc.
> >
> > I'm not even sure users are really aware that given cap can be different
> > than zone size and btrfs uses zone size to compute size, the size is a
> > flat out lie.
>
> Sorry, I don't get what btrfs does with zone management has anything
> to do with using Zonefs to get direct, raw IO access to individual
> zones.

You are right for direct raw access. My point was that even for
filesystem design, I don't think the expectations are communicated
clearly. A filesystem design has to handle similar computations
itself, for instance.

> Direct IO on open zone fds is likely more efficient than
> doing IO through the standard LBA based block device because ZoneFS
> uses iomap_dio_rw() so it only needs to do one mapping operation per
> IO instead of one per page in the IO. Nor does it have to manage
> buffer heads or other "generic blockdev" functionality that direct
> IO access to zoned storage doesn't require.
>
> So whatever you're complaining about that btrfs lies about, does or
> doesn't do is irrelevant - Zonefs was written with the express
> purpose of getting user applications away from needing to directly
> manage zone storage.

I think it ended up that way; I can't say it was the goal from the
start. It seems the raw block patches had some support, and in the end
zonefs was presented as a possible outlet.

> So if you have special zone IO management
> requirements, work out how they can be supported by zonefs - we
> don't need yet another special purpose direct hardware access API
> for zone storage when we already have a solid solution to the
> problem already.

If this is fairly well decided, then that's that.

Calling zonefs solid, though, is a stretch.

> > modprobe null_blk nr_devices=0
> > mkdir /sys/kernel/config/nullb/nullb0
> > echo 0 > /sys/kernel/config/nullb/nullb0/completion_nsec
> > echo 0 > /sys/kernel/config/nullb/nullb0/irqmode
> > echo 2 > /sys/kernel/config/nullb/nullb0/queue_mode
> > echo 1024 > /sys/kernel/config/nullb/nullb0/hw_queue_depth
> > echo 1 > /sys/kernel/config/nullb/nullb0/memory_backed
> > echo 1 > /sys/kernel/config/nullb/nullb0/zoned
> >
> > echo 128 > /sys/kernel/config/nullb/nullb0/zone_size
> > # 6 zones are implied, we are saying 768 for the full storage size..
> > # but...
> > echo 768 > /sys/kernel/config/nullb/nullb0/size
> >
> > # If we force capacity to be way less than the zone sizes, btrfs still
> > # uses the zone size to do its data / metadata size computation...
> > echo 32 > /sys/kernel/config/nullb/nullb0/zone_capacity
>
> Then that's just a btrfs zone support bug where it's used the
> wrong information to size its zones. Why not just send a patch to
> fix it?

Fixing this can change the on-disk format of filesystems that have
already been created. So if the change is welcome, I think we would
need to be explicit about how it is supported.

  Luis
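
For reference, the size mismatch in the null_blk example above can be
worked out with plain shell arithmetic. This is only a sketch: the
variable names are made up for illustration, the three values are copied
from the quoted configuration (null_blk takes them in MB), and it does
not touch configfs or any real device.

```shell
#!/bin/sh
# Values copied from the quoted null_blk configuration, all in MB.
# The variable names are hypothetical, chosen only for this sketch.
size_mb=768            # .../nullb0/size
zone_size_mb=128       # .../nullb0/zone_size
zone_capacity_mb=32    # .../nullb0/zone_capacity

# The number of zones follows from the device size and the zone size.
nr_zones=$((size_mb / zone_size_mb))

# Space implied if every zone is assumed writable up to zone_size.
assumed_mb=$((nr_zones * zone_size_mb))

# Space that can actually be written, limited by zone_capacity.
usable_mb=$((nr_zones * zone_capacity_mb))

echo "zones:                 $nr_zones"
echo "sized by zone_size:    $assumed_mb MB"
echo "sized by zone_capacity: $usable_mb MB"
echo "overstated by:         $((assumed_mb - usable_mb)) MB"
```

With these numbers there are 6 zones, so sizing by zone_size reports
768 MB while only 192 MB is writable, which is the gap described above.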