On 3/12/22 07:24, Luis Chamberlain wrote:
> On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote:
>> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
>>> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
>>>
>>>> I'm starting to like the previous idea of creating an unholey
>>>> device-mapper for such users...
>>>
>>> Won't that restrict nvme with chunk size crap. For instance later if we
>>> want much larger block sizes.
>>
>> I'm not sure I understand. The chunk_size has nothing to do with the
>> block size. And while nvme is a user of this in some circumstances, it
>> can't be used concurrently with ZNS because the block layer appropriates
>> the field for the zone size.
>
> Many device mapper targets split I/O into chunks, see max_io_len(),
> wouldn't this create an overhead?

Apart from the bio clone, the overhead should not be higher than what the block layer already has. IOs that are too large or that straddle zones are split by the block layer, and an IO already split by DM generally needs no further split in the block layer for the underlying device. DM essentially follows the same pattern: max_io_len() depends on the target design limits, which in turn depend on the underlying device. For a dm-unhole target, the IO size limit would typically be the same as that of the underlying device.

> Using a device mapper target also creates a divergence in strategy
> for ZNS. Some will use the block device, others the dm target. The
> goal should be to create a unified path.

If we allow non power of 2 zone sized devices, the path will *never* be unified, because we will get fragmentation in what can run on these devices as opposed to power of 2 sized ones. E.g. f2fs will not work on the former but will on the latter. That is really not an ideal situation.

>
> And all this, just because SMR. Is that worth it? Are we sure?

No. This is *not* because of SMR. Never has been. The first prototype SMR drives I received in my lab 10 years ago did not have a power of 2 zone size, because zones were naturally aligned to tracks, which, like NAND erase blocks, are not necessarily power of 2 sized. And all zones were not even the same size. That was not usable.

The reason for the power of 2 requirement is twofold:

1) At the time we added zone support for SMR, chunk_sectors had to be a power of 2 number of sectors.

2) SMR users did request power of 2 zone sizes, and that all zones have the same size, as that simplified software design. There was even a de facto agreement that a 256MB zone size is a good compromise between usability and the overhead of zone reclaim/GC. But that particular number is for HDDs, due to their performance characteristics.

Hence the current Linux requirements, which have been serving us well so far.

DM needed chunk_sectors to accept non power of 2 values, so that restriction was lifted recently (I can't remember which version added this). Allowing a non power of 2 zone size would thus be more easily feasible now. Allowing devices with a non power of 2 zone size is not technically difficult. But...

The problem being raised is all about the fact that the power of 2 zone size requirement creates a hole of unusable sectors in every zone when the device implementation has a zone capacity lower than the zone size.
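To make that hole concrete, here is a small illustration (plain C with made-up names, not actual kernel or DM code) of the remapping a dm-unhole style target would have to apply on every IO: it exposes a contiguous logical space made of zone_capacity-sized zones and re-bases each sector onto the start of the corresponding device zone, skipping the unusable sectors at the end of each zone.

/*
 * Illustration only: hypothetical "dm-unhole" style remapping.
 * The target exposes zones of zone_capacity sectors back to back
 * (no holes); the device has zones of zone_size sectors (power of 2)
 * of which only the first zone_capacity sectors are usable.
 */
#include <stdint.h>

static inline uint64_t unhole_to_device(uint64_t lsector,
					uint64_t zone_capacity,
					uint64_t zone_size)
{
	uint64_t zone = lsector / zone_capacity;   /* logical zone number */
	uint64_t offset = lsector % zone_capacity; /* offset within that zone */

	/* Re-base onto the device zone start, jumping over the hole. */
	return zone * zone_size + offset;
}

Note that even with a power of 2 zone_size on the device side, the division and modulo above are by zone_capacity, which is generally not a power of 2, so such a target would pay that arithmetic cost on every IO.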
I have been arguing all along that I think this problem is a non-problem, simply because a well designed application should *always* use zones as storage containers without ever assuming that the next zone in sequence can be used as well. The application should *never* consider the entire LBA space of the device capacity without this zone split. Zone based management of the capacity is necessary for any good design to deal correctly with write error recovery and active/open zone resource management. And as Keith said, there is always a "hole" anyway for any non-full zone, between the zone write pointer and the last usable sector in the zone. Reads there are nonsensical and writes can only go to one place.

Now, in the spirit of trying to facilitate software development for zoned devices, we can try finding solutions to remove that hole.

zonefs is an obvious solution. But back to the previous point: with one zone == one file, there is no continuity in the storage address space that the application can use. The application has to be designed to use individual files, each representing a zone. And with such a design, an equivalent design directly using the block device file would have no difficulties due to the sector hole between zone capacity and zone size. I have a prototype LevelDB implementation that can use both zonefs and a block device file on ZNS, with only a few different lines of code, to prove this point.

The other solution would be adding a dm-unhole target that remaps sectors so as to remove the holes from the device address space. Such a target would be easy to write, but in my opinion it would still not change the fact that applications have to deal with error recovery and active/open zone resources, so they still have to be zone aware and operate per zone. Furthermore, adding such a DM target would create a zoned device with a non power of 2 zone size, which will need support from the block layer, so some block layer functions will need to change. In the end, this may not be different from enabling non power of 2 zone sized devices for ZNS directly.

And for this decision, I maintain some of my requirements:

1) The added overhead from multiplications & divisions should be acceptable and not degrade performance. Otherwise, this would be a disservice to the zone ecosystem (a small sketch of the arithmetic difference at stake follows at the end of this email).

2) Nothing that works today on available devices should break.

3) Zone size requirements will still exist, e.g. the btrfs 64K alignment requirement.

But even with all these properly addressed, f2fs will not work anymore, some in-kernel users will still have zone size requirements (btrfs), and *all* applications using a zoned block device file will now have to be designed for non power of 2 zone sizes so that they can work on all devices. Meaning that this is also potentially forcing changes on existing applications before they can use newer zoned devices that may not have a power of 2 zone size.

This entire discussion is about the problem that a power of 2 zone size creates (which, again, I think is a non-problem). However, based on the arguments above, allowing non power of 2 zone sized devices is not exactly problem free either.

My answer to your last question ("Are we sure?") is thus: No. I am not sure this is a good idea. But as always, I would be happy to be proven wrong. So far, I have not seen any argument doing that.

-- 
Damien Le Moal
Western Digital Research
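To put requirement 1) in concrete terms, this is the arithmetic difference at stake (plain C sketch, not the actual block layer code): with a power of 2 zone size, the per-IO zone number and zone offset calculations reduce to a shift and a mask, while with an arbitrary zone size they become 64-bit divisions and modulos in the hot path.

#include <stdint.h>

/* Power of 2 zone size: zone number and offset are a shift and a mask. */
static inline uint64_t zone_no_pow2(uint64_t sector, unsigned int zone_bits)
{
	return sector >> zone_bits;
}

static inline uint64_t zone_off_pow2(uint64_t sector, uint64_t zone_size)
{
	return sector & (zone_size - 1);
}

/* Arbitrary zone size: a 64-bit division and modulo on every IO. */
static inline uint64_t zone_no_any(uint64_t sector, uint64_t zone_size)
{
	return sector / zone_size;
}

static inline uint64_t zone_off_any(uint64_t sector, uint64_t zone_size)
{
	return sector % zone_size;
}

Whether that extra cost is measurable at the IO rates these devices can sustain is exactly what would need to be demonstrated before lifting the requirement.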