Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices

On 14.03.2022 16:45, Damien Le Moal wrote:
> On 3/14/22 16:35, Christoph Hellwig wrote:
>> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
>>> The reason for the power of 2 requirement is twofold:
>>> 1) At the time we added zone support for SMR, chunk_sectors had to be a
>>> power of 2 number of sectors.
>>> 2) SMR users did request power of 2 zone sizes and that all zones have
>>> the same size, as that simplified software design. There was even a
>>> de facto agreement that a 256MB zone size is a good compromise between
>>> usability and the overhead of zone reclaim/GC. But that particular number
>>> is specific to HDDs because of their performance characteristics.
>>
>> Also for NVMe we initially went down the road of trying to support
>> non-power of two sizes.  But there was another major early host that
>> really wanted power of two zone sizes to support hardware-based
>> hosts that can cheaply do shifts but not divisions.  The variable
>> zone capacity feature (something that Linux does not currently support),
>> which was requested by NVMe members on both the host and device side,
>> can also only be supported with the zone size / zone capacity split.
>>
>>> The other solution would be adding a dm-unhole target to remap sectors
>>> to remove the holes from the device address space. Such a target would be
>>> easy to write, but in my opinion, this would still not change the fact
>>> that applications have to deal with error recovery and active/open
>>> zone resources. So they still have to be zone aware and operate per zone.
>>
>> I don't think we even need a new target for it.  I think you can do
>> this with a table using multiple dm-linear sections already if you
>> want.
>
> Nope, this is currently not possible: DM requires the target zone size
> to be the same as the underlying device zone size. So that would not work.
>
>>> My answer to your last question ("Are we sure?") is thus: No. I am not
>>> sure this is a good idea. But as always, I would be happy to be proven
>>> wrong. So far, I have not seen any argument doing that.
>>
>> Agreed. Supporting non-power of two sizes in the block layer is fairly
>> easy, as shown by some of the patches seen in this series.  Supporting
>> them properly in the whole ecosystem is not trivial and will create a
>> long-term burden.  We could do that, but we'd rather have a really good
>> reason for it, and right now I don't see that.

I think that Bo's use case is an example of a major upstream Linux host
that is struggling with unmapped LBAs. Can we focus on this use case
and the parts that we are missing to support Bytedance?
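A rough illustration of where such unmapped LBAs come from, assuming the usual ZNS layout of a power-of-2 zone size with a smaller zone capacity. The geometry below is hypothetical (not Bo's device), and `is_mapped`, `unhole_to_dev` and `dev_to_unhole` are illustrative names only: they sketch the sector arithmetic a hole-removing remap (the dm-unhole idea discussed above) would need, not any existing kernel or DM code.

```python
# Hypothetical geometry (not from the thread), in 512-byte sectors.
SECTOR_BYTES = 512
ZONE_SIZE = 1 << 17     # zone size: power of 2 (64 MiB)
ZONE_CAP = 100_000      # writable capacity per zone, smaller than ZONE_SIZE


def is_mapped(dev_sector: int) -> bool:
    """True if a device sector lies inside its zone's capacity."""
    return dev_sector % ZONE_SIZE < ZONE_CAP


def unhole_to_dev(lba: int) -> int:
    """Map a sector of a hole-free address space onto the device."""
    zone, off = divmod(lba, ZONE_CAP)   # divides by the non-power-of-2 capacity
    return zone * ZONE_SIZE + off


def dev_to_unhole(dev_sector: int) -> int:
    """Inverse mapping; sectors in the gap between capacity and size have none."""
    zone, off = divmod(dev_sector, ZONE_SIZE)
    if off >= ZONE_CAP:
        raise ValueError(f"sector {dev_sector} falls in an unmapped hole")
    return zone * ZONE_CAP + off


hole = ZONE_SIZE - ZONE_CAP
print(f"unmapped per zone: {hole} sectors ({hole * SECTOR_BYTES // 1024} KiB)")
```

Note that the forward mapping divides by the non-power-of-2 capacity, so removing the holes moves the division cost into the remap layer rather than eliminating it.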

If you agree to this, I believe we can add support for ZoneFS pretty
easily. We also have a POC in btrfs that we will follow up on. For the time
being, F2FS would fail at mkfs time if the zone size is not a power of 2.
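The shift-versus-division point raised earlier in the thread can be sketched as follows: with a power-of-2 zone size, a sector's zone number and in-zone offset come from a shift and a mask, while any other zone size needs a division. This is illustrative Python with hypothetical helper names and sizes, not block-layer or F2FS code.

```python
def zone_of_po2(sector: int, zone_sectors: int) -> tuple[int, int]:
    """Zone number and offset using only a shift and a mask."""
    if zone_sectors <= 0 or zone_sectors & (zone_sectors - 1):
        raise ValueError("zone size must be a power of 2")
    shift = zone_sectors.bit_length() - 1   # log2(zone_sectors)
    return sector >> shift, sector & (zone_sectors - 1)


def zone_of_any(sector: int, zone_sectors: int) -> tuple[int, int]:
    """Zone number and offset for an arbitrary zone size (needs a division)."""
    return divmod(sector, zone_sectors)


# Hypothetical zone sizes, in 512-byte sectors.
print(zone_of_po2(300_000, 1 << 17))   # (2, 37856)
print(zone_of_any(300_000, 100_000))   # (3, 0)
```

A host with cheap shifts but no fast divider can use the first form; a non-power-of-2 zone size forces the second form on every sector-to-zone lookup.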

What do you think?