Re: Non-power-of-2 zone size

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 7/5/24 04:00, Mikulas Patocka wrote:
> 
> 
>> On Thu, 4 Jul 2024, Li Dong wrote:
>>
>>> For zone block devices, device_area_is_invalid may return an incorrect 
>>> value.
>>>
>>> Failure log:
>>> [   19.337657]: device-mapper: table: 254:56: len=836960256 not aligned to
>>> h/w zone size 3244032 of sde
>>> [   19.337665]: device-mapper: core: Cannot calculate initial queue limits
>>> [   19.337667]: device-mapper: ioctl: unable to set up device queue for 
>>> new table.
>>>
>>> Actually, the device's zone length is aligned to the zonesize.
>>>
>>> Fixes: 5dea271b6d87 ("dm table: pass correct dev area size to device_area_is_valid")
>>> Signed-off-by: Li Dong <lidong@xxxxxxxx>
>>> ---
>>>  drivers/md/dm-table.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>>> index 33b7a1844ed4..0bddae0bee3c 100644
>>> --- a/drivers/md/dm-table.c
>>> +++ b/drivers/md/dm-table.c
>>> @@ -257,7 +257,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>>>  	if (bdev_is_zoned(bdev)) {
>>>  		unsigned int zone_sectors = bdev_zone_sectors(bdev);
>>>  
>>> -		if (start & (zone_sectors - 1)) {
>>> +		if (start % zone_sectors) {
>>>  			DMERR("%s: start=%llu not aligned to h/w zone size %u of %pg",
>>>  			      dm_device_name(ti->table->md),
>>>  			      (unsigned long long)start,
>>> @@ -274,7 +274,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>>>  		 * devices do not end up with a smaller zone in the middle of
>>>  		 * the sector range.
>>>  		 */
>>> -		if (len & (zone_sectors - 1)) {
>>> +		if (len % zone_sectors) {
>>>  			DMERR("%s: len=%llu not aligned to h/w zone size %u of %pg",
>>>  			      dm_device_name(ti->table->md),
>>>  			      (unsigned long long)len,
>>> -- 
>>> 2.31.1.windows.1
> 
> I grepped the kernel for bdev_zone_sectors and there are more assumptions 
> that bdev_zone_sectors is a power of 2.
> 
> drivers/md/dm-zone.c:           sector_t mask = bdev_zone_sectors(disk->part0) - 1
> drivers/nvme/target/zns.c:      if (get_capacity(bd_disk) & (bdev_zone_sectors(ns->bdev) - 1))
> drivers/nvme/target/zns.c:      if (sect & (bdev_zone_sectors(req->ns->bdev) - 1)) {
> fs/zonefs/super.c:      sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
> fs/btrfs/zoned.c:       return (sector_t)zone_number << ilog2(bdev_zone_sectors(bdev));
> fs/btrfs/zoned.c:	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
> include/linux/blkdev.h: return sector & (bdev_zone_sectors(bdev) - 1);
> fs/f2fs/super.c:	if (nr_sectors & (zone_sectors - 1))
> 
> So, if we want to support non-power-of-2 zone size, we need some 
> systematic fix. Now it appears that Linux doesn't even attempt to support 
> disks non-power-of-2 zone size.

Correct. We currently do not support zoned devices with a zone size that is not
a power of 2 number of LBAs. Such drives are rejected by the block layer. So I
am surprised that Li could even reach a DM error. It means that his kernel was
already patched to accept zoned drives with a zone size that is not a power of 2.

> I added Damien Le Moal so that he can help with testing disks with 
> non-power-of-2 zone size (if WD is actually making them).

No, I do not have such drive. The vast majority of zoned device users today are
SMR HDD users, and in that space, no one wants a non power of 2 zone size drive.
As far as I know, the main push is from zoned-UFS users for Android, as Pankaj
mentioned.

As you rightly guessed, and as Pankaj confirmed, supporting non power of 2 zone
size drives requires a lot more changes than just the fix Li sent for DM. Pankaj
initial patch set that did not make it upstream (but is in Android) would need
to be reworked given the extensive changes that happened since then (zone write
locking replaced with zone write plugging, queue limits changes, etc).

-- 
Damien Le Moal
Western Digital Research





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux