Hi,

We found that the "nvme format ..." command fails to format an NVMe disk
with the block size set to 512.

Notes and observations:
=======================
This is observed on the latest Linus kernel tree. It was working well on
kernel v6.8.

Test details:
=============
At system boot, or when the NVMe disk is hot-plugged, the NVMe block size
is 4096. If we later try to format it with a block size of 512 (lbaf=2),
the format fails. Interestingly, if we start with an NVMe block size of
512 and later format it with a block size of 4096 (lbaf=0), it does not
fail.

Please note that CONFIG_NVME_MULTIPATH is enabled.

Please find further details below:

# lspci
0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

# nvme list
Node          Generic     SN              Model                    Namespace  Usage              Format       FW Rev
------------- ----------- --------------- ------------------------ ---------- ------------------ ------------ --------
/dev/nvme0n1  /dev/ng0n1  S6EUNA0R500358  1.6TB NVMe Gen4 U.2 SSD  0x1        1.60 TB / 1.60 TB  512 B + 0 B  REV.SN49

# nvme id-ns /dev/nvme0n1 -H
NVME Identify Namespace 1:
nsze  : 0xba4d4ab0
ncap  : 0xba4d4ab0
nuse  : 0xba4d4ab0
<snip>
<snip>
nlbaf : 4
flbas : 0
  [6:5] : 0     Most significant 2 bits of Current LBA Format Selected
  [4:4] : 0     Metadata Transferred in Separate Contiguous Buffer
  [3:0] : 0     Least significant 4 bits of Current LBA Format Selected
<snip>
<snip>
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
LBA Format  1 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0x2 Good
LBA Format  2 : Metadata Size: 0   bytes - Data Size: 512 bytes  - Relative Performance: 0x1 Better
LBA Format  3 : Metadata Size: 8   bytes - Data Size: 512 bytes  - Relative Performance: 0x3 Degraded
LBA Format  4 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0x3 Degraded

# lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0   4096      0    4096    4096    0               128    0B
                                   ^^^^    ^^^^

!!!! FAILING TO FORMAT with 512 bytes of block size !!!!
# nvme format /dev/nvme0n1 --lbaf=2 --pil=0 --ms=0 --pi=0 -f
Success formatting namespace:1
failed to set block size to 512
^^^

# lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0   4096      0    4096    4096    0               128    0B
                                   ^^^^    ^^^^

# cat /sys/block/nvme0n1/queue/logical_block_size
4096
# cat /sys/block/nvme0n1/queue/physical_block_size
4096
# cat /sys/block/nvme0c0n1/queue/logical_block_size
512
# cat /sys/block/nvme0c0n1/queue/physical_block_size
512

# nvme id-ns /dev/nvme0n1 -H
NVME Identify Namespace 1:
nsze  : 0xba4d4ab0
ncap  : 0xba4d4ab0
nuse  : 0xba4d4ab0
<snip>
<snip>
nlbaf : 4
flbas : 0x2
  [6:5] : 0     Most significant 2 bits of Current LBA Format Selected
  [4:4] : 0     Metadata Transferred in Separate Contiguous Buffer
  [3:0] : 0x2   Least significant 4 bits of Current LBA Format Selected
<snip>
<snip>
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  1 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0x2 Good
LBA Format  2 : Metadata Size: 0   bytes - Data Size: 512 bytes  - Relative Performance: 0x1 Better (in use)
LBA Format  3 : Metadata Size: 8   bytes - Data Size: 512 bytes  - Relative Performance: 0x3 Degraded
LBA Format  4 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0x3 Degraded

Note: We can see above that the namespace is indeed formatted with
lbaf 2 (block size 512); however, the block queue limits of the top
device are not updated accordingly.

Git bisect:
===========
Git bisect identifies the following as the first bad commit:

8f03cfa117e06bd2d3ba7ed8bba70a3dda310cae is the first bad commit
commit 8f03cfa117e06bd2d3ba7ed8bba70a3dda310cae
Author: Christoph Hellwig <hch@xxxxxx>
Date:   Mon Mar 4 07:04:51 2024 -0700

    nvme: don't use nvme_update_disk_info for the multipath disk

    Currently nvme_update_ns_info_block calls nvme_update_disk_info both
    for the namespace attached disk, and the multipath one (if it
    exists). This is very different from how other stacking drivers
    work, and leads to a lot of complexity.

    Switch to setting the disk capacity and initializing the integrity
    profile, and let blk_stack_limits which already is called just below
    deal with updating the other limits.

    Signed-off-by: Christoph Hellwig <hch@xxxxxx>
    Signed-off-by: Keith Busch <kbusch@xxxxxxxxxx>

 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

The above commit is part of the new atomic queue limit updates patch
series. For an NVMe device with multipath enabled, we rely on
blk_stack_limits() to update the queue limits of the stacked device.
To update the logical/physical block size limits of the top (nvme%dn%d)
device, blk_stack_limits() takes the max of the top and bottom limits:

t->logical_block_size = max(t->logical_block_size,
                            b->logical_block_size);
t->physical_block_size = max(t->physical_block_size,
                             b->physical_block_size);

When we try formatting the NVMe disk with a block size of 512,
t->logical_block_size is still 4096 (the initial block size), while
b->logical_block_size is 512 (the block size of the bottom device is
updated first, in nvme_update_ns_info_block()). Since max() always
picks the larger value, the stale 4096 limit on the top device wins
and the new 512 limit never propagates to nvme0n1.

I think we may want to update the queue limits of both the top and
bottom devices in nvme_update_ns_info_block(); rough sketches of the
problem and of that idea follow below. Or is there some other way to
handle this?
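To make the failure mode concrete, here is a minimal, standalone C
model of the max() merging quoted above. Note that struct queue_limits
below is a simplified stand-in for illustration only, not the kernel's
definition:

#include <stdio.h>

/* Simplified stand-in for the kernel's struct queue_limits. */
struct queue_limits {
	unsigned int logical_block_size;
	unsigned int physical_block_size;
};

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* Models the block-size merging done by blk_stack_limits(). */
static void stack_limits(struct queue_limits *t,
			 const struct queue_limits *b)
{
	t->logical_block_size = max_u(t->logical_block_size,
				      b->logical_block_size);
	t->physical_block_size = max_u(t->physical_block_size,
				       b->physical_block_size);
}

int main(void)
{
	/* Top (nvme0n1) still carries the pre-format 4096 limits. */
	struct queue_limits top = { 4096, 4096 };
	/* Bottom (nvme0c0n1) was already updated to 512 by the format. */
	struct queue_limits bottom = { 512, 512 };

	stack_limits(&top, &bottom);
	printf("stale top wins:   LOG-SEC=%u\n", top.logical_block_size);

	/* Resetting the top limits before stacking lets the bottom
	 * device's new value through. */
	top.logical_block_size = 512;
	top.physical_block_size = 512;
	stack_limits(&top, &bottom);
	printf("reset then stack: LOG-SEC=%u\n", top.logical_block_size);
	return 0;
}

The first printf() reports 4096 even though the bottom device is at
512; the second reports 512 once the top limits are reset before
stacking.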
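As one possible direction, here is a rough, untested sketch of how the
head (top) disk's limits could be re-stacked from scratch using the
atomic queue-limits helpers from the same series
(queue_limits_start_update()/queue_limits_commit_update()) together
with blk_set_stacking_limits(). This is only to illustrate the idea;
queue freezing and the exact call site (after the bottom device's
limits have been updated) would need more care:

	if (nvme_ns_head_multipath(ns->head)) {
		struct queue_limits lim;
		int ret;

		lim = queue_limits_start_update(ns->head->disk->queue);
		/* Start from permissive stacking defaults so that the
		 * stale 4096 block size cannot win the max() merge. */
		blk_set_stacking_limits(&lim);
		blk_stack_limits(&lim, &ns->queue->limits, 0);
		ret = queue_limits_commit_update(ns->head->disk->queue,
						 &lim);
		if (ret)
			return ret;
	}

Let me know if you need any further information.

Thanks,
--Nilay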