Unusual value of optimal_io_size prevents bcache initialization

Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> · Fri, 22 Sep 2023 15:26:46 +0200

Hi Coly,
recently I was testing bcache on new HW when, while creating a bcache 
device with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:

------------[ cut here ]------------
WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon 
poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha 
nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter 
nls_iso8859_1 xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua bcache crc64 raid0 aes_ce_blk crypto_simd cryptd 
aes_ce_cipher crct10dif_ce ghash_ce sha2_ce sha256_arm64 ena sha1_ce 
sch_fq_codel drm efi_pstore ip_tables x_tables autofs4
CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws 
#44~20.04.1-Ubuntu
Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
Workqueue: events register_bdev_worker [bcache]
pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : kvmalloc_node+0x12c/0x178
lr : kvmalloc_node+0x74/0x178
sp : ffff80000ea4bc90
x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
Call trace:
 kvmalloc_node+0x12c/0x178
 bcache_device_init+0x80/0x2e8 [bcache]
 register_bdev_worker+0x228/0x450 [bcache]
 process_one_work+0x200/0x4d8
 worker_thread+0x148/0x558
 kthread+0x114/0x120
 ret_from_fork+0x10/0x20
---[ end trace e326483a1d681714 ]---
bcache: register_bdev() error nvme16n1: cannot allocate memory
bcache: register_bdev_worker() error /dev/nvme16n1: fail to register 
backing device
bcache: bcache_device_free() bcache device (NULL gendisk) stopped

I tracked down the root cause: in this new HW the disks have an 
optimal_io_size of 4096. Doing some maths, it's easy to find out that 
this makes bcache initialization fails for all the backing disks greater 
than 2 TiB. Is this a well-known limitation?

Analyzing bcache_device_init I came up with a doubt:
...
	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
	if (!n || n > max_stripes) {
		pr_err("nr_stripes too large or invalid: %llu (start sector beyond end 
of disk?)\n",
			n);
		return -ENOMEM;
	}
	d->nr_stripes = n;

	n = d->nr_stripes * sizeof(atomic_t);
	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
...
Is it normal that n is been checked against max_stripes _before_ its 
value gets changed by a multiply it by sizeof(atomic_t) ? Shouldn't the 
check happen just before trying to kvzalloc n?

Another consideration, stripe_sectors_dirty and full_dirty_stripes, the 
two arrays allocated using n, are being used just in writeback mode, is 
this correct? In my specific case, I'm not planning to use writeback 
mode so I would expect bcache to not even try to create those arrays. 
Or, at least, to not create them during initialization but just in case 
of a change in the working mode (i.e. write-through -> writeback).

Best regards,
Andrea