Hi Coly,
recently I was testing bcache on new HW when, while creating a bcache
device with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
------------[ cut here ]------------
WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter
nls_iso8859_1 xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc
scsi_dh_alua bcache crc64 raid0 aes_ce_blk crypto_simd cryptd
aes_ce_cipher crct10dif_ce ghash_ce sha2_ce sha256_arm64 ena sha1_ce
sch_fq_codel drm efi_pstore ip_tables x_tables autofs4
CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
#44~20.04.1-Ubuntu
Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
Workqueue: events register_bdev_worker [bcache]
pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : kvmalloc_node+0x12c/0x178
lr : kvmalloc_node+0x74/0x178
sp : ffff80000ea4bc90
x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
Call trace:
kvmalloc_node+0x12c/0x178
bcache_device_init+0x80/0x2e8 [bcache]
register_bdev_worker+0x228/0x450 [bcache]
process_one_work+0x200/0x4d8
worker_thread+0x148/0x558
kthread+0x114/0x120
ret_from_fork+0x10/0x20
---[ end trace e326483a1d681714 ]---
bcache: register_bdev() error nvme16n1: cannot allocate memory
bcache: register_bdev_worker() error /dev/nvme16n1: fail to register
backing device
bcache: bcache_device_free() bcache device (NULL gendisk) stopped
I tracked down the root cause: in this new HW the disks have an
optimal_io_size of 4096. Doing some maths, it's easy to find out that
this makes bcache initialization fails for all the backing disks greater
than 2 TiB. Is this a well-known limitation?
Analyzing bcache_device_init I came up with a doubt:
...
n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
if (!n || n > max_stripes) {
pr_err("nr_stripes too large or invalid: %llu (start sector beyond end
of disk?)\n",
n);
return -ENOMEM;
}
d->nr_stripes = n;
n = d->nr_stripes * sizeof(atomic_t);
d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
...
Is it normal that n is been checked against max_stripes _before_ its
value gets changed by a multiply it by sizeof(atomic_t) ? Shouldn't the
check happen just before trying to kvzalloc n?
Another consideration, stripe_sectors_dirty and full_dirty_stripes, the
two arrays allocated using n, are being used just in writeback mode, is
this correct? In my specific case, I'm not planning to use writeback
mode so I would expect bcache to not even try to create those arrays.
Or, at least, to not create them during initialization but just in case
of a change in the working mode (i.e. write-through -> writeback).
Best regards,
Andrea