On Fri, Sep 22, 2023 at 03:26:46PM +0200, Andrea Tomassetti wrote:
> Hi Coly,
> recently I was testing bcache on new HW and, while creating a bcache device
> with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
>
> ------------[ cut here ]------------
> WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
> Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
> poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
> nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter nls_iso8859_1
> xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bcache
> crc64 raid0 aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce
> ghash_ce sha2_ce sha256_arm64 ena sha1_ce sch_fq_codel drm efi_pstore
> ip_tables x_tables autofs4
> CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
> #44~20.04.1-Ubuntu
> Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
> Workqueue: events register_bdev_worker [bcache]
> pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : kvmalloc_node+0x12c/0x178
> lr : kvmalloc_node+0x74/0x178
> sp : ffff80000ea4bc90
> x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
> x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
> x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
> x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
> x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
> x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
> x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
> x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
> Call trace:
>  kvmalloc_node+0x12c/0x178
>  bcache_device_init+0x80/0x2e8 [bcache]
>  register_bdev_worker+0x228/0x450 [bcache]
>  process_one_work+0x200/0x4d8
>  worker_thread+0x148/0x558
>  kthread+0x114/0x120
>  ret_from_fork+0x10/0x20
> ---[ end trace e326483a1d681714 ]---
> bcache: register_bdev() error nvme16n1: cannot allocate memory
> bcache: register_bdev_worker() error /dev/nvme16n1: fail to register backing device
> bcache: bcache_device_free() bcache device (NULL gendisk) stopped
>
> I tracked down the root cause: in this new HW the disks have an
> optimal_io_size of 4096. Doing some maths, it's easy to find out that this
> makes bcache initialization fail for all backing disks greater than 2 TiB.
> Is this a well-known limitation?
>
> Analyzing bcache_device_init I came up with a doubt:
> ...
> 	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
> 	if (!n || n > max_stripes) {
> 		pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of disk?)\n",
> 		       n);
> 		return -ENOMEM;
> 	}
> 	d->nr_stripes = n;
>
> 	n = d->nr_stripes * sizeof(atomic_t);
> 	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
> ...
> Is it normal that n is checked against max_stripes _before_ its value gets
> multiplied by sizeof(atomic_t)? Shouldn't the check happen just before
> trying to kvzalloc n?

The issue is triggered by d->nr_stripes, which is derived from
q->limits.io_opt (8 sectors here). Normally backing devices announce an
io_opt of 0 sectors, so d->stripe_size becomes 1 << 31 in
bcache_device_init() and the number n from DIV_ROUND_UP_ULL() stays quite
small. When io_opt is 8 sectors, n from DIV_ROUND_UP_ULL() can be quite
large for a big backing device, e.g. 2 TiB.
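To make the numbers concrete, here is a rough back-of-the-envelope
calculation as a standalone C program (illustrative only; the 512-byte
sector size and sizeof(atomic_t) == 4 are assumptions, and max_stripes
works out to INT_MAX on 64-bit):

	#include <limits.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t sectors     = 1ULL << 32; /* 2 TiB backing device, in 512-byte sectors */
		uint64_t stripe_size = 8;          /* io_opt = 4096 bytes -> 8 sectors */
		uint64_t n     = (sectors + stripe_size - 1) / stripe_size; /* DIV_ROUND_UP_ULL() */
		uint64_t bytes = n * 4;            /* n * sizeof(atomic_t) */

		/*
		 * n == 536870912 passes the "n > max_stripes" check, but the
		 * resulting 2 GiB allocation size exceeds INT_MAX, so
		 * kvmalloc_node() warns and returns NULL.
		 */
		printf("n = %llu, kvzalloc size = %llu (INT_MAX = %d)\n",
		       (unsigned long long)n, (unsigned long long)bytes, INT_MAX);
		return 0;
	}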
Therefore the key point is not checking n after it is multiplied by
sizeof(atomic_t); the problem is n itself -- its value is already too big.
Maybe bcache should not use q->limits.io_opt directly as d->stripe_size; it
should be some value less than 1 << 31 that is aligned to optimal_io_size.
After the code has been in the kernel for 10+ years, it is time to improve
this calculation :-)

> Another consideration: stripe_sectors_dirty and full_dirty_stripes, the two
> arrays allocated using n, are only used in writeback mode, is this correct?
> In my specific case, I'm not planning to use writeback mode, so I would
> expect bcache to not even try to create those arrays. Or, at least, to not
> create them during initialization but only on a change of the working mode
> (i.e. write-through -> writeback).

Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
idea, but it is blocked by other dependent patches that I have not finished
yet. Yes, I like the idea of dynamically allocating/freeing
d->stripe_sectors_dirty and d->full_dirty_stripes only when they are
necessary (a rough sketch of the idea is appended below), and I hope I can
help get the change into upstream sooner. I will post a patch for your
testing. Thanks in advance.

--
Coly Li
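For reference, a minimal sketch of the lazy-allocation idea discussed above.
This is neither Mingzhe Zou's patch nor the one to be posted; the helper
name, the call site (e.g. when the cache mode is first switched to
writeback), and the bitmap sizing for full_dirty_stripes are assumptions
based on how bcache_device_init() sizes these arrays today:

	/*
	 * Allocate the dirty-stripe tracking arrays on demand, e.g. when the
	 * cache mode is switched to writeback, instead of unconditionally in
	 * bcache_device_init().
	 */
	static int bcache_dev_alloc_dirty_stripes(struct bcache_device *d)
	{
		size_t bytes;

		if (d->stripe_sectors_dirty)		/* already allocated */
			return 0;

		bytes = (size_t)d->nr_stripes * sizeof(atomic_t);
		d->stripe_sectors_dirty = kvzalloc(bytes, GFP_KERNEL);
		if (!d->stripe_sectors_dirty)
			return -ENOMEM;

		bytes = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
		d->full_dirty_stripes = kvzalloc(bytes, GFP_KERNEL);
		if (!d->full_dirty_stripes) {
			kvfree(d->stripe_sectors_dirty);
			d->stripe_sectors_dirty = NULL;
			return -ENOMEM;
		}

		return 0;
	}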