On Fri, Sep 22, 2023 at 03:26:46PM +0200, Andrea Tomassetti wrote:
> Hi Coly,
> recently I was testing bcache on new HW and, while creating a bcache device
> with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
>
> ------------[ cut here ]------------
> WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
> Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
> poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
> nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter nls_iso8859_1
> xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bcache
> crc64 raid0 aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce
> ghash_ce sha2_ce sha256_arm64 ena sha1_ce sch_fq_codel drm efi_pstore
> ip_tables x_tables autofs4
> CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
> #44~20.04.1-Ubuntu
> Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
> Workqueue: events register_bdev_worker [bcache]
> pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : kvmalloc_node+0x12c/0x178
> lr : kvmalloc_node+0x74/0x178
> sp : ffff80000ea4bc90
> x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
> x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
> x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
> x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
> x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
> x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
> x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
> x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
> Call trace:
>  kvmalloc_node+0x12c/0x178
>  bcache_device_init+0x80/0x2e8 [bcache]
>  register_bdev_worker+0x228/0x450 [bcache]
>  process_one_work+0x200/0x4d8
>  worker_thread+0x148/0x558
>  kthread+0x114/0x120
>  ret_from_fork+0x10/0x20
> ---[ end trace e326483a1d681714 ]---
> bcache: register_bdev() error nvme16n1: cannot allocate memory
> bcache: register_bdev_worker() error /dev/nvme16n1: fail to register backing device
> bcache: bcache_device_free() bcache device (NULL gendisk) stopped
>
> I tracked down the root cause: in this new HW the disks have an
> optimal_io_size of 4096. Doing some maths, it's easy to find out that this
> makes bcache initialization fail for all backing disks greater than 2 TiB.
> Is this a well-known limitation?
>
> Analyzing bcache_device_init I came up with a doubt:
> ...
> 	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
> 	if (!n || n > max_stripes) {
> 		pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of disk?)\n",
> 		       n);
> 		return -ENOMEM;
> 	}
> 	d->nr_stripes = n;
>
> 	n = d->nr_stripes * sizeof(atomic_t);
> 	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
> ...
> Is it normal that n is checked against max_stripes _before_ its value gets
> multiplied by sizeof(atomic_t)? Shouldn't the check happen just before
> trying to kvzalloc n?

The issue is triggered by d->nr_stripes, which is derived from
q->limits.io_opt (8 sectors here). Normally backing devices announce an
io_opt of 0 sectors, so d->stripe_size becomes 1 << 31 in
bcache_device_init() and the number n from DIV_ROUND_UP_ULL() stays quite
small. When io_opt is 8 sectors, n from DIV_ROUND_UP_ULL() can be quite
large for a big backing device, e.g. 2 TiB.
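To make the numbers concrete, here is a rough back-of-the-envelope
calculation as a standalone C program (illustrative only; the 512-byte
sector size and sizeof(atomic_t) == 4 are assumptions, and max_stripes
works out to INT_MAX on 64-bit):

	#include <limits.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t sectors     = 1ULL << 32; /* 2 TiB backing device, in 512-byte sectors */
		uint64_t stripe_size = 8;          /* io_opt = 4096 bytes -> 8 sectors */
		uint64_t n     = (sectors + stripe_size - 1) / stripe_size; /* DIV_ROUND_UP_ULL() */
		uint64_t bytes = n * 4;            /* n * sizeof(atomic_t) */

		/*
		 * n == 536870912 passes the "n > max_stripes" check, but the
		 * resulting 2 GiB allocation size exceeds INT_MAX, so
		 * kvmalloc_node() warns and returns NULL.
		 */
		printf("n = %llu, kvzalloc size = %llu (INT_MAX = %d)\n",
		       (unsigned long long)n, (unsigned long long)bytes, INT_MAX);
		return 0;
	}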
Therefore the key point is not checking n after it is multiplied by
sizeof(atomic_t); the problem is n itself -- its value is already too big.
Maybe bcache should not use q->limits.io_opt directly as d->stripe_size; it
should be some value less than 1 << 31 that is aligned to optimal_io_size.
After the code has been in the kernel for 10+ years, it is time to improve
this calculation :-)

> Another consideration: stripe_sectors_dirty and full_dirty_stripes, the two
> arrays allocated using n, are only used in writeback mode, is this correct?
> In my specific case, I'm not planning to use writeback mode, so I would
> expect bcache to not even try to create those arrays. Or, at least, to not
> create them during initialization but only on a change of the working mode
> (i.e. write-through -> writeback).

Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
idea, but it is blocked by other dependent patches that I have not finished
yet. Yes, I like the idea of dynamically allocating/freeing
d->stripe_sectors_dirty and d->full_dirty_stripes only when they are
necessary (a rough sketch of the idea is appended below), and I hope I can
help get the change into upstream sooner. I will post a patch for your
testing. Thanks in advance.

--
Coly Li
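For reference, a minimal sketch of the lazy-allocation idea discussed above.
This is neither Mingzhe Zou's patch nor the one to be posted; the helper
name, the call site (e.g. when the cache mode is first switched to
writeback), and the bitmap sizing for full_dirty_stripes are assumptions
based on how bcache_device_init() sizes these arrays today:

	/*
	 * Allocate the dirty-stripe tracking arrays on demand, e.g. when the
	 * cache mode is switched to writeback, instead of unconditionally in
	 * bcache_device_init().
	 */
	static int bcache_dev_alloc_dirty_stripes(struct bcache_device *d)
	{
		size_t bytes;

		if (d->stripe_sectors_dirty)		/* already allocated */
			return 0;

		bytes = (size_t)d->nr_stripes * sizeof(atomic_t);
		d->stripe_sectors_dirty = kvzalloc(bytes, GFP_KERNEL);
		if (!d->stripe_sectors_dirty)
			return -ENOMEM;

		bytes = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
		d->full_dirty_stripes = kvzalloc(bytes, GFP_KERNEL);
		if (!d->full_dirty_stripes) {
			kvfree(d->stripe_sectors_dirty);
			d->stripe_sectors_dirty = NULL;
			return -ENOMEM;
		}

		return 0;
	}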