> On 23 Sep 2023, at 22:29, Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> wrote:
>
> Hi Coly,
> thank you very much for your quick reply.
>
> On 22/9/23 16:22, Coly Li wrote:
>> On Fri, Sep 22, 2023 at 03:26:46PM +0200, Andrea Tomassetti wrote:
>>> Hi Coly,
>>> recently I was testing bcache on new HW when, while creating a bcache device
>>> with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
>>>
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
>>> Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
>>> nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
>>> poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
>>> nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter nls_iso8859_1
>>> xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bcache
>>> crc64 raid0 aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce
>>> ghash_ce sha2_ce sha256_arm64 ena sha1_ce sch_fq_codel drm efi_pstore
>>> ip_tables x_tables autofs4
>>> CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
>>> #44~20.04.1-Ubuntu
>>> Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
>>> Workqueue: events register_bdev_worker [bcache]
>>> pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> pc : kvmalloc_node+0x12c/0x178
>>> lr : kvmalloc_node+0x74/0x178
>>> sp : ffff80000ea4bc90
>>> x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
>>> x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
>>> x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
>>> x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
>>> x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
>>> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
>>> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
>>> x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
>>> x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
>>> x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
>>> Call trace:
>>> kvmalloc_node+0x12c/0x178
>>> bcache_device_init+0x80/0x2e8 [bcache]
>>> register_bdev_worker+0x228/0x450 [bcache]
>>> process_one_work+0x200/0x4d8
>>> worker_thread+0x148/0x558
>>> kthread+0x114/0x120
>>> ret_from_fork+0x10/0x20
>>> ---[ end trace e326483a1d681714 ]---
>>> bcache: register_bdev() error nvme16n1: cannot allocate memory
>>> bcache: register_bdev_worker() error /dev/nvme16n1: fail to register backing
>>> device
>>> bcache: bcache_device_free() bcache device (NULL gendisk) stopped
>>>
>>> I tracked down the root cause: on this new HW the disks report an
>>> optimal_io_size of 4096. Doing some maths, it's easy to see that this
>>> makes bcache initialization fail for all backing disks larger than 2 TiB.
>>> Is this a well-known limitation?
>>>
>>> Analyzing bcache_device_init I came up with a doubt:
>>> ...
>>>     n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
>>>     if (!n || n > max_stripes) {
>>>             pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of disk?)\n",
>>>                    n);
>>>             return -ENOMEM;
>>>     }
>>>     d->nr_stripes = n;
>>>
>>>     n = d->nr_stripes * sizeof(atomic_t);
>>>     d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
>>> ...
>>> Is it normal that n is checked against max_stripes _before_ its value
>>> gets multiplied by sizeof(atomic_t)? Shouldn't the check happen just
>>> before trying to kvzalloc n?
>>>
>> The issue was triggered by d->nr_stripes, which originally comes from
>> q->limits.io_opt, here 8 sectors. Normally the backing devices announce
>> an io_opt of 0 sectors, so d->stripe_size becomes 1 << 31 in
>> bcache_device_init() and the number n from DIV_ROUND_UP_ULL() will be
>> quite small. When io_opt is 8 sectors, the number n from
>> DIV_ROUND_UP_ULL() can be quite big for a large backing device, e.g. 2 TB.
> Thanks for the explanation; that was already clear to me, but I didn't
> include it in my previous message. I just glossed over it with the
> expression "doing some maths".
>
>> Therefore the key point is not checking n after it is multiplied by
>> sizeof(atomic_t); the problem is n itself -- the value is too big.
> What I was trying to point out about when n gets checked is that there are
> cases in which the check (if (!n || n > max_stripes)) passes but then,
> because n gets multiplied by sizeof(atomic_t), the kvzalloc fails. We could
> have prevented this failure by checking n after multiplying it, no?

I noticed this message. The root cause is a too-big 'n' value; checking it
before or after the multiplication doesn't help much. What I want is to
avoid the memory allocation failure, not to avoid calling kvzalloc() when
the 'n' value is too big.
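
To make the numbers concrete, the arithmetic behind the failure works out as
below. This is only a minimal userspace sketch of the calculation, not bcache
code, and it assumes 512-byte sectors, a 4-byte atomic_t, and kvmalloc()
refusing any request larger than INT_MAX (the check that fires at
mm/util.c:630):

/*
 * Sketch of the arithmetic only -- not kernel code.  Assumptions:
 * 512-byte sectors, sizeof(atomic_t) == 4, and kvmalloc() rejecting
 * any allocation larger than INT_MAX bytes.
 */
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t dev_bytes   = 2ULL << 40;     /* 2 TiB backing device            */
	uint64_t sectors     = dev_bytes >> 9; /* 2^32 sectors                    */
	uint64_t io_opt      = 4096;           /* reported optimal_io_size, bytes */
	uint64_t stripe_size = io_opt >> 9;    /* d->stripe_size = 8 sectors      */

	/* n = DIV_ROUND_UP_ULL(sectors, d->stripe_size) */
	uint64_t nr_stripes  = (sectors + stripe_size - 1) / stripe_size; /* 2^29 */
	/* n = d->nr_stripes * sizeof(atomic_t) */
	uint64_t alloc_bytes = nr_stripes * 4;                            /* 2 GiB */

	printf("nr_stripes  = %llu\n", (unsigned long long)nr_stripes);
	printf("alloc_bytes = %llu, INT_MAX = %d\n",
	       (unsigned long long)alloc_bytes, INT_MAX);
	printf("kvzalloc()  %s\n",
	       alloc_bytes > INT_MAX ? "fails (size > INT_MAX)" : "succeeds");
	return 0;
}

With the default stripe_size of 1 << 31 sectors the same device would need
only 2 stripes, which is why the problem only shows up when io_opt is small
and the backing device is large.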

>> Maybe bcache should not directly use q->limits.io_opt as d->stripe_size;
>> it should be some value less than 1<<31 and aligned to optimal_io_size.
>> After the code has been in the kernel for 10+ years, it is time to improve
>> this calculation :-)
>
> Yeah, one of the other doubts I had was exactly about this value and whether
> it is still appropriate to calculate it that way. Unfortunately, I don't
> have the expertise to have an opinion on it. Would you be so kind as to
> share your knowledge and explain why it is calculated this way and why it
> uses the optimal io size? Is it used to write back optimal-sized blocks?
>

In most cases the underlying hardware doesn't declare an optimal io size, and
bcache uses 1<<31 as the default stripe size. That has worked fine for a
decade, so I will keep using it and make sure it is aligned to the value of
the optimal io size. It should work fine. I will compose a simple patch for
this fix.

>>> Another consideration: stripe_sectors_dirty and full_dirty_stripes, the
>>> two arrays allocated using n, are used only in writeback mode, is this
>>> correct? In my specific case, I'm not planning to use writeback mode, so I
>>> would expect bcache not to even try to create those arrays. Or, at least,
>>> not to create them during initialization but only on a change of the
>>> working mode (i.e. write-through -> writeback).
>> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
>> idea, but it is blocked by other dependent patches which I have not
>> finished. Yes, I like the idea of dynamically allocating/freeing
>> d->stripe_sectors_dirty and d->full_dirty_stripes only when they are
>> necessary. I hope I can help to get the change into upstream sooner.
>> I will post a patch for your testing.
> This would be great! Thank you very much! On the other hand, if there's
> anything I can do to help, I would be glad to contribute.

I will post a simple patch for the reported memory allocation failure for you
to test soon.

Thanks.

Coly Li
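
For illustration, the stripe-size adjustment described above could look
roughly like the sketch below. This is only a guess at the shape of the fix,
not the actual patch: the helper name is invented, and the strategy of
keeping the historical 1 << 31 default and merely rounding it to a multiple
of the reported optimal io size is an assumption based on the description in
the thread.

/*
 * Illustrative sketch only -- not the real bcache patch.  The idea, as
 * described above: keep the historical default stripe size (1 << 31
 * sectors) and make it a multiple of the device's optimal io size,
 * instead of using the (possibly tiny) io_opt value directly as
 * d->stripe_size.  Helper name and rounding strategy are assumptions.
 */
#include <stdint.h>
#include <stdio.h>

#define DEFAULT_STRIPE_SECTORS (1ULL << 31)   /* bcache's historical default */

static uint64_t pick_stripe_size(uint64_t io_opt_sectors)
{
	uint64_t stripe_size = DEFAULT_STRIPE_SECTORS;

	/* Round the default down to a multiple of io_opt, if one is reported. */
	if (io_opt_sectors && io_opt_sectors < stripe_size)
		stripe_size -= stripe_size % io_opt_sectors;

	return stripe_size;
}

int main(void)
{
	uint64_t sectors = (2ULL << 40) >> 9;            /* 2 TiB device, in sectors */
	uint64_t stripe  = pick_stripe_size(4096 >> 9);  /* io_opt = 4 KiB           */
	uint64_t n       = (sectors + stripe - 1) / stripe;

	/* With io_opt = 8 sectors this keeps stripe_size at 2^31 sectors,
	 * so nr_stripes is 2 and the per-stripe arrays stay tiny. */
	printf("stripe_size = %llu sectors, nr_stripes = %llu\n",
	       (unsigned long long)stripe, (unsigned long long)n);
	return 0;
}

Whether the eventual patch rounds up, rounds down, or picks a different base
value is up to the real implementation; the point is only that nr_stripes
stays bounded regardless of how small io_opt is.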