Hi Coli, Things are stable so far, looks really promising: bash-5.0$ cat /proc/version Linux version 5.12.8-200.rf.fc33.x86_64 (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1), GNU ld version 2.35-18.fc33) #1 SMP Tue Jun 1 23:10:39 CEST 2021 bash-5.0$ uptime 11:42:05 up 12:01, 1 user, load average: 0.84, 2.43, 3.20 bash-5.0$ I left the system running during the night, and have been using it actively for 3 hours now. I'll keep you posted if anything changes, but this is a major step forward for sure (before your patches the system would freeze in minutes after booting). Rolf On 6/2/21 10:33 AM, Thorsten Knabe wrote: > Hello Coly. > > On 6/1/21 5:34 PM, Coly Li wrote: >> On 5/31/21 5:37 PM, Rolf Fokkens wrote: >>> Same here, more details: https://bugzilla.redhat.com/show_bug.cgi?id=1965809 >> Could you please try these attached patches ? I'd like to have more >> confirmation before submitting to Jens. > Running 5.12.8 + your v5 patches for a few hours now on that degraded > RAID-6 system without any issues. If anything changes, I'll let you know. > > Thorsten >> Thanks. >> >> Coly Li >> >> >> >>> On 5/15/21 9:06 PM, Thorsten Knabe wrote: >>>> Hello. >>>> >>>> Starting with Linux 5.12 bcache triggers a BUG() after a few minutes of >>>> usage. >>>> >>>> Linux up to 5.11.x is not affected by this bug. >>>> >>>> Environment: >>>> - Debian 10 AMD 64 >>>> - Kernel 5.12 - 5.12.4 >>>> - Filesystem ext4 >>>> - Backing device: degraded software RAID-6 (MD) with 3 of 4 disks active >>>> (unsure if the degraded RAID-6 has an effect or not) >>>> - Cache device: Single SSD >>>> >>>> Kernel log: >>>> >>>> May 12 20:22:24 tek04 kernel: nr_vecs=472 >>>> May 12 20:22:24 tek04 kernel: ------------[ cut here ]------------ >>>> May 12 20:22:24 tek04 kernel: kernel BUG at block/bio.c:53! >>>> May 12 20:22:24 tek04 kernel: invalid opcode: 0000 [#1] PREEMPT SMP PTI >>>> May 12 20:22:24 tek04 kernel: CPU: 1 PID: 1670 Comm: grep Tainted: G >>>> I 5.12.3 #2 >>>> May 12 20:22:24 tek04 kernel: Hardware name: To Be Filled By O.E.M. To >>>> Be Filled By O.E.M./X58 Deluxe, BIOS P2.20 10/30/2009 >>>> May 12 20:22:24 tek04 kernel: RIP: 0010:biovec_slab.cold.45+0xf/0x11 >>>> May 12 20:22:24 tek04 kernel: Code: b3 ae ff 89 c6 48 c7 c7 30 82 21 82 >>>> e8 03 81 fe ff b8 b6 ff ff ff e9 3d bc ae ff 0f b7 f7 48 c7 c7 c4 82 21 >>>> 82 e8 ea 80 fe ff <0f> 0b 49 8b b4 24 d0 00 00 00 48 c7 c7 40 84 21 82 >>>> e8 d4 80 fe ff >>>> May 12 20:22:24 tek04 kernel: RSP: 0018:ffffc9000274b730 EFLAGS: 00010292 >>>> May 12 20:22:24 tek04 kernel: RAX: 000000000000000b RBX: >>>> ffffc9000274b764 RCX: 0000000000000000 >>>> May 12 20:22:24 tek04 kernel: bch_count_backing_io_errors: 1 callbacks >>>> suppressed >>>> May 12 20:22:24 tek04 kernel: bcache: bch_count_backing_io_errors() md1: >>>> Read-ahead I/O failed on backing device, ignore >>>> May 12 20:22:24 tek04 kernel: RDX: ffff888333c5e400 RSI: >>>> ffff888333c57480 RDI: ffff888333c57480 >>>> May 12 20:22:24 tek04 kernel: RBP: 0000000000000800 R08: >>>> 0000000000000000 R09: c0000000ffffdfff >>>> May 12 20:22:24 tek04 kernel: R10: ffffc9000274b580 R11: >>>> ffffc9000274b578 R12: ffff8881170f0118 >>>> May 12 20:22:24 tek04 kernel: R13: ffff888109911b00 R14: >>>> ffff8881170f00d0 R15: 0000000000000800 >>>> May 12 20:22:24 tek04 kernel: FS: 00007f3a1ca7cb80(0000) >>>> GS:ffff888333c40000(0000) knlGS:0000000000000000 >>>> May 12 20:22:24 tek04 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: >>>> 0000000080050033 >>>> May 12 20:22:24 tek04 kernel: CR2: 00005628963f4fd0 CR3: >>>> 000000016c43c000 CR4: 00000000000006e0 >>>> May 12 20:22:24 tek04 kernel: Call Trace: >>>> May 12 20:22:24 tek04 kernel: bvec_alloc+0x22/0x90 >>>> May 12 20:22:24 tek04 kernel: bio_alloc_bioset+0x176/0x230 >>>> May 12 20:22:24 tek04 kernel: cached_dev_cache_miss+0x1a8/0x300 >>>> May 12 20:22:24 tek04 kernel: cache_lookup_fn+0x110/0x2e0 >>>> May 12 20:22:24 tek04 kernel: ? bch_ptr_invalid+0x10/0x10 >>>> May 12 20:22:24 tek04 kernel: ? bch_btree_iter_next_filter+0x1af/0x2d0 >>>> May 12 20:22:24 tek04 kernel: ? cache_lookup+0x190/0x190 >>>> May 12 20:22:24 tek04 kernel: bch_btree_map_keys_recurse+0x69/0x160 >>>> May 12 20:22:24 tek04 kernel: ? __bch_bset_search+0x315/0x440 >>>> May 12 20:22:24 tek04 kernel: ? downgrade_write+0xb0/0xb0 >>>> May 12 20:22:24 tek04 kernel: ? cache_lookup+0x190/0x190 >>>> May 12 20:22:24 tek04 kernel: bch_btree_map_keys_recurse+0xcf/0x160 >>>> May 12 20:22:24 tek04 kernel: ? raid5_make_request+0x5c4/0xaa0 >>>> May 12 20:22:24 tek04 kernel: ? recalibrate_cpu_khz+0x10/0x10 >>>> May 12 20:22:24 tek04 kernel: ? kmem_cache_alloc+0x30/0x400 >>>> May 12 20:22:24 tek04 kernel: ? rwsem_wake.isra.11+0x80/0x80 >>>> May 12 20:22:24 tek04 kernel: bch_btree_map_keys+0xf2/0x140 >>>> May 12 20:22:24 tek04 kernel: ? cache_lookup+0x190/0x190 >>>> May 12 20:22:24 tek04 kernel: cache_lookup+0xb1/0x190 >>>> May 12 20:22:24 tek04 kernel: cached_dev_submit_bio+0x9ab/0xc90 >>>> May 12 20:22:24 tek04 kernel: ? submit_bio_checks+0x197/0x4a0 >>>> May 12 20:22:24 tek04 kernel: ? kmem_cache_alloc+0x3b7/0x400 >>>> May 12 20:22:24 tek04 kernel: submit_bio_noacct+0x10e/0x4c0 >>>> May 12 20:22:24 tek04 kernel: submit_bio+0x2e/0x160 >>>> May 12 20:22:24 tek04 kernel: ? xa_load+0x66/0x70 >>>> May 12 20:22:24 tek04 kernel: ? bio_add_page+0x2f/0x70 >>>> May 12 20:22:24 tek04 kernel: ext4_mpage_readpages+0x1ae/0xa00 >>>> May 12 20:22:24 tek04 kernel: ? __mod_lruvec_state+0x29/0x60 >>>> May 12 20:22:24 tek04 kernel: read_pages+0x78/0x1d0 >>>> May 12 20:22:24 tek04 kernel: page_cache_ra_unbounded+0x127/0x1b0 >>>> May 12 20:22:24 tek04 kernel: filemap_get_pages+0x1d0/0x4a0 >>>> May 12 20:22:24 tek04 kernel: filemap_read+0x91/0x2d0 >>>> May 12 20:22:24 tek04 kernel: new_sync_read+0x103/0x180 >>>> May 12 20:22:24 tek04 kernel: vfs_read+0x11b/0x1b0 >>>> May 12 20:22:24 tek04 kernel: ksys_read+0x55/0xd0 >>>> May 12 20:22:24 tek04 kernel: do_syscall_64+0x33/0x80 >>>> May 12 20:22:24 tek04 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae >>>> May 12 20:22:24 tek04 kernel: RIP: 0033:0x7f3a1cb89461 >>>> May 12 20:22:24 tek04 kernel: Code: fe ff ff 50 48 8d 3d fe d0 09 00 e8 >>>> e9 03 02 00 66 0f 1f 84 00 00 00 00 00 48 8d 05 99 62 0d 00 8b 00 85 c0 >>>> 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 41 54 >>>> 49 89 d4 55 48 >>>> May 12 20:22:24 tek04 kernel: RSP: 002b:00007fff4052fff8 EFLAGS: >>>> 00000246 ORIG_RAX: 0000000000000000 >>>> May 12 20:22:24 tek04 kernel: RAX: ffffffffffffffda RBX: >>>> 0000000000018000 RCX: 00007f3a1cb89461 >>>> May 12 20:22:24 tek04 kernel: RDX: 0000000000018000 RSI: >>>> 000055ff0d01a000 RDI: 0000000000000004 >>>> May 12 20:22:24 tek04 kernel: RBP: 0000000000018000 R08: >>>> 0000000000000002 R09: 000055ff0d0194b0 >>>> May 12 20:22:24 tek04 kernel: R10: 0000000000000000 R11: >>>> 0000000000000246 R12: 000055ff0d01a000 >>>> May 12 20:22:24 tek04 kernel: R13: 0000000000000004 R14: >>>> 000055ff0d0194b0 R15: 0000000000000004 >>>> May 12 20:22:24 tek04 kernel: Modules linked in: cmac bnep >>>> intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic btusb >>>> ledtrig_audio snd_hda_codec_hdmi btrtl kvm_intel snd_hda_intel btbcm >>>> snd_intel_dspcfg btintel kvm snd_hda_codec irqbypass bluetooth serio_raw >>>> pcspkr hfcpci snd_hda_core iTCO_wdt evdev input_leds joydev sg snd_hwdep >>>> intel_pmc_bxt ecdh_generic rfkill iTCO_vendor_support mISDN_core ecc >>>> snd_pcm snd_timer tiny_power_button snd soundcore button i7core_edac >>>> acpi_cpufreq wmi nft_counter nf_log_ipv6 nf_log_ipv >>>> May 12 20:22:24 tek04 kernel: ---[ end trace 9a03f30c7b4aa246 ]--- >>>> May 12 20:22:25 tek04 kernel: RIP: 0010:biovec_slab.cold.45+0xf/0x11 >>>> May 12 20:22:25 tek04 kernel: Code: b3 ae ff 89 c6 48 c7 c7 30 82 21 82 >>>> e8 03 81 fe ff b8 b6 ff ff ff e9 3d bc ae ff 0f b7 f7 48 c7 c7 c4 82 21 >>>> 82 e8 ea 80 fe ff <0f> 0b 49 8b b4 24 d0 00 00 00 48 c7 c7 40 84 21 82 >>>> e8 d4 80 fe ff >>>> May 12 20:22:25 tek04 kernel: RSP: 0018:ffffc9000274b730 EFLAGS: 00010292 >>>> May 12 20:22:25 tek04 kernel: RAX: 000000000000000b RBX: >>>> ffffc9000274b764 RCX: 0000000000000000 >>>> May 12 20:22:25 tek04 kernel: RDX: ffff888333c5e400 RSI: >>>> ffff888333c57480 RDI: ffff888333c57480 >>>> May 12 20:22:25 tek04 kernel: RBP: 0000000000000800 R08: >>>> 0000000000000000 R09: c0000000ffffdfff >>>> May 12 20:22:25 tek04 kernel: R10: ffffc9000274b580 R11: >>>> ffffc9000274b578 R12: ffff8881170f0118 >>>> May 12 20:22:25 tek04 kernel: R13: ffff888109911b00 R14: >>>> ffff8881170f00d0 R15: 0000000000000800 >>>> May 12 20:22:25 tek04 kernel: FS: 00007f3a1ca7cb80(0000) >>>> GS:ffff888333c40000(0000) knlGS:0000000000000000 >>>> May 12 20:22:25 tek04 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: >>>> 0000000080050033 >>>> May 12 20:22:25 tek04 kernel: CR2: 00005628963f4fd0 CR3: >>>> 000000016c43c000 CR4: 00000000000006e0 >>>> >>>> A printk has been added to line 52 of block/bio.c to dump the nr_vecs >>>> variable to the kernel log before the BUG(). Obviously nr_vecs (472 >>>> logged) is bigger than expected by bvec_alloc/bio_alloc_bioset (max >>>> 256), which finally triggers the BUG(). >>>> >>>> Removing the BUG() from line 52 of block/bio.c, thus basically restoring >>>> the Linux 5.11.x behavior of bvec_alloc/bio_alloc_bioset to just return >>>> NULL, when nr_vecs is too big seems to resolve the issue. >>>> >>>> Regards >>>> Thorsten >>>> >