On 3 Nov 2022 12:54:44 +0100 Jan Kara <jack@xxxxxxx> > Hello, > > I was tracking down the following crash with 6.0 kernel with > patch-6.0.5-rt14.patch applied: > > [ T6611] ------------[ cut here ]------------ > [ T6611] kernel BUG at fs/inode.c:625! > [ T6611] Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP > [ T6611] Modules linked in: xfs(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) mlx5_ib(E) ib_uverbs(E) ib_core(E) arm_spe_pmu(E) mlx5_core(E) sunrpc(E) mlxfw(E) pci_hyperv_intf(E) nls_iso8859_1(E) acpi_ipmi(E) nls_cp437(E) ipmi_ssif(E) vfat(E) ipmi_devintf(E) tls(E) igb(E) psample(E) button(E) arm_cmn(E) arm_dmc620_pmu(E) ipmi_msghandler(E) fat(E) cppc_cpufreq(E) arm_dsu_pmu(E) fuse(E) ip_tables(E) x_tables(E) ast(E) i2c_algo_bit(E) drm_vram_helper(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) gf128mul(E) nvme(E) drm_kms_helper(E) sha2_ce(E) syscopyarea(E) sha256_arm64(E) sysfillrect(E) xhci_pci(E) sha1_ce(E) sysimgblt(E) nvme_core(E) xhci_pci_renesas(E) fb_sys_fops(E) nvme_common(E) drm_ttm_helper(E) sbsa_gwdt(E) t10_pi(E) ttm(E) xhci_hcd(E) crc64_rocksoft_generic(E) crc64_rocksoft(E) usbcore(E) crc64(E) drm(E) usb_common(E) i2c_designware_platform(E) i2c_designware_core(E) btrfs(E) blake2b_generic(E) libcrc32c(E) xor(E) xor_neon(E) > [ T6611] raid6_pq(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) > [ T6611] CPU: 11 PID: 6611 Comm: dbench Tainted: G E 6.0.0-rt14-rt+ #1 4a18df02c109f1e703cf2ff86b77cf9cd9d5a188 > [ T6611] Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F16f (SCP: 1.06.20210615) 07/01/2021 > [ T6611] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ T6611] pc : clear_inode+0xa0/0xc0 > [ T6611] lr : clear_inode+0x38/0xc0 > [ T6611] sp : ffff80000f4f3cd0 > [ T6611] x29: ffff80000f4f3cd0 x28: ffff07ff92142000 x27: 0000000000000000 > [ T6611] x26: ffff08012aef6058 x25: 0000000000000002 x24: ffffb657395e8000 > [ T6611] x23: ffffb65739072008 x22: ffffb656e0bed0a8 x21: ffff08012aef6190 > [ T6611] x20: ffff08012aef61f8 x19: ffff08012aef6058 x18: 0000000000000014 > [ T6611] x17: 00000000f0d86255 x16: ffffb65737dfdb00 x15: 0100000004000000 > [ T6611] x14: 644d000008090000 x13: 644d000008090000 x12: ffff80000f4f3b20 > [ T6611] x11: 0000000000000002 x10: ffff083f5ffbe1c0 x9 : ffffb657388284a4 > [ T6611] x8 : fffffffffffffffe x7 : ffff80000f4f3b20 x6 : ffff80000f4f3b20 > [ T6611] x5 : ffff08012aef6210 x4 : ffff08012aef6210 x3 : 0000000000000000 > [ T6611] x2 : ffff08012aef62d8 x1 : ffff07ff8fbbf690 x0 : ffff08012aef61a0 > [ T6611] Call trace: > [ T6611] clear_inode+0xa0/0xc0 > [ T6611] evict+0x160/0x180 > [ T6611] iput+0x154/0x240 > [ T6611] do_unlinkat+0x184/0x300 > [ T6611] __arm64_sys_unlinkat+0x48/0xc0 > [ T6611] el0_svc_common.constprop.4+0xe4/0x2c0 > [ T6611] do_el0_svc+0xac/0x100 > [ T6611] el0_svc+0x78/0x200 > [ T6611] el0t_64_sync_handler+0x9c/0xc0 > [ T6611] el0t_64_sync+0x19c/0x1a0 > [ T6611] Code: d4210000 d503201f d4210000 d503201f (d4210000) > [ T6611] ---[ end trace 0000000000000000 ]--- > > The machine is aarch64 architecture, kernel config is attached. I have seen > the crashes also with 5.14-rt kernel so it is not a new thing. The crash is > triggered relatively reliably (on two different aarch64 machines) by our > performance testing framework when running dbench benchmark against an XFS > filesystem. > > Now originally I thought this is some problem with XFS or writeback code > but after debugging this for some time I don't think that anymore. > clear_inode() complains about inode->i_wb_list being non-empty. In fact > looking at the list_head, I can see it is corrupted. In all the occurences > of the problem ->prev points back to the list_head itself but ->next points > to some list_head that used to be part of the sb->s_inodes_wb list (or > actually that list spliced in wait_sb_inodes() because I've seen a pointer to > the stack as ->next pointer as well). > > This is not just some memory ordering issue with the check in > clear_inode(). If I add sb->s_inode_wblist_lock locking around the check in > clear_inode(), the problem still reproduces. > > If I enable CONFIG_DEBUG_LIST or if I convert sb->s_inode_wblist_lock to > raw_spinlock_t, the problem disappears. > > Finally, I'd note that the list is modified from three places which makes > audit relatively simple. sb_mark_inode_writeback(), > sb_clear_inode_writeback(), and wait_sb_inodes(). All these places hold > sb->s_inode_wblist_lock when modifying the list. So at this point I'm at > loss what could be causing this. As unlikely as it seems to me I've started > wondering whether it is not some subtle issue with RT spinlocks on aarch64 > possibly in combination with interrupts (because sb_clear_inode_writeback() > may be called from an interrupt). > > Any ideas? Feel free to collect debug info ONLY in your spare cycles, given your relatively reliable reproducer. Only for thoughts. Hillf +++ b/fs/fs-writeback.c @@ -1256,6 +1256,7 @@ void sb_mark_inode_writeback(struct inod if (list_empty(&inode->i_wb_list)) { spin_lock_irqsave(&sb->s_inode_wblist_lock, flags); if (list_empty(&inode->i_wb_list)) { + ihold(inode); list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb); trace_sb_mark_inode_writeback(inode); } @@ -1272,12 +1273,19 @@ void sb_clear_inode_writeback(struct ino unsigned long flags; if (!list_empty(&inode->i_wb_list)) { + int put = 0; spin_lock_irqsave(&sb->s_inode_wblist_lock, flags); if (!list_empty(&inode->i_wb_list)) { + put = 1; list_del_init(&inode->i_wb_list); trace_sb_clear_inode_writeback(inode); } spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags); + if (put) { + ihold(inode); + iput(inode); + iput(inode); + } } }