Hi, this is your Linux kernel regression tracker speaking.

Top-posting for once, to make this easily accessible to everyone.

The issue below, which started to happen between v5.10.80..v5.10.90, was
recently reported to bugzilla, but the reporter didn't even get a single
reply afaics. Could somebody maybe take a look? Bisection is likely not
easy in this case, so a few tips to narrow down the area to search might
help a lot here.

https://bugzilla.kernel.org/show_bug.cgi?id=215562

Ciao, Thorsten

On 03.02.22 16:03, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking.
>
> There is a regression in bugzilla.kernel.org I'd like to add to the
> tracking:
>
> #regzbot introduced: v5.10.80..v5.10.90
> #regzbot from: Patrick Schaaf <kernelorg@xxxxxx>
> #regzbot title: mm: unable to handle page fault in cache_reap
> #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215562
>
> Quote:
>
>> We've been running self-built 5.10.x kernels on DL380 hosts for quite a while, also inside the VMs there.
>>
>> With I think 5.10.90 three weeks or so back, we experienced a lockup upon umounting a larger, dirty filesystem on the host side, unfortunately without capturing a backtrace back then.
>>
>> Today something feeling similar, happened again, on a machine running 5.10.93 both on the host and inside its 10 various VMs.
>>
>> Problem showed shortly (minutes) after shutting down one of the VMs (few hundred GB memory / dataset, VM shutdown was complete already; direct I/O), and then some LVM volume renames, a quick short outside ext4 mount followed by an umount (8 GB volume, probably a few hundred megabyte only to write). Actually monitoring suggests that disk writes were already done about a minute before the onset.
>>
>> What we then experienced, was the following BUG:, followed by one after the other CPU saying goodbye with soft lockup messages over the course of a few minutes; meanwhile there was no more pinging the box, logging in on console, etc.
>> We hard powercycled and it recovered fully.
>>
>> here's the BUG that was logged; if it is useful for someone to see the followup soft lockup messages, tell me + I'll add them.
>>
>> Feb 02 15:22:27 kvm3j kernel: BUG: unable to handle page fault for address: ffffebde00000008
>> Feb 02 15:22:27 kvm3j kernel: #PF: supervisor read access in kernel mode
>> Feb 02 15:22:27 kvm3j kernel: #PF: error_code(0x0000) - not-present page
>> Feb 02 15:22:27 kvm3j kernel: Oops: 0000 [#1] SMP PTI
>> Feb 02 15:22:27 kvm3j kernel: CPU: 7 PID: 39833 Comm: kworker/7:0 Tainted: G I 5.10.93-kvm #1
>> Feb 02 15:22:27 kvm3j kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/20/2013
>> Feb 02 15:22:27 kvm3j kernel: Workqueue: events cache_reap
>> Feb 02 15:22:27 kvm3j kernel: RIP: 0010:free_block.constprop.0+0xc0/0x1f0
>> Feb 02 15:22:27 kvm3j kernel: Code: 4c 8b 16 4c 89 d0 48 01 e8 0f 82 32 01 00 00 4c 89 f2 48 bb 00 00 00 00 00 ea ff ff 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 01 d8 <48> 8b 50 08 48 8d 4a ff 83 e2 01 48
>
>> Feb 02 15:22:27 kvm3j kernel: RSP: 0018:ffffc9000252bdc8 EFLAGS: 00010086
>> Feb 02 15:22:27 kvm3j kernel: RAX: ffffebde00000000 RBX: ffffea0000000000 RCX: ffff888889141b00
>> Feb 02 15:22:27 kvm3j kernel: RDX: 0000777f80000000 RSI: ffff893d3edf3400 RDI: ffff8881000403c0
>> Feb 02 15:22:27 kvm3j kernel: RBP: 0000000080000000 R08: ffff888100041300 R09: 0000000000000003
>> Feb 02 15:22:27 kvm3j kernel: R10: 0000000000000000 R11: ffff888100041308 R12: dead000000000122
>> Feb 02 15:22:27 kvm3j kernel: R13: dead000000000100 R14: 0000777f80000000 R15: ffff893ed8780d60
>> Feb 02 15:22:27 kvm3j kernel: FS: 0000000000000000(0000) GS:ffff893d3edc0000(0000) knlGS:0000000000000000
>> Feb 02 15:22:27 kvm3j kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008 CR3: 000000048c4aa002 CR4: 00000000001726e0
>> Feb 02 15:22:27 kvm3j kernel: Call Trace:
>> Feb 02 15:22:27 kvm3j kernel: drain_array_locked.constprop.0+0x2e/0x80
>> Feb 02 15:22:27 kvm3j kernel: drain_array.constprop.0+0x54/0x70
>> Feb 02 15:22:27 kvm3j kernel: cache_reap+0x6c/0x100
>> Feb 02 15:22:27 kvm3j kernel: process_one_work+0x1cf/0x360
>> Feb 02 15:22:27 kvm3j kernel: worker_thread+0x45/0x3a0
>> Feb 02 15:22:27 kvm3j kernel: ? process_one_work+0x360/0x360
>> Feb 02 15:22:27 kvm3j kernel: kthread+0x116/0x130
>> Feb 02 15:22:27 kvm3j kernel: ? kthread_create_worker_on_cpu+0x40/0x40
>> Feb 02 15:22:27 kvm3j kernel: ret_from_fork+0x22/0x30
>> Feb 02 15:22:27 kvm3j kernel: Modules linked in: hpilo
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008
>> Feb 02 15:22:27 kvm3j kernel: ---[ end trace ded3153d86a92898 ]---
>> Feb 02 15:22:27 kvm3j kernel: RIP: 0010:free_block.constprop.0+0xc0/0x1f0
>> Feb 02 15:22:27 kvm3j kernel: Code: 4c 8b 16 4c 89 d0 48 01 e8 0f 82 32 01 00 00 4c 89 f2 48 bb 00 00 00 00 00 ea ff ff 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 01 d8 <48> 8b 50 08 48 8d 4a ff 83 e2 01 48
>
>> Feb 02 15:22:27 kvm3j kernel: RSP: 0018:ffffc9000252bdc8 EFLAGS: 00010086
>> Feb 02 15:22:27 kvm3j kernel: RAX: ffffebde00000000 RBX: ffffea0000000000 RCX: ffff888889141b00
>> Feb 02 15:22:27 kvm3j kernel: RDX: 0000777f80000000 RSI: ffff893d3edf3400 RDI: ffff8881000403c0
>> Feb 02 15:22:27 kvm3j kernel: RBP: 0000000080000000 R08: ffff888100041300 R09: 0000000000000003
>> Feb 02 15:22:27 kvm3j kernel: R10: 0000000000000000 R11: ffff888100041308 R12: dead000000000122
>> Feb 02 15:22:27 kvm3j kernel: R13: dead000000000100 R14: 0000777f80000000 R15: ffff893ed8780d60
>> Feb 02 15:22:27 kvm3j kernel: FS: 0000000000000000(0000) GS:ffff893d3edc0000(0000) knlGS:0000000000000000
>> Feb 02 15:22:27 kvm3j kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008 CR3: 000000048c4aa002 CR4: 00000000001726e0
>
> Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)
>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my table.
> I can only look briefly into most of them. Unfortunately
> therefore I sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply, that's in everyone's interest.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CC on
> all further activities wrt this regression.
>
> ---
> Additional information about regzbot:
>
> If you want to know more about regzbot, check out its web-interface, the
> getting started guide, and/or the reference documentation:
>
> https://linux-regtracking.leemhuis.info/regzbot/
> https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
> https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md
>
> The last two documents will explain how you can interact with regzbot
> yourself if you want to.
>
> Hint for reporters: when reporting a regression it's in your interest to
> tell #regzbot about it in the report, as that will ensure the regression
> gets on the radar of regzbot and the regression tracker. That's in your
> interest, as they will make sure the report won't fall through the
> cracks unnoticed.
>
> Hint for developers: you normally don't need to care about regzbot once
> it's involved. Fix the issue as you normally would, just remember to
> include a 'Link:' tag to the report in the commit message, as explained
> in Documentation/process/submitting-patches.rst
> That aspect was recently made more explicit in commit 1f57bd42b77c:
> https://git.kernel.org/linus/1f57bd42b77c