Re: [RFC PATCH] mm: silence soft lockups from unlock_page

Hugh Dickins <hughd@xxxxxxxxxx> · Wed, 5 Aug 2020 22:46:12 -0700 (PDT)

On Mon, 27 Jul 2020, Greg KH wrote:
> 
> Linus just pointed me at this thread.
> 
> If you could run:
> 	echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> and run the same workload to see if anything shows up in the log when
> xhci crashes, that would be great.

Thanks, I tried that, and indeed it did have a story to tell:

ep 0x81 - asked for 16 bytes, 10 bytes untransferred
ep 0x81 - asked for 16 bytes, 10 bytes untransferred
ep 0x81 - asked for 16 bytes, 10 bytes untransferred
   a very large number of lines like the above, then
Cancel URB 00000000d81602f7, dev 4, ep 0x0, starting at offset 0xfffd42c0
// Ding dong!
ep 0x81 - asked for 16 bytes, 10 bytes untransferred
Stopped on No-op or Link TRB for slot 1 ep 0
xhci_drop_endpoint called for udev 000000005bc07fa6
drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0
add ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x8
xhci_check_bandwidth called for udev 000000005bc07fa6
// Ding dong!
Successful Endpoint Configure command
Cancel URB 000000006b77d490, dev 4, ep 0x81, starting at offset 0x0
// Ding dong!
Stopped on No-op or Link TRB for slot 1 ep 2
Removing canceled TD starting at 0x0 (dma).
list_del corruption: prev(ffff8fdb4de7a130)->next should be ffff8fdb41697f88,
   but is 6b6b6b6b6b6b6b6b; next(ffff8fdb4de7a130)->prev is 6b6b6b6b6b6b6b6b.
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:53!
RIP: 0010:__list_del_entry_valid+0x8e/0xb0
Call Trace:
 <IRQ>
 handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd]
 xhci_irq+0x242/0x1ea0 [xhci_hcd]
 xhci_msi_irq+0x11/0x20 [xhci_hcd]
 __handle_irq_event_percpu+0x48/0x2c0
 handle_irq_event_percpu+0x32/0x80
 handle_irq_event+0x4a/0x80
 handle_edge_irq+0xd8/0x1b0
 handle_irq+0x2b/0x50
 do_IRQ+0xb6/0x1c0
 common_interrupt+0x90/0x90
 </IRQ>

Info provided for your interest, not expecting any response.
The list_del info in there is non-standard, from a patch of mine:
I find hashed addresses in debug output less than helpful.
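
A note on reading the list_del line above: 0x6b is the kernel's POISON_FREE
byte, which slab debugging writes over freed objects. Pointers of
6b6b6b6b6b6b6b6b therefore suggest that the neighbouring list node had
already been freed when the canceled TD was unlinked. The following is a
minimal userspace sketch of that kind of check, not the kernel's own code:
the struct is a simplified stand-in, and looks_like_freed_memory(),
list_del_entry_valid() and poison_node() are hypothetical names used only
for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Byte that slab debugging writes over freed objects. */
#define POISON_FREE 0x6b

/* Hypothetical helper: is every byte of this pointer value POISON_FREE? */
static bool looks_like_freed_memory(const void *ptr)
{
	unsigned char bytes[sizeof(ptr)];

	memcpy(bytes, &ptr, sizeof(ptr));
	for (size_t i = 0; i < sizeof(ptr); i++)
		if (bytes[i] != POISON_FREE)
			return false;
	return true;
}

/* The two neighbour checks made before unlinking an entry (no reporting). */
static bool list_del_entry_valid(const struct list_head *entry)
{
	return entry->prev->next == entry && entry->next->prev == entry;
}

/* Simulate what slab poisoning does to a node once it has been freed. */
static void poison_node(struct list_head *node)
{
	memset(node, POISON_FREE, sizeof(*node));
}
```

If a neighbour of the entry is poisoned in this way, list_del_entry_valid()
fails on that entry, and looks_like_freed_memory() on the neighbour's
pointers distinguishes "freed" from other kinds of corruption.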

> 
> Although if you are using an "older version" of the driver, there's not
> much I can suggest except update to a newer one :)

Yes, I was reluctant to post any info, since really the ball is at our
end of the court, not yours. I did have a go at bringing in the latest
xhci driver instead, but quickly saw that was not a sensible task for
me. And I did scan the git log of xhci changes (especially xhci-ring.c
changes): thought I saw a likely relevant and easily applied fix commit,
but in fact it made no difference here.

I suspect it's in part a hardware problem, with the driver then failing
to recover correctly. I've replaced the machine (but also noticed that
the same crash has occasionally been seen on other machines). I'm sure
it has no relevance to this unlock_page() thread, though it's quite
possible that it's triggered under stress, and that Linus's changes
allowed greater stress.

Hugh