Re: [RFC PATCH] mm: silence soft lockups from unlock_page

Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> · Tue, 18 Aug 2020 15:50:45 +0200

On Wed, Aug 05, 2020 at 10:46:12PM -0700, Hugh Dickins wrote:
> On Mon, 27 Jul 2020, Greg KH wrote:
> > 
> > Linus just pointed me at this thread.
> > 
> > If you could run:
> > 	echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> > and run the same workload to see if anything shows up in the log when
> > xhci crashes, that would be great.
> 
> Thanks, I tried that, and indeed it did have a story to tell:
> 
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
>    a very large number of lines like the above, then
> Cancel URB 00000000d81602f7, dev 4, ep 0x0, starting at offset 0xfffd42c0
> // Ding dong!
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> Stopped on No-op or Link TRB for slot 1 ep 0
> xhci_drop_endpoint called for udev 000000005bc07fa6
> drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0
> add ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x8
> xhci_check_bandwidth called for udev 000000005bc07fa6
> // Ding dong!
> Successful Endpoint Configure command
> Cancel URB 000000006b77d490, dev 4, ep 0x81, starting at offset 0x0
> // Ding dong!
> Stopped on No-op or Link TRB for slot 1 ep 2
> Removing canceled TD starting at 0x0 (dma).
> list_del corruption: prev(ffff8fdb4de7a130)->next should be ffff8fdb41697f88,
>    but is 6b6b6b6b6b6b6b6b; next(ffff8fdb4de7a130)->prev is 6b6b6b6b6b6b6b6b.
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:53!
> RIP: 0010:__list_del_entry_valid+0x8e/0xb0
> Call Trace:
>  <IRQ>
>  handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd]
>  xhci_irq+0x242/0x1ea0 [xhci_hcd]
>  xhci_msi_irq+0x11/0x20 [xhci_hcd]
>  __handle_irq_event_percpu+0x48/0x2c0
>  handle_irq_event_percpu+0x32/0x80
>  handle_irq_event+0x4a/0x80
>  handle_edge_irq+0xd8/0x1b0
>  handle_irq+0x2b/0x50
>  do_IRQ+0xb6/0x1c0
>  common_interrupt+0x90/0x90
>  </IRQ>
> 
> Info provided for your interest, not expecting any response.
> The list_del info in there is non-standard, from a patch of mine:
> I find hashed addresses in debug output less than helpful.

Thanks for this, that is really odd.

> > 
> > Although if you are using an "older version" of the driver, there's not
> > much I can suggest except update to a newer one :)
> 
> Yes, I was reluctant to post any info, since really the ball is at our
> end of the court, not yours. I did have a go at bringing in the latest
> xhci driver instead, but quickly saw that was not a sensible task for
> me. And I did scan the git log of xhci changes (especially xhci-ring.c
> changes): thought I saw a likely relevant and easily applied fix commit,
> but in fact it made no difference here.
> 
> I suspect it's in part a hardware problem, but driver not recovering
> correctly. I've replaced the machine (but also noticed that the same
> crash has occasionally been seen on other machines). I'm sure it has
> no relevance to this unlock_page() thread, though it's quite possible
> that it's triggered under stress, and Linus's changes allowed greater
> stress.

I will be willing to blame hardware problems for this as well, but will
save this report in case something else shows up in the future, thanks!

greg k-h