On Wed, Aug 05, 2020 at 10:46:12PM -0700, Hugh Dickins wrote:
> On Mon, 27 Jul 2020, Greg KH wrote:
> >
> > Linus just pointed me at this thread.
> >
> > If you could run:
> > 	echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> > and run the same workload to see if anything shows up in the log when
> > xhci crashes, that would be great.
>
> Thanks, I tried that, and indeed it did have a story to tell:
>
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
>    a very large number of lines like the above, then
> Cancel URB 00000000d81602f7, dev 4, ep 0x0, starting at offset 0xfffd42c0
> // Ding dong!
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> Stopped on No-op or Link TRB for slot 1 ep 0
> xhci_drop_endpoint called for udev 000000005bc07fa6
> drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0
> add ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x8
> xhci_check_bandwidth called for udev 000000005bc07fa6
> // Ding dong!
> Successful Endpoint Configure command
> Cancel URB 000000006b77d490, dev 4, ep 0x81, starting at offset 0x0
> // Ding dong!
> Stopped on No-op or Link TRB for slot 1 ep 2
> Removing canceled TD starting at 0x0 (dma).
> list_del corruption: prev(ffff8fdb4de7a130)->next should be ffff8fdb41697f88,
>    but is 6b6b6b6b6b6b6b6b; next(ffff8fdb4de7a130)->prev is 6b6b6b6b6b6b6b6b.
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:53!
> RIP: 0010:__list_del_entry_valid+0x8e/0xb0
> Call Trace:
>  <IRQ>
>  handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd]
>  xhci_irq+0x242/0x1ea0 [xhci_hcd]
>  xhci_msi_irq+0x11/0x20 [xhci_hcd]
>  __handle_irq_event_percpu+0x48/0x2c0
>  handle_irq_event_percpu+0x32/0x80
>  handle_irq_event+0x4a/0x80
>  handle_edge_irq+0xd8/0x1b0
>  handle_irq+0x2b/0x50
>  do_IRQ+0xb6/0x1c0
>  common_interrupt+0x90/0x90
>  </IRQ>
>
> Info provided for your interest, not expecting any response.
> The list_del info in there is non-standard, from a patch of mine:
> I find hashed addresses in debug output less than helpful.

Thanks for this, that is really odd.

> >
> > Although if you are using an "older version" of the driver, there's not
> > much I can suggest except update to a newer one :)
>
> Yes, I was reluctant to post any info, since really the ball is at our
> end of the court, not yours.  I did have a go at bringing in the latest
> xhci driver instead, but quickly saw that was not a sensible task for
> me.  And I did scan the git log of xhci changes (especially xhci-ring.c
> changes): thought I saw a likely relevant and easily applied fix commit,
> but in fact it made no difference here.
>
> I suspect it's in part a hardware problem, but driver not recovering
> correctly.  I've replaced the machine (but also noticed that the same
> crash has occasionally been seen on other machines).  I'm sure it has
> no relevance to this unlock_page() thread, though it's quite possible
> that it's triggered under stress, and Linus's changes allowed greater
> stress.

I will be willing to blame hardware problems for this as well, but will
save this report in case something else shows up in the future, thanks!

greg k-h