+cc Greg, Jiri in case serial8250 really is an issue here.

For context, the original syzbot report is at [1]; it gets snipped here.

[1]: https://lore.kernel.org/all/6723aa4d.050a0220.35b515.0161.GAE@xxxxxxxxxx/

On Thu, Oct 31, 2024 at 03:24:57PM -0400, Alan Stern wrote:
> On Thu, Oct 31, 2024 at 04:58:29PM +0000, Lorenzo Stoakes wrote:
> > +Alan re: USB stalls
> >
> > On Thu, Oct 31, 2024 at 09:41:02AM -0700, syzbot wrote:
> > > Hello,
> > >
> > > syzbot has tested the proposed patch and the reproducer did not trigger any issue:
> > >
> > > Reported-by: syzbot+7402e6c8042635c93ead@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Tested-by: syzbot+7402e6c8042635c93ead@xxxxxxxxxxxxxxxxxxxxxxxxx
> > >
> > > Tested on:
> > >
> > > commit:         cffcc47b mm/mlock: set the correct prev on failure
> > > git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/ mm-hotfixes-unstable
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=1304a630580000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=6648774f7c39d413
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=7402e6c8042635c93ead
> > > compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> > >
> > > Note: no patches were applied.
> > > Note: testing is done by a robot and is best-effort only.
> >
> > OK, this seems likely to be intermittent (and honestly unrelated to what's
> > in mm-unstable-fixes), and it does make me wonder whether the fix referenced
> > in [0] really has sorted things out, or whether it has perhaps helped
> > mitigate the issue but not sufficiently in conjunction with debug options
> > that slow things down.
>
> This looks very different from the issues that were addressed by the fix
> I mentioned in [0]. In particular, the log traces for this series of
> bug reports all start with something like this:
>
> serial_out drivers/tty/serial/8250/8250.h:142 [inline]
> serial8250_console_fifo_write drivers/tty/serial/8250/8250_port.c:3322 [inline]
> serial8250_console_write+0xf9e/0x17c0 drivers/tty/serial/8250/8250_port.c:3393
> console_emit_next_record kernel/printk/printk.c:3092 [inline]
> console_flush_all+0x800/0xc60 kernel/printk/printk.c:3180
> __console_flush_and_unlock kernel/printk/printk.c:3239 [inline]
> console_unlock+0xd9/0x210 kernel/printk/printk.c:3279
> vprintk_emit+0x424/0x6f0 kernel/printk/printk.c:2407
> vprintk+0x7f/0xa0 kernel/printk/printk_safe.c:68
> _printk+0xc8/0x100 kernel/printk/printk.c:2432
> printk_stack_address arch/x86/kernel/dumpstack.c:72 [inline]
>
> indicating that perhaps the problem is related to the 8250 driver. Or
> perhaps that driver just happens to wait for long periods and so is more
> likely to show up when the real problem occurs.

Yeah, see below; I think the waiting is probably the issue, to be honest. It's
hard to know whether this backtrace is actually related or just happened to be
the code executing at the time of the stall. Have cc'd the serial8250 people in
any case.
> By contrast, the log traces for the [0] bug reports all show something
> like this:
>
> context_switch kernel/sched/core.c:5315 [inline]
> __schedule+0x105f/0x34b0 kernel/sched/core.c:6675
> __schedule_loop kernel/sched/core.c:6752 [inline]
> schedule+0xe7/0x350 kernel/sched/core.c:6767
> usb_kill_urb.part.0+0x1ca/0x250 drivers/usb/core/urb.c:713
> usb_kill_urb+0x83/0xa0 drivers/usb/core/urb.c:702
> usb_start_wait_urb+0x255/0x4c0 drivers/usb/core/message.c:65
> usb_internal_control_msg drivers/usb/core/message.c:103 [inline]
> usb_control_msg+0x327/0x4b0 drivers/usb/core/message.c:154
>
> because that bug involved usb_kill_urb() waiting indefinitely for an
> event that never happens.

Ah thanks, sorry, I am pattern-matching a bit here on USB-related things.

I suspect the issue here is that the test is on some level assuming it can
tolerate delays or hold-ups that would normally be fine on a non-debug kernel,
without taking into account the fact that CONFIG_DEBUG_VM_MAPLE_TREE can
really, really slow things down and is a very heavy-handed option.

I think we should nearly always turn it on, as it correctly identifies serious
issues; however, in cases where we _expect_ slowdown or significant waiting
related to hardware or simulated-hardware actions, we might want to reconsider
that or at least increase timeouts (see the rough config sketch at the bottom
of this mail).

Liam has submitted a patch to explicitly rule out an infinite loop in the maple
tree as a source of any stall [2], though there is absolutely no reason why
this should happen other than in the face of overwhelming memory corruption. I
still suspect these stalls are just due to slowdown. Perhaps somebody from the
syzkaller side can look into mitigation?

[2]: https://lore.kernel.org/all/20241031193608.1965366-1-Liam.Howlett@xxxxxxxxxx

> Alan Stern
>
> [0]: https://lore.kernel.org/all/967f3aa0-447a-4121-b80b-299c926a33f5@xxxxxxxxxxxxxxxxxxx/
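To make the "increase timeouts" suggestion above concrete, here is a rough,
untested sketch of the sort of config knobs the syzkaller instances could tune,
on the assumption (mine, not confirmed) that these stalls are pure
debug-overhead slowdown being flagged by the kernel's own stall/hang detectors
rather than a genuine hang; the exact values are illustrative only:

  # Illustrative sketch only, not a tested configuration.
  # Either drop the heavy validation entirely:
  # CONFIG_DEBUG_VM_MAPLE_TREE is not set
  # ...or relax the detectors to tolerate the extra debug overhead:
  CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=240
  CONFIG_RCU_CPU_STALL_TIMEOUT=100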