Re: System crash/lockup after plugging CDC ACM device

Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> · Wed, 15 Jul 2020 12:50:34 +0200

On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen Fandos wrote:
> On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen Fandos wrote:
> > > Hello linux-usb,
> > > 
> > > I think I might have found a kernel bug related to the USB
> > > subsystem (cdc_acm perhaps).
> > > 
> > > Context: I was playing around with a device I'm creating,
> > > essentially a USB quad modem device that exposes four modems to the
> > > host system. This device is still a prototype, so there are a few
> > > bugs here and there, most likely in the USB descriptors and control
> > > requests.
> > > 
> > > What happens: after plugging in the device, the system starts
> > > spitting out warnings and BUGs and then locks up. Most of the time
> > > some CPUs get stuck in a spinloop and never come back (you can see
> > > the watchdog detecting it after a few seconds). Generally, after
> > > that the USB devices stop working entirely, and at some point the
> > > machine freezes completely. On a couple of occasions I managed to
> > > see a bug in dmesg saying "unable to handle page fault for address
> > > XXX", "Supervisor read access in kernel mode" and "error code
> > > (0x0000) not present page". I could not get a trace for that one
> > > since the kernel died completely and my log files were
> > > truncated/lost.
> > > 
> > > Since it is happening on both of my machines (both Intel, but with
> > > rather different controllers: Sunrise Point-LP USB 3.0 vs 8
> > > Series/C220) and with different kernel versions, I suspect this
> > > might be a bug in the kernel.
> > > 
> > > I have collected 4 logs, but they are somewhat long and I'm not
> > > sure how best to send them to the list.
> > 
> > Send just the crashes with their call traces; those should be quite
> > small, right?  We don't need the full log.
> > 
> > The first crash is the most important; the others may be caused by
> > the first one and are not reliable.
> > 
> > thanks,
> > 
> > greg k-h
> 
> Ok then, here is one of the logs; I have included only selected parts.
> 
> [  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473 __queue_work+0x364/0x410
> [...]
> [  147.302322] Call Trace:
> [  147.302329]  <IRQ>
> [  147.302342]  queue_work_on+0x36/0x40
> [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> [  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
> [  147.302382]  __do_softirq+0xe9/0x2dc
> [  147.302391]  irq_exit+0xcf/0x110
> [  147.302397]  do_IRQ+0x55/0xe0
> [  147.302408]  common_interrupt+0xf/0xf
> [  147.302413]  </IRQ>
> [...]
> [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kworker/3:2:134]

That was the first message?

Ok, we need some more of the log. How about the 30 lines right before
the above?

And what kernel version are you using?

thanks,

greg k-h
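Greg asks for the first crash with its call trace, plus the 30 lines of log right before it. A small helper can pull exactly that out of a saved dmesg file. What follows is a minimal sketch, not a kernel tool. It assumes a report starts at the first line containing one of a few common markers (`WARNING:`, `BUG:`, `Oops:`, `general protection fault`), and that it ends at the kernel's `---[ end trace` line or after a fixed number of lines. The marker list and the `first_crash` name are illustrative assumptions.

```python
import re
from typing import List

# Assumed (non-exhaustive) markers that begin a kernel warning/oops report.
CRASH_MARKERS = re.compile(r"(WARNING:|BUG:|Oops:|general protection fault)")


def first_crash(lines: List[str], before: int = 30, after: int = 60) -> List[str]:
    """Return the first crash report plus `before` lines of leading context.

    The report is assumed to end at the first '---[ end trace' line, or
    after `after` lines if no such line is found. Returns [] if no marker
    appears in the log.
    """
    for i, line in enumerate(lines):
        if CRASH_MARKERS.search(line):
            start = max(0, i - before)          # leading context requested
            end = min(len(lines), i + after)    # fallback cut-off
            for j in range(i, end):
                if "---[ end trace" in lines[j]:
                    end = j + 1                 # include the end-trace line
                    break
            return lines[start:end]
    return []


if __name__ == "__main__":
    import sys

    # Usage: python first_crash.py dmesg.log
    with open(sys.argv[1], errors="replace") as f:
        print("".join(first_crash(f.readlines())), end="")
```

The running kernel version, which Greg also asks for, is what `uname -r` prints.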