Re: System crash/lockup after plugging CDC ACM device

David Guillen Fandos <david@xxxxxxxxxxx> · Wed, 15 Jul 2020 12:31:42 +0200

On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen Fandos wrote:
> > Hello linux-usb,
> > 
> > I think I might have found a kernel bug related to the USB
> > subsystem
> > (cdc_acm perhaps).
> > 
> > Context: I was playing around with a device I'm creating,
> > essentially a
> > USB quad modem device that exposes four modems to the host system.
> > This
> > device is still a prototype so there's a few bugs here and there,
> > most
> > likely in the USB descriptors and control requests.
> > 
> > What happens: After plugging the device the system starts spitting
> > warnings and BUGs and it locks up. Most of the time some CPUs get
> > into
> > some spinloop and never comes back (you can see it being detected
> > by
> > the watchdog after a few seconds). Generally after that the USB
> > devices
> > stop working completely and at some point the machine freezes
> > completely. In a couple of ocasions I managed to see a bug in dmesg
> > saying "unable to handle page fault for address XXX" and
> > "Supervisor
> > read access in kernel mode" "error code (0x0000) not present page".
> > I
> > could not get a trace for that one since the kernel died completely
> > and
> > my log files were truncated/lost.
> > 
> > Since it is happening to my two machines (both Intel but rather
> > different controllers, Sunrise Point-LP USB 3.0 vs 8 Series/C220)
> > and
> > with different kernel versions I suspect this might be a bug in the
> > kernel.
> > 
> > I have 4 logs that I collected, they are sort of long-ish, not sure
> > how
> > to best send them to the list.
> 
> Send the crashes with the callback list, that should be quite small,
> right?  We don't need the full log.
> 
> The first crash is the most important, the others can be from the
> first
> one and are not reliable.
> 
> thanks,
> 
> greg k-h

Ok then, here comes one of the logs, I selected some bits only

[  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
__queue_work+0x364/0x410
[...]
[  147.302322] Call Trace:
[  147.302329]  <IRQ>
[  147.302342]  queue_work_on+0x36/0x40
[  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
[  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
[  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
[  147.302382]  __do_softirq+0xe9/0x2dc
[  147.302391]  irq_exit+0xcf/0x110
[  147.302397]  do_IRQ+0x55/0xe0
[  147.302408]  common_interrupt+0xf/0xf
[  147.302413]  </IRQ>
[...]
[  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for 23s!
[kworker/3:2:134]

The other machines:

Jul 15 01:22:16 localhost kernel: WARNING: CPU: 0 PID: 0 at
kernel/workqueue.c:1473 __queue_work+0x342/0x3e0
[...]
Jul 15 01:22:16 localhost kernel: Call Trace:
Jul 15 01:22:16 localhost kernel: <IRQ>
Jul 15 01:22:16 localhost kernel: queue_work_on+0x36/0x40
Jul 15 01:22:16 localhost kernel: __usb_hcd_giveback_urb+0x6f/0x120
Jul 15 01:22:16 localhost kernel: usb_giveback_urb_bh+0xa0/0xf0
Jul 15 01:22:16 localhost kernel:
tasklet_action_common.isra.0+0x5b/0x100
Jul 15 01:22:16 localhost kernel: __do_softirq+0xee/0x2ff
Jul 15 01:22:16 localhost kernel: irq_exit+0xe9/0xf0
Jul 15 01:22:16 localhost kernel: do_IRQ+0x55/0xe0
Jul 15 01:22:16 localhost kernel: common_interrupt+0xf/0xf
Jul 15 01:22:16 localhost kernel: </IRQ>

Pretty similar as well. This is the first crash, the other ones mention
the cdc-acm but the first ones do not.

In one occasion the first crash seems different, here goes:

[   46.713823] ------------[ cut here ]------------
[   46.713827] refcount_t: underflow; use-after-free.
[   46.713884] WARNING: CPU: 2 PID: 33 at lib/refcount.c:28
refcount_warn_saturate+0xa6/0xf0
[...]
[   46.714148] Call Trace:
[   46.714166]  acm_disconnect+0x198/0x280 [cdc_acm]
[   46.714188]  usb_unbind_interface+0x8a/0x270
[   46.714198]  __device_release_driver+0x15c/0x210
[   46.714204]  device_release_driver+0x24/0x30
[   46.714209]  bus_remove_device+0xdb/0x140
[   46.714218]  device_del+0x16f/0x3f0
[   46.714227]  usb_disable_device+0xd6/0x290
[   46.714234]  usb_disconnect.cold+0x7e/0x205
[   46.714242]  hub_event+0xbfa/0x1810
[   46.714257]  process_one_work+0x1b4/0x380
[   46.714264]  worker_thread+0x53/0x3e0
[   46.714269]  ? process_one_work+0x380/0x380
[   46.714276]  kthread+0x115/0x140
[   46.714284]  ? __kthread_bind_mask+0x60/0x60
[   46.714293]  ret_from_fork+0x35/0x40

Let me know if you need more relevant/complete logs. Will be glad to
send them over.

Thanks
David