uhci irq race before term_td is allocated

Don Zickus <dzickus@xxxxxxxxxx> · Mon, 15 Oct 2012 16:47:28 -0400

Hi Alan,

I am seeing an odd panic with uhci when a 160 cpu box panics and starts
running a kdump kernel (which is the same exact image as the boot kernel)
for our RHEL-6 (2.6.32) kernel.  Now I understand 2.6.32 is not something
upstream supports.

However, my question is what is expected to happen in the following
situation.  The attached panic below shows

usb_hcd_pci_probe->
  usb_add_hcd->
    <snip>
        /* enable irqs just before we start the controller,
         * if the BIOS provides legacy PCI irqs.
         */
        if (usb_hcd_is_primary_hcd(hcd) && irqnum) {
                retval = usb_hcd_request_irqs(hcd, irqnum, irqflags);
                if (retval)
                        goto err_request_irq;
        }

        hcd->state = HC_STATE_RUNNING;
        retval = hcd->driver->start(hcd);
        if (retval < 0) {
                dev_err(hcd->self.controller, "startup error %d\n",
                        retval);
                goto err_hcd_driver_start;
        }
    <snip>

that interrupts are set up and enabled before hcd->driver->start() is called.

Now in

uhci_start->
    <snip>
    uhci->term_td = uhci_alloc_td(uhci);
    <snip>

happens.  But what if an interrupt comes in before that allocation runs?
For example:

uhci_irq->
    <snip>
        if (status & USBSTS_RD)
                usb_hcd_poll_rh_status(hcd);
        else {
                spin_lock(&uhci->lock);
                uhci_scan_schedule(uhci);
                spin_unlock(&uhci->lock);
        }
    <snip>

In uhci_scan_schedule(uhci)

uhci_scan_schedule->
  uhci_clear_next_interrupt->
    uhci->term_td->status &= ~cpu_to_hc32(uhci, TD_CTRL_IOC);

This panics because term_td is not allocated yet.
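
The ordering problem above can be modelled in user space.  This is only a
rough sketch, not kernel code: model_td, model_hcd and the model_* functions
are hypothetical stand-ins for struct uhci_td, struct uhci_hcd,
uhci_clear_next_interrupt() and uhci_start().  The NULL check is an assumed
guard added for illustration, not something the driver is known to do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TD_CTRL_IOC (1u << 24)  /* interrupt-on-complete bit */

/* Hypothetical stand-ins for struct uhci_td and struct uhci_hcd. */
struct model_td  { uint32_t link; uint32_t status; };
struct model_hcd { struct model_td *term_td; };

/*
 * Models uhci_clear_next_interrupt() with an assumed NULL guard.
 * Returns 0 if term_td was touched, -1 if the handler ran too early.
 */
static int model_clear_next_interrupt(struct model_hcd *uhci)
{
        if (!uhci->term_td)
                return -1;      /* irq arrived before start() allocated it */
        uhci->term_td->status &= ~TD_CTRL_IOC;
        return 0;
}

/* Models the term_td allocation done in uhci_start(). */
static int model_start(struct model_hcd *uhci)
{
        uhci->term_td = calloc(1, sizeof(*uhci->term_td));
        if (!uhci->term_td)
                return -1;
        /* Set only so that the later clear is observable in this model. */
        uhci->term_td->status = TD_CTRL_IOC;
        return 0;
}

/* Replays the sequence: irq first, then start, then another irq. */
static void model_demo(void)
{
        struct model_hcd uhci = { .term_td = NULL };

        assert(model_clear_next_interrupt(&uhci) == -1); /* early irq */
        assert(model_start(&uhci) == 0);
        assert(model_clear_next_interrupt(&uhci) == 0);  /* normal irq */
        free(uhci.term_td);
}
```

Without the guard, the first call in model_demo() would dereference a NULL
term_td, which is the same shape as the failure described above.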

Now I could be wrong about the interrupts and the uhci_start routine and
perhaps this is prevented somehow.  This is why I am asking what is the
expectation for the above scenario.

Below is an actual panic that was reported against 2.6.32 which I think
shows the scenario I described above (that could be wrong too).  Though I
did expect an interrupt exception frame in there to back up my theory...

usb usb5: Product: UHCI Host Controller
usb usb5: Manufacturer: Linux 2.6.32-220.7.1.el6.x86_64 uhci_hcd
usb usb5: SerialNumber: 0000:00:1d.3
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
uhci_hcd 0000:02:00.4: PCI INT B -> GSI 17 (level, low) -> IRQ 17
uhci_hcd 0000:02:00.4: UHCI Host Controller
uhci_hcd 0000:02:00.4: new USB bus registered, assigned bus number 6
uhci_hcd 0000:02:00.4: port count misdetected? forcing to 2 ports
BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
IP: [<ffffffff813c1a56>] uhci_scan_schedule+0x46/0xb20
PGD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:

Pid: 56, comm: work_for_cpu Not tainted 2.6.32-220.7.1.el6.x86_64 #1 HP
ProLiant DL980 G7
RIP: 0010:[<ffffffff813c1a56>]  [<ffffffff813c1a]
uhci_scan_schedule+0x46/0xb20
RSP: 0018:ffff880016331ca0  EFLAGS: 00010002
RAX: 0000000000000000 RBX: ffff880019104598 RCX: 0000000000000000
RDX: ffff8800191045e8 RSI: ffff880019104400 RDI: ffff880019104598
RBP: ffff880016331d20 R08: ffff880018aaa170 R09: ffff880019104400
R10: ffffffff8139b5c0 R11: ffff8800190fda80 R12: ffff880019104598
R13: ffff880019104400 R14: ffff880019104620 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880003200000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000004 CR3: 0000000004a85000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process work_for_cpu (pid: 56, threadinfo ffff880016330000, task
ffff880016322b00)
Stack:
 0000000000000000 0000000000000000 0000000000000000 ffff8800191045e8
<0> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
<0> 0000000000000000 0000000000000000 0000000000003731 ffff880019104620
Call Trace:
 [<ffffffff813c4abc>] uhci_irq+0x7c/0x180
 [<ffffffff810daa94>] ? disable_irq_nosync+0x64/0xa0
 [<ffffffff8139b5ff>] usb_hcd_irq+0x3f/0x90
 [<ffffffff810dacdc>] request_threaded_irq+0x1bc/0x2f0
 [<ffffffff8139b5c0>] ? usb_hcd_irq+0x0/0x90
 [<ffffffff8139d1be>] usb_add_hcd+0x3be/0x800
 [<ffffffff813adae8>] usb_hcd_pci_probe+0x158/0x3d0
 [<ffffffff8108b5f0>] ? do_work_for_cpu+0x0/0x30
 [<ffffffff81289b47>] local_pci_probe+0x17/0x20
 [<ffffffff8108b608>] do_work_for_cpu+0x18/0x30
 [<ffffffff81090726>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81090690>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 00 00 48 89 fb a8 01 0f 85 69 0a 00 00 48 8d 57 50 83 c8 01 88 87 c8
00 00 00 48 89 55 98 83 
e0 bd 88 83 c8 00 00 00 48 8b 43 20 <81> 60 04 ff ff ff fe 83 bb bc 00 00
00 00 0f 84 09 0a 00 00 
8b
RIP  [<ffffffff813c1a56>] uhci_scan_schedule+0x46/0xb20
 RSP <ffff880016331ca0>
CR2: 0000000000000004
---[ end trace b5ebaaef70b2501d ]---
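
The register dump looks consistent with a NULL term_td.  The faulting bytes
in the Code: line, "<81> 60 04 ff ff ff fe", decode as
andl $0xfeffffff,0x4(%rax).  RAX is 0 and CR2 is 4.  The sketch below checks
the arithmetic, assuming struct uhci_td begins with the 32-bit link and
status words.  model_td is a hypothetical mirror of those leading fields, not
the kernel definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TD_CTRL_IOC (1u << 24)  /* interrupt-on-complete bit */

/* Hypothetical mirror of the first, hardware-visible words of a UHCI TD. */
struct model_td {
        uint32_t link;
        uint32_t status;
        uint32_t token;
        uint32_t buffer;
};

/* Offset written to when term_td is NULL: matches CR2 if it equals 4. */
static size_t model_fault_offset(void)
{
        return offsetof(struct model_td, status);
}

/* Mask applied by "status &= ~TD_CTRL_IOC": the immediate in the andl. */
static uint32_t model_ioc_clear_mask(void)
{
        return ~TD_CTRL_IOC;
}

/* Self-check of both values against the oops output. */
static void model_check(void)
{
        assert(model_fault_offset() == 4);
        assert(model_ioc_clear_mask() == 0xfeffffffu);
}
```

Under that layout assumption, a write to term_td->status through a NULL
pointer lands at address 4 with mask 0xfeffffff, which matches the oops.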

If you aren't familiar with kdump issues, it is not uncommon for
interrupts to still be active on devices (though irqpoll is set to prevent
irq flooding on boot up).  This box doesn't have much for usb devices.
The only thing people think might be contributing to this problem is
customers running '/usr/sbin/gpm -m /dev/input/mice -t exps2'.

On the other hand, we are not too sure how to duplicate the problem other
than 'echo c > /proc/sysrq-trigger' about 100 times and see if we get
lucky.

I was just wondering if you had a quick thought about this or not.

Thanks,
Don
