Re: ehci-hcd.c causes: irq <number>: nobody cared

"Dr. Werner Fink" <werner@xxxxxxx> · Tue, 13 May 2014 17:29:23 +0200

On Tue, May 13, 2014 at 11:08:27AM -0400, Alan Stern wrote:
> Please CC: your patches to the maintainer of the driver you are 
> changing.
> 
> On Tue, 13 May 2014, Dr. Werner Fink wrote:
> 
> > Hi,
> > 
> > this bug has been hitting my system for a long time now.  I found e.g. this
> > 
> >  speedy kernel: [ 9575.033019] irq 16: nobody cared (try booting with the "irqpoll" option)
> >  speedy kernel: [ 9575.033022] Pid: 0, comm: swapper/0 Not tainted 3.7.10-1.1-desktop #1
> 
> The 3.7 kernel is fairly old.  It's entirely possible that the problem
> has already been fixed in 3.14.

The patch I've attached is for 3.14, and AFAICS the problem is not fixed there.
In the past I have reported this problem more than once and always got the same
answer: that the new kernel would not show it.

> >  speedy kernel: [ 9575.033023] Call Trace:
> >  speedy kernel: [ 9575.033031]  [<ffffffff81004818>] dump_trace+0x88/0x300
> >  speedy kernel: [ 9575.033035]  [<ffffffff8158b033>] dump_stack+0x69/0x6f
> >  speedy kernel: [ 9575.033038]  [<ffffffff810d6c56>] __report_bad_irq+0x36/0xe0
> >  speedy kernel: [ 9575.033041]  [<ffffffff810d7158>] note_interrupt+0x1e8/0x240
> >  speedy kernel: [ 9575.033045]  [<ffffffff810d4772>] handle_irq_event_percpu+0xc2/0x250
> >  speedy kernel: [ 9575.033047]  [<ffffffff810d4947>] handle_irq_event+0x47/0x70
> >  speedy kernel: [ 9575.033049]  [<ffffffff810d7c50>] handle_fasteoi_irq+0x60/0x100
> >  speedy kernel: [ 9575.033051]  [<ffffffff810046c8>] handle_irq+0x18/0x30
> >  speedy kernel: [ 9575.033053]  [<ffffffff810043a2>] do_IRQ+0x52/0xd0
> >  speedy kernel: [ 9575.033056]  [<ffffffff8159806d>] common_interrupt+0x6d/0x6d
> >  speedy kernel: [ 9575.033061]  [<ffffffff8132018c>] intel_idle+0xec/0x160
> >  speedy kernel: [ 9575.033064]  [<ffffffff81452e0d>] cpuidle_idle_call+0x9d/0x330
> >  speedy kernel: [ 9575.033067]  [<ffffffff8100be0a>] cpu_idle+0x6a/0xe0
> >  speedy kernel: [ 9575.033071]  [<ffffffff81ac8bc8>] start_kernel+0x3b8/0x3c3
> >  speedy kernel: [ 9575.033073]  [<ffffffff81ac8436>] x86_64_start_kernel+0x105/0x114
> >  speedy kernel: [ 9575.033075] handlers:
> >  speedy kernel: [ 9575.033077] [<ffffffff813f2220>] usb_hcd_irq
> >  speedy kernel: [ 9575.033080] [<ffffffffa0282940>] rtl8139_interrupt [8139too]
> >  speedy kernel: [ 9575.033080] Disabling IRQ #16
> > 
> > IRQ 16 is used by ehci_hcd:usb1 and eth1.
> 
> How do you know that the problem was caused by ehci-hcd rather than 
> 8139too?  Or by some other piece of hardware entirely?

I've also seen this with another ethernet card.  And the status bit is
always one of the bits defined for the USB controller.

> >  Adding the "irqpoll" option to the kernel's
> > command line did not help.  Therefore I debugged this problem by adding a printk()
> > debug line to the ehci_irq() function in drivers/usb/host/ehci-hcd.c.  This showed
> > that my USB controller reports STS_RECL (the read-only reclamation status bit) in the
> > IRQ status.
> 
> What makes you think that STS_RECL is the cause of the problem?  It is 
> quite normal for STS_RECL to be set.

As described:  the printk() does show exactly this bit.
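
To make such a printk() value easier to read, the raw status word can be
turned into bit names.  The following is only a minimal user-space sketch:
the helper ehci_sts_decode() and the table sts_names are hypothetical names
made up for this illustration, and the bit positions are the USBSTS
definitions as I read them in include/linux/usb/ehci_def.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One USBSTS bit and a short name for it (illustration only). */
struct sts_name {
	uint32_t bit;
	const char *name;
};

/* Bit positions assumed from include/linux/usb/ehci_def.h. */
static const struct sts_name sts_names[] = {
	{ 1u << 0,  "INT"   },	/* transfer completed */
	{ 1u << 1,  "ERR"   },	/* transfer error */
	{ 1u << 2,  "PCD"   },	/* port change detect */
	{ 1u << 3,  "FLR"   },	/* frame list rolled over */
	{ 1u << 4,  "FATAL" },	/* host system error */
	{ 1u << 5,  "IAA"   },	/* interrupt on async advance */
	{ 1u << 12, "HALT"  },	/* controller halted */
	{ 1u << 13, "RECL"  },	/* reclamation, read-only */
	{ 1u << 14, "ASS"   },	/* async schedule status */
	{ 1u << 15, "PSS"   },	/* periodic schedule status */
};

/*
 * Hypothetical helper: write the names of all set bits into buf,
 * separated by single spaces, lowest bit first.  Stops cleanly if
 * the next name would not fit.  Returns buf.
 */
static char *ehci_sts_decode(uint32_t status, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (i = 0; i < sizeof(sts_names) / sizeof(sts_names[0]); i++) {
		int n;

		if (!(status & sts_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", sts_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial name */
			break;
		}
		used += (size_t)n;
	}
	return buf;
}
```

With this, a logged status of 0x00002000 decodes to just RECL, i.e. none
of the bits the handler normally acts on.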

> > After a while this led to the message in the subject, with the side effect that
> > networking became slow.
> 
> How do you know that something else didn't cause the "nobody cared" 
> error?

Yes.

> > I developed the attached patch from the debugging code.  It is not perfect, as it
> > returns IRQ_NONE the first time the STS_RECL status bit is found, but it does
> > its job.
> 
> Please put your patches in the main email message; don't attach them.  
> Now there's no easy way for me to include it in this reply.
> 
> The patch is definitely wrong.  It will never set spurious_recl, 
> because the "if (unlikely(masked_status & STS_RECL))" test can't 
> succeed unless spurious_recl has already been set.

OK ... I changed the patch because I had been told that I should do it this
way.  In my original code I simply used

	masked_status = status & (INTR_MASK | STS_FLR | STS_RECL);

	/* Shared IRQ? */
	if (!masked_status || unlikely(ehci->rh_state == EHCI_RH_HALTED)) {
		spin_unlock_irqrestore(&ehci->lock, flags);
		printk("ehci_irq status: %#8.8x", status);
		return IRQ_NONE;
	}

and with this I can use my ethernet card for more than 15 minutes.  I first
added the printk() line after I had also put some printk() lines into the
ethernet driver to see what was wrong with the shared IRQ.  I then identified
STS_RECL from the printk() output above in my logs and or'd STS_RECL into the
masked status bits.  After this all problems were gone.
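
The effect of the extra bit on the "Shared IRQ?" test can be shown outside
the kernel.  This is only a sketch under assumptions: the two helper
functions are hypothetical, the rh_state check is left out, and the bit
values and INTR_MASK are copied as I understand them from
include/linux/usb/ehci_def.h and drivers/usb/host/ehci-hcd.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* USBSTS bits, assumed from include/linux/usb/ehci_def.h */
#define STS_INT		(1u << 0)
#define STS_ERR		(1u << 1)
#define STS_PCD		(1u << 2)
#define STS_FLR		(1u << 3)
#define STS_FATAL	(1u << 4)
#define STS_IAA		(1u << 5)
#define STS_RECL	(1u << 13)

/* Assumed from drivers/usb/host/ehci-hcd.c */
#define INTR_MASK (STS_IAA | STS_FATAL | STS_PCD | STS_ERR | STS_INT)

/*
 * Hypothetical helper: the unmodified mask.  Returns true when the
 * handler would go on and claim the interrupt, false when it would
 * return IRQ_NONE.
 */
static bool claims_irq_stock(uint32_t status)
{
	return (status & (INTR_MASK | STS_FLR)) != 0;
}

/* Hypothetical helper: the same test with STS_RECL or'd into the mask. */
static bool claims_irq_with_recl(uint32_t status)
{
	return (status & (INTR_MASK | STS_FLR | STS_RECL)) != 0;
}
```

For a status word with only STS_RECL set, claims_irq_stock() is false and
claims_irq_with_recl() is true, which is the difference between returning
IRQ_NONE and handling the interrupt.  This says nothing about whether
STS_RECL is the real source of the interrupt; it only shows what the
changed mask does.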

> 
> Alan Stern

Werner

-- 
  "Having a smoking section in a restaurant is like having
          a peeing section in a swimming pool." -- Edward Burr