Re: Testing for hardware bug in EHCI controllers

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Mon, 18 Mar 2013 10:59:03 -0400 (EDT)

On Mon, 18 Mar 2013, Noone Nowhere wrote:

> > I can't fix the block count issue.  That's not a bug in the program;
> > it's a bug somewhere else.
> 
> Who can fix it? Should we test another stick?

That depends on where the bug is.  If it is in the stick then testing 
a different brand of flash drive would help.

> >A workaround for the bug has already been submitted to the various
> >stable kernel series.  The workaround may decrease performance slightly
> >for things like large data transfers to/from a mass-storage device.  I
> >haven't tried to measure the effect, and it will be different on
> >different platforms anyway.
> 
> Will UAS(when operational) on 2.0 be affected?

It could well be.

> > Evidently hardware from a large number of vendors is affected by this
> > bug.  You have to wonder if they all got the intellectual property
> > from a common source.
> 
> Or better who copies/steals from whom?

:-)

> > It will apply to a 3.8 kernel.  After applying that patch, you should
> > find that the EHCI bug doesn't show up.
> 
> Will retest all boards with a kernel having the patch.
> 
> Testing only the A50M, in dmesg we get:
> 
> [  139.795681] ehci-pci 0000:00:12.2: IAA with IAAD still set? , 150 times
> [  141.814383] ehci-pci 0000:00:12.2: shutdown urb ffff88004cdf5d80
> ep1in-bulk , 600 times

The "IAA with IAAD" messages are probably the result of a harmless
oversight in the workaround patch.  I believe that the patch below,
applied on top of the workaround, will eliminate them.  Alternatively,
you can simply run the tests with a kernel that has CONFIG_USB_DEBUG
disabled.

> what is alarming is that the test program still fails and prints:
> 
> 100
> 200
> 300
> URB timed out; bug may be present
> Wrong URB completed
> 
> , with the patch applied the program shouldn't have exited on its own, right?

"Should" is a slippery concept.  The hardware bug we're concerned about 
"should" have been fixed in Intel's original hardware design.

The test program does a bunch of error checking but practically no
error recovery.  If almost any tiny little thing goes wrong, the
program exits.  You might think that nothing else "should" go wrong --
but the fact is that errors do happen.

If you would like to spend time and energy to figure out the cause of 
this particular error, go right ahead.  I'll help as much as I can.

> Finally the patch seems somehow wrong because it makes generic
> assumptions. We shouldn't slowdown the USB controller of a ppc based
> PS3 or XBOX360 for a PCI(add in cards) and a x86(chipsets) specific
> issue unless otherwise proven.

That is a very strong statement, and I do not totally agree with it.

For one thing, we do not know that this hardware issue is specific to
x86 chipsets or PCI add-on cards.  In fact, until people began sending
in the results of their testing, I had no way to know which EHCI
controllers were affected.  The only safe course was to assume that
they all are.

>  It is not right to waste an ARM tablet
> with an AHB USB controller. Are you sure it will not brake anything
> else? Even in x86 you slowdown half of our machines because the rest
> are broken. If we were asked, we would blacklist controllers based on
> pairs of PCI device ID and revision.

That's what I would like to do.  But nobody knows what 
vendor/device/revision values should be in the blacklist.

>  That is mandatory because
> blacklisting only vendor will work for intel but not for VIA as the
> same PCI ID is used on many different chipsets and rev has to be
> additionally used. When a known pair is met a debug only message can
> be printed and when an unknown pair is met a normal print asking user
> to send the results to this list.

No.  Because of other changes to the driver, failure to apply the
workaround when it is needed may lead to I/O errors or other problems.  
We don't want that to happen.

Alan Stern



Index: 3.8/drivers/usb/host/ehci-hcd.c
===================================================================

--- 3.8.orig/drivers/usb/host/ehci-hcd.c
+++ 3.8/drivers/usb/host/ehci-hcd.c
@@ -748,7 +748,7 @@ static irqreturn_t ehci_irq (struct usb_
 		/* guard against (alleged) silicon errata */
 		if (cmd & CMD_IAAD)
 			ehci_dbg(ehci, "IAA with IAAD still set?\n");
-		if (ehci->async_iaa)
+		if (ehci->iaa_in_progress)
 			COUNT(ehci->stats.iaa);
 		end_unlink_async(ehci);
 	}
Index: 3.8/drivers/usb/host/ehci-q.c
===================================================================
--- 3.8.orig/drivers/usb/host/ehci-q.c
+++ 3.8/drivers/usb/host/ehci-q.c
@@ -1194,8 +1194,9 @@ static void start_iaa_cycle(struct ehci_
 	 * Do nothing if an IAA cycle is already running or
 	 * if one will be started shortly.
 	 */
-	if (ehci->async_iaa || ehci->async_unlinking)
+	if (ehci->iaa_in_progress || ehci->async_unlinking)
 		return;
+	ehci->iaa_in_progress = true;
 
 	/* If the controller isn't running, we don't have to wait for it */
 	if (unlikely(ehci->rh_state < EHCI_RH_RUNNING)) {
@@ -1247,6 +1248,8 @@ static void end_unlink_async(struct ehci
 		ehci_writel(ehci, (u32) ehci->async->qh_dma,
 			    &ehci->regs->async_next);
 
+	ehci->iaa_in_progress = false;
+
 	/* Process the idle QHs */
  restart:
 	ehci->async_unlinking = true;
Index: 3.8/drivers/usb/host/ehci-timer.c
===================================================================
--- 3.8.orig/drivers/usb/host/ehci-timer.c
+++ 3.8/drivers/usb/host/ehci-timer.c
@@ -305,7 +305,7 @@ static void ehci_iaa_watchdog(struct ehc
 	 * (a) SMP races against real IAA firing and retriggering, and
 	 * (b) clean HC shutdown, when IAA watchdog was pending.
 	 */
-	if (ehci->async_iaa) {
+	if (ehci->iaa_in_progress) {
 		u32 cmd, status;
 
 		/* If we get here, IAA is *REALLY* late.  It's barely
Index: 3.8/drivers/usb/host/ehci.h
===================================================================
--- 3.8.orig/drivers/usb/host/ehci.h
+++ 3.8/drivers/usb/host/ehci.h
@@ -121,6 +121,7 @@ struct ehci_hcd {			/* one per controlle
 	bool			scanning:1;
 	bool			need_rescan:1;
 	bool			intr_unlinking:1;
+	bool			iaa_in_progress:1;
 	bool			async_unlinking:1;
 	bool			shutdown:1;
 	struct ehci_qh		*qh_scan_next;

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html