Hi alan, On Sat, Dec 08, 2012 at 05:33:27PM -0500, Alan Stern wrote: > On Sat, 8 Dec 2012, Piergiorgio Sartor wrote: > > > Hi Alan, > > > > the output of lspci: > > > > 00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3) (prog-if 20 [EHCI]) > > Subsystem: ASUSTeK Computer Inc. Device 81c0 > > Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 22 > > Memory at fe02e000 (32-bit, non-prefetchable) [size=256] > > Capabilities: [44] Debug port: BAR=1 offset=0098 > > Capabilities: [80] Power Management version 2 > > Kernel driver in use: ehci_hcd > > nVidia, eh? Their stuff does have some weird bugs... I already heard that... :-) > > The output of "dmesg" with your patch: > > > > [ 773.387144] ehci_hcd 0000:00:0b.1: Waited too long for the async schedule status (0/8000), giving up > > Believe it or not, this one line explains everything -- and the rest of > the debugging log confirms it. This didn't show up in your earlier > kernel log because this message wasn't added until the 3.6 release, and > your earlier test used 3.5. I was using exactly the point bisect found as problematic, I was afraid to change things. Anyway, good that I kept it in the log. > Basically, this is a bug in nVidia's EHCI controller hardware. The > driver told the controller to turn off its async schedule, and 20 ms > later the schedule still was running. Although the EHCI specification > doesn't put any time limit on how long turning off the schedule may > take, in practice it shouldn't be any more than a couple of ms (and > normally much less). Question is, why it was working before kernel 3.6.0? Or, at least, why it did not show up. We know the patch triggering it (it seems), what was done before instead of that? > But then the driver needed to turn the schedule back on. The command > to do so was ignored by the controller, since it was still trying to > carry out the earlier command to turn off the schedule. Eventually -- > no way to know exactly when -- the schedule _did_ turn off. And then > it never turned back on! Is this the reason why the "ehci", after the problem is triggered, it does not work anymore? One more question, why the heavy traffic triggers it? Could it be the controller is too busy and it does not answer (within 20ms) to the request? > That's why none of the transfers from that point on were successful. > > > [ 803.804848] ehci_hcd 0000:00:0b.1: alan start cur time 4295471040 last scan 4295440395 > > [ 803.804862] ehci_hcd 0000:00:0b.1: command 10035 cmd reg 10035 io watchdog 1 async count 10 > > The "Async Schedule Enable" bit is bit 0x20 in the command register; > the debugging output shows that it is turned on. Nevertheless the > async schedule isn't running, as shown by the rest of the debugging > output. > > > One comment, I was using, apparently, kernel 3.7.0-rc4+ > > It's a good thing you did! > > > and it was quite difficult to reproduce the error. > > It happened only after I increased the CPU (and I/O) load > > doing a "make clean" in /usr/src/linux > > 3.7 waits longer for the schedule to turn off before giving up than 3.5 > does; that probably explains why the bug was harder to trigger. > > The question now is what to do about this. I suppose the waiting time > could be increased -- but how much? More of that, is it "sane" to just increase a timeout in order to workaround the issue? Again, I think the change in the code that caused this to show up must be somehow reconsidered, or not? Would it be better, after the timeout, to re-try to turn on the async schedule for a couple of times? With some wait inbetween, of course. > You can experiment with different values easily enough. In the source > file drivers/usb/host/ehci-timer.c, in the ehci_poll_ASS() function > near line 117, you'll see this: > > /* Poll again later, but give up after about 20 ms */ > if (ehci->ASS_poll_count++ < 20) { > ehci_enable_event(ehci, EHCI_HRTIMER_POLL_ASS, true); > return; > } > ehci_dbg(ehci, "Waited too long for the async schedule status (%x/%x), giving up\n", > want, actual); > > Change the 20's to something bigger, like 50 or 100, and see if that > makes the problem go away. (For thoroughness, also try changing them > to something smaller, like 5, and see if that makes the problem occur > more readily.) I'll give it a try. Thanks for the analysis and the explanation, I hope to have this fixed soon. bye, -- piergiorgio -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html