On Thu, 3 Jun 2010, Robert Evans wrote: > Stratus Technologies of Maynard Massachusetts (www.stratus.com) > manufactures fault tolerant servers. The current Stratus products use > Nehalem CPUs and the servers are supported with Red Hat Enterprise > Linux, VMware ESX, Windows and VOS. > > I have no access to servers with OHCI. In my testing, although both > EHCI and UHCI are present in the ICH10, the problem always was > associated with the UHCI controller. This is to be expected. ehci-hcd and uhci-hcd are written differently in this respect. With ehci-hcd, if an URB is unlinked after the controller has stopped then the URB completes then and there, within the unlink call. With uhci-hcd, if an URB is unlinked after the controller has stopped then the unlink call returns immediately and the URB completes in a different context (a timer interrupt). I did it this way partly because it was simpler and partly in order to avoid a change in behavior (URBs unlinked while the controller is running _always_ complete in a different context). However the call path followed by the handler for that timer passes through usb_hcd_poll_rh_status(), where it fails if the root hub has been unregistered. Robert's patch removes the troublesome test, allowing the call to succeed. Without the patch there's no way to unlink uncompleted URBs after a controller failure -- they will never complete and they cause khubd to hang. The test he removes is not critical. I originally put it in mostly for safety's sake; I didn't realize at the time that some HCDs would continue to need a root-hub status poll even after the root hub was gone (which is ironic, since the driver that ended up needing it is my own!). At any rate, the test was always unreliable -- it doesn't synchronize with the usb_bus_list_lock and hence it is subject to races. In the end, I think it's better to have the HCDs ensure that root-hub polling stops when it's no longer needed. The core can't do this, because a device interrupt could cause the HCD to request another poll at any time. It turns out there's a related problem. Just to be sure the root-hub polls really do stop, the core cancels the timer in usb_remove_hcd(). It has to wait until after the IRQ is unregistered, since an interrupt could restart the timer (as I mentioned above). HCDs unregister their IRQ lines as part of their stop() callback -- but they also deallocate their data structures in the same callback. The consequence is that unless the HCD is careful to stop the root-hub-poll timer at the right point within the callback, a poll request could arrive _after_ the data structures have been freed. Obviously this would be disastrous. The test in usb_hcd_poll_rh_status() would prevent this, but as I mentioned, it's not fully reliable. I'm going to submit a patch to make sure each HCD does stop the timer at the right point, as well as a couple of others that address problems Robert uncovered in uhci-hcd during his failover testing. In the meantime, you should keep his patch pending -- to avoid problems it should not be applied before the ones I will send. Apart from the need for this delay, the patch itself is fine. You can add my Acked-by: to it. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html