Re: [PATCH] usb/core/hcd.c - avoid khubd hang on HC surprise removal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 3 Jun 2010, Robert Evans wrote:

> Stratus Technologies of Maynard Massachusetts (www.stratus.com)
> manufactures fault tolerant servers.  The current Stratus products use
> Nehalem CPUs and the servers are supported with Red Hat Enterprise
> Linux, VMware ESX, Windows and VOS.
> 
> I have no access to servers with OHCI.  In my testing, although both
> EHCI and UHCI are present in the ICH10, the problem always was
> associated with the UHCI controller.

This is to be expected.  ehci-hcd and uhci-hcd are written differently
in this respect.  With ehci-hcd, if an URB is unlinked after the
controller has stopped then the URB completes then and there, within
the unlink call.  With uhci-hcd, if an URB is unlinked after the
controller has stopped then the unlink call returns immediately and the
URB completes in a different context (a timer interrupt).  I did it
this way partly because it was simpler and partly in order to avoid a
change in behavior (URBs unlinked while the controller is running
_always_ complete in a different context).

However the call path followed by the handler for that timer passes
through usb_hcd_poll_rh_status(), where it fails if the root hub has 
been unregistered.  Robert's patch removes the troublesome test, 
allowing the call to succeed.  Without the patch there's no way to 
unlink uncompleted URBs after a controller failure -- they will never 
complete and they cause khubd to hang.

The test he removes is not critical.  I originally put it in mostly for
safety's sake; I didn't realize at the time that some HCDs would
continue to need a root-hub status poll even after the root hub was
gone (which is ironic, since the driver that ended up needing it is my
own!).

At any rate, the test was always unreliable -- it doesn't synchronize
with the usb_bus_list_lock and hence it is subject to races.  In the
end, I think it's better to have the HCDs ensure that root-hub polling
stops when it's no longer needed.  The core can't do this, because a
device interrupt could cause the HCD to request another poll at any
time.

It turns out there's a related problem.  Just to be sure the root-hub
polls really do stop, the core cancels the timer in usb_remove_hcd().  
It has to wait until after the IRQ is unregistered, since an interrupt
could restart the timer (as I mentioned above).  HCDs unregister their
IRQ lines as part of their stop() callback -- but they also deallocate
their data structures in the same callback.  The consequence is that
unless the HCD is careful to stop the root-hub-poll timer at the right
point within the callback, a poll request could arrive _after_ the data
structures have been freed.  Obviously this would be disastrous.  The
test in usb_hcd_poll_rh_status() would prevent this, but as I
mentioned, it's not fully reliable.

I'm going to submit a patch to make sure each HCD does stop the timer
at the right point, as well as a couple of others that address problems
Robert uncovered in uhci-hcd during his failover testing.  In the
meantime, you should keep his patch pending -- to avoid problems it
should not be applied before the ones I will send.

Apart from the need for this delay, the patch itself is fine.  You can
add my Acked-by: to it.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux