RE: [PATCH 0/4 v5] xHCI: PM patchset

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 29 Jul 2010 11:11:22 -0400 (EDT)

On Thu, 29 Jul 2010, Xu, Andiry wrote:

> > > It looks like the problem is the special-case test in
> hub_port_init().
> > > It skips the reset if the device is new and is running at
> SuperSpeed.
> > > Clearly this is not the right criterion.
> > 
> > What I wanted to do was skip the port reset when a USB 3.0 device is
> > first plugged into the system.  When the device undergoes link
> training,
> > it has already gone through a warm port reset, so there's no need to
> > reset the device twice.  Now, that's just an optimization, and it
> > doesn't really matter to me if USB 3.0 devices get reset during an
> > initial connection.

We may need to rethink this somewhat.  For example, are you sure it's a
good idea to go through link training before hub_port_init is called?  
To me, it's starting to look as though that reset call should be where
the link is trained.

That's more or less how it works under USB 2.0.  Neither the kernel nor
the hardware knows whether a new device will run at high speed until
that reset completes.

> > I guess I don't understand what the issue is with the special case in
> > hub_port_init().  I thought the real issue was that the xHCI host
> > controller is has been reset and has lost its memory about any devices
> > that the USB core has allocated.
> > 
> 
> Hmm...I think so. xHCI driver allocates memory for each device to
> reflect its state and other information. If xHC is reset, these memories
> are still there, but their data is totally invalid. In the end these
> memories will be freed and re-allocated when devices are re-initialized.

That's the problem Andiry encountered, but it's only a symptom of a 
deeper problem.  The deeper problem is that the kernel and the hardware 
have differing notions about whether or not a device exists, and that 
difference will lead to trouble.

In hub_port_init the kernel believes there is a device attached to the 
specified port.  If the xHCI controller doesn't (e.g., if it has just 
been reset), then it needs to be told to find out at this point.  
However usbcore doesn't know what the controller hardware believes, so 
the test and corresponding action must be carried out within xhci-hcd.

> > USB 2.0 devices have a similar issue.  If the xHCI hardware doesn't
> know
> > about them (i.e. the usb_device structure is not allocated), then
> > xhci_reset_device() will return an error status.
> > 
> > The initialization sequence from the xHCI driver's perspective is like
> > this:  When usb_alloc_dev() is called and the device is under xHCI,
> then
> > xhci_alloc_dev() is called, and the device is issued a slot ID by the
> > xHC host.  When hub_set_address() is called, xhci_address_device() is
> > called.

Okay, so now it appears this is wrong.  xhci_alloc_dev should be called
by the reset code, if the device structure doesn't already exist.  
(There may be similar issues regarding when the slot ID is released.)  
If everything does already exist, then the device can be reset
normally.

> > xhci_address_device() sets up the device slot with a default control
> > endpoint ring and some information about device speed and parent hub
> is
> > stored in the device context.  Then it issues a command to the xHCI
> host
> > controller to set the address for the device and add the control
> > endpoint to the host controller's schedule.  Until this command
> > completes, a call to xhci_reset_device() will fail.  This
> initialization
> > happens for all speeds of devices.

What happens if there already is a default control endpoint ring, etc.?

> > So I think the issue that Andiry is running into is that the xHCI host
> > controller is reset, and it loses all the device context state.  (I'm
> > not sure if it's leaking memory at that point, Andiry may want to turn
> > on kmemleak to check.)  The USB core still thinks the device exists,
> > since the usb_device structure is still present, and so it tries to
> call
> > xhci_reset_device() after the failed resume.

Yes.

> I called xhci_free_virt_device() for all the devices after xHC is reset.
> The command ring and event ring are not freed but will be
> re-initialized. There should not be memory leak as I can see. If they
> are not freed after xHC reset, USB core will free them later after reset
> failure. The issue is USB core does not know xHC reset and it will do
> regular steps which are useless: resume the device, try resetting the
> device for several times, etc. 

But the reset should _not_ fail.  Instead of trying to actually reset
the port, it should cause xhci-hcd to allocate a new device slot and a
whole bunch of new data structures, or possibly to re-use the existing
structures.

Maybe the callback name should be changed to 
xhci_discover_or_reset_device.

> > > No.  I'm saying that xhci-hcd's address_device callback should
> actually
> > > change the device's address to the value assigned by the kernel.
> > >
> > > Hmm, the Wireless USB core also seems to have its own ideas about
> > > addressing devices.  It wouldn't be surprising if they end up
> causing
> > > trouble too.
> > 
> > I don't have a way to ask the xHCI host controller to issue a
> particular
> > address to a device.  The problem is that the hardware supports
> > per-device virtualization, and guests must not be allowed to pick an
> > address that conflicts with another guest.  So the hardware picks the
> > address during the initial call into the xHCI driver's
> > xhci_address_device() in hub_set_address().
> > 
> > What's worse is that the hardware checks the control endpoint for a
> > SetAddress command, and doesn't allow software to manually set the
> > address for the device.  See section 4.6.5:
> > 
> > "If the xHC detects a SET_ADDRESS request on the Default Control
> > Endpoint Transfer Ring, it shall generate a TRB Error Completion
> Status
> > for the TD. The xHC shall never forward a SET_ADDRESS request on a
> > Default Control Endpoint Transfer Ring to a USB device."

Hmm.  There isn't any Get-Address request, is there?  (It would be
pointless, since the host controller couldn't transmit the request
without already knowing the address.)  So the kernel will have no way
of knowing that the device's actual address isn't what the kernel
thinks it is.

The conclusion is that Andiry's suggestion was correct.  In addition to
invoking the address_device callback, hub_set_address should also call
update_address.  Or to put it another way, the call to update_address
in hub_set_address should be moved outside the "else" clause, down next
to the usb_set_device_state call.  And xhci_address_device shouldn't 
store the actual hardware address in the device structure; it should 
keep that information private.

> > But I think there's a deeper issue with the failed host controller
> > resume.  I think we're seeing this char file conflict messages because
> > there is a difference in what the USB core and the xHCI host think the
> > device state is.
> > 
> > I think what needs to happen in that case is to tell the USB core that
> > all devices under the host have disconnected, and then force them to
> be
> > re-enumerated.  Or even possibly re-initialize the host controller,
> > since the event ring pointers and such may be lost.  Andiry, what do
> you
> > think?
> > 
> 
> I guess so. Since device address is assigned by hardware, I'm not sure
> if there will be any issues if kernel uses a different address. If USB
> core can be notified about xHC reset, and free all the devices and then
> re-enumerate them one by one, I think there will not be any conflicts.
> The event ring and command ring are re-initialized after xHC reset in my
> patch.

No!  That is very definitely the wrong approach.  The core _has_ to 
believe that the devices still exist and that they can be restored to 
their previous states.  If you change that, Linus will get very mad at 
you.  :-)

Reinitializing the host controller is okay.  In fact, you should be 
doing that anyway if it tells you that it didn't retain its state 
during a suspend.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html