Re: Possible bug in xhci-ring.c

Julius Werner <jwerner@xxxxxxxxxxxx> · Tue, 1 Jul 2014 14:22:28 -0700

Thanks for the report, this sounds very troubling. The piece of code
you pointed out only affects single-segment transfer rings. I think
the kernel generally switched to using multi-segment transfer rings
for everything in commit 2fdcd47b69 "xHCI: Allocate 2 segments for
transfer ring", which explains why the problem doesn't affect kernels
after 3.2 (and it also means that I didn't test this case directly).
Nevertheless, we should of course make that case working.

I still think the logic in the commit is correct, it's just
unfortunately quite confusing (the code flow wasn't very readable to
begin with and I didn't do much to improve it). If you take the
surrounding code into account, the value of state->new_deq_ptr changes
quite a bit over time, and points to a different position in the two
cases: in the old code, it refers to the actual, final "new dequeue
pointer" (the TRB that the ring will be set to start at with the Set
TR Dequeue Pointer command), which should be one after cur->last_trb.
So we check if that TRB is in front of the stopped_trb (although it
should chronologically come after it), and conclude that we wrapped
around.

In my patch, state->new_deq_pointer actually points at the virtual
address of hw_dequeue (which is equivalent to the removed stopped_trb)
at that time. We haven't jumped it forward to its final position yet,
but we can still check if cur->last_trb (which should come
chronologically after it) is in front of it on the ring, and conclude
that we will wrap around once we jump forward. (For reference, the
"chronological" order of things should be [ep_ring->dequeue], which is
zero or more TRBs before [hw_dequeue], which is the same as
[stopped_trb], which is zero or more TRBs before cur->last_trb, which
is exactly one TRB before [state->new_deq_ptr at the end of the
function].)

If you have a setup that easily reproduces this bug, could you please
just gather a big load of debug output so we can get a better idea of
what it's doing? You'll need to add the CONFIG_USB_XHCI_HCD_DEBUGGING
Kconfig and then make all the dev_dbg() in that file work (easiest way
is to just '#define DEBUG' in xhci.h). Also add an explicit call of
xhci_debug_ring(xhci, ep_ring) to the top of the function and a few
more prints to see the initial values of hw_dequeue,
state->new_deq_seg, state->new_deq_ptr and state->new_cycle_state, and
how they change over the course of the function. That will hopefully
help us figure out where the logic is messed up, because I can't
really spot a mistake in there yet.

On Tue, Jul 1, 2014 at 11:34 AM, Maciej Puzio <mx34567@xxxxxxxxx> wrote:
> Hi, I am writing about commit 1f81b6d22a5980955b01e08cf27fb745dc9b686f
> "usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb".
> This commit has been introduced in kernel 3.15-rc3, and subsequently
> backported to older kernels. In kernel 3.2.59 (used by Ubuntu 12.04 LTS),
> and newer kernels in this branch, it causes a regression with varying
> symptoms, if some triggering conditions are met. The symptoms appear
> when the USB device is plugged into the USB 3.0 port, and may include
> the device not being recognized, or the device being recognized after a
> long (18 minute) delay, accompanied with various errors being logged.
> If the device is plugged in during boot, system may be unable to
> boot. The symptoms, or lack thereof, are highly depended on USB 3.0
> hardware, and on kernel version. So far, all observed cases involved
> USB 3.0 controllers from Asmedia ASM1042 family. Even with this
> controller, some USB devices do not trigger the regression, but a
> large number do, from a SD card reader, to an external HDD, to a RAID
> eclosure. For details, please see bug reports [1], [2] and [3]. The
> regression is not reproducible in kernel 3.16.0-rc3, but the code
> affected by the commit remained essentially the same (as in 3.15-rc3
> and 3.2.59), which gives a reason to believe that the bug is masked
> by another change and the regression is avoided by coincidence.
>
> I would like to focus here on results of my debugging of this issue.
> I believe that I have narrowed down the cause of the problem to a
> specific line, and I would like to ask people knowledgeable with xhci
> code if my assumptions and conclusions are correct.
>
> Commit: 1f81b6d22a5980955b01e08cf27fb745dc9b686f
> File: drivers/usb/host/xhci-ring.c
> Function: xhci_find_new_dequeue_state
>
> The regression appears to be caused by state->new_cycle_state having
> wrong value (0 rather than 1) at the end of execution of
> xhci_find_new_dequeue_state() function. Inside this function, the
> value is toggled between 0 and 1 under various conditions. It seems
> that regression is caused by flipping the value of new_cycle_state
> one time too many. The commit in question refactored the body of this
> function, and among other changes, replaced lines
>
>   if (ep_ring->first_seg == ep_ring->first_seg->next &&
>         state->new_deq_ptr < dev->eps[ep_index].stopped_trb)
>     state->new_cycle_state ^= 0x1;
>
> with lines
>
>   if (ep_ring->first_seg == ep_ring->first_seg->next &&
>          cur_td->last_trb < state->new_deq_ptr)
>     state->new_cycle_state ^= 0x1;
>
> (These lines were also moved around, so the replacement is not very
> apparent in commit's diff.)
>
> My tests suggest that the regression results when value of
> new_cycle_state is toggled in this piece of code. Specifically, I
> suspect that the problem lies in second part of the condition,
> involving state->new_deq_ptr. It appears to me that cur_td->last_trb
> in the new code is equivalent to dev->eps[ep_index].stopped_trb in
> the old code (I observed it to be either equal, or less by 16 than
> new_deq_ptr, in old and new versions). The difference between versions
> is that comparison has been reversed (effectively from '<' to '>').
> This negates this part of the expression and causes a flip of
> new_cycle_state to occur where it did not previously. Changing the
> direction of the comparison operator or commenting these lines out fixes
> the regression.
>
> The question is: was this reversal intentional, or is it a bug?
>
> I also tried to answer the question of why the regression does not
> occur in many circumstances.
>
> 1. In kernel 3.16 regression does not occur, because ep_ring seg list
> is not empty (i.e. first part of condition is not met).
> 2. In kernel 3.2.58 regression does not occur, because it has an old
> version of xhci_find_new_dequeue_state() code, and second part of
> condition is not met.
> 3. In affected kernels (3.2.59 and 3.2.60) regression does not occur
> with some USB devices, because in such cases xhci_find_new_dequeue_state()
> is not called.
> 4. In affected kernels some devices are recognized after 18 minutes,
> because at that point cur_td->last_trb is incremented by 16, becomes
> equal to state->new_deq_ptr, and second part of the expression
> becomes false.
>
> I do not fully understand why the above behavior shows so many differences
> depending on kernel version and hardware.
>
> Help will be greatly appreciated. Thank you!
>
> Maciej Puzio
>
> Links:
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1328984
> [2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1330530
> [3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1333229
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html