Possible bug in xhci-ring.c

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi, I am writing about commit 1f81b6d22a5980955b01e08cf27fb745dc9b686f
"usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb".
This commit has been introduced in kernel 3.15-rc3, and subsequently
backported to older kernels. In kernel 3.2.59 (used by Ubuntu 12.04 LTS),
and newer kernels in this branch, it causes a regression with varying
symptoms, if some triggering conditions are met. The symptoms appear
when the USB device is plugged into the USB 3.0 port, and may include
the device not being recognized, or the device being recognized after a
long (18 minute) delay, accompanied with various errors being logged.
If the device is plugged in during boot, system may be unable to
boot. The symptoms, or lack thereof, are highly depended on USB 3.0
hardware, and on kernel version. So far, all observed cases involved
USB 3.0 controllers from Asmedia ASM1042 family. Even with this
controller, some USB devices do not trigger the regression, but a
large number do, from a SD card reader, to an external HDD, to a RAID
eclosure. For details, please see bug reports [1], [2] and [3]. The
regression is not reproducible in kernel 3.16.0-rc3, but the code
affected by the commit remained essentially the same (as in 3.15-rc3
and 3.2.59), which gives a reason to believe that the bug is masked
by another change and the regression is avoided by coincidence.

I would like to focus here on results of my debugging of this issue.
I believe that I have narrowed down the cause of the problem to a
specific line, and I would like to ask people knowledgeable with xhci
code if my assumptions and conclusions are correct.

Commit: 1f81b6d22a5980955b01e08cf27fb745dc9b686f
File: drivers/usb/host/xhci-ring.c
Function: xhci_find_new_dequeue_state

The regression appears to be caused by state->new_cycle_state having
wrong value (0 rather than 1) at the end of execution of
xhci_find_new_dequeue_state() function. Inside this function, the
value is toggled between 0 and 1 under various conditions. It seems
that regression is caused by flipping the value of new_cycle_state
one time too many. The commit in question refactored the body of this
function, and among other changes, replaced lines

  if (ep_ring->first_seg == ep_ring->first_seg->next &&
        state->new_deq_ptr < dev->eps[ep_index].stopped_trb)
    state->new_cycle_state ^= 0x1;

with lines

  if (ep_ring->first_seg == ep_ring->first_seg->next &&
         cur_td->last_trb < state->new_deq_ptr)
    state->new_cycle_state ^= 0x1;

(These lines were also moved around, so the replacement is not very
apparent in commit's diff.)

My tests suggest that the regression results when value of
new_cycle_state is toggled in this piece of code. Specifically, I
suspect that the problem lies in second part of the condition,
involving state->new_deq_ptr. It appears to me that cur_td->last_trb
in the new code is equivalent to dev->eps[ep_index].stopped_trb in
the old code (I observed it to be either equal, or less by 16 than
new_deq_ptr, in old and new versions). The difference between versions
is that comparison has been reversed (effectively from '<' to '>').
This negates this part of the expression and causes a flip of
new_cycle_state to occur where it did not previously. Changing the
direction of the comparison operator or commenting these lines out fixes
the regression.

The question is: was this reversal intentional, or is it a bug?

I also tried to answer the question of why the regression does not
occur in many circumstances.

1. In kernel 3.16 regression does not occur, because ep_ring seg list
is not empty (i.e. first part of condition is not met).
2. In kernel 3.2.58 regression does not occur, because it has an old
version of xhci_find_new_dequeue_state() code, and second part of
condition is not met.
3. In affected kernels (3.2.59 and 3.2.60) regression does not occur
with some USB devices, because in such cases xhci_find_new_dequeue_state()
is not called.
4. In affected kernels some devices are recognized after 18 minutes,
because at that point cur_td->last_trb is incremented by 16, becomes
equal to state->new_deq_ptr, and second part of the expression
becomes false.

I do not fully understand why the above behavior shows so many differences
depending on kernel version and hardware.

Help will be greatly appreciated. Thank you!

Maciej Puzio

Links:
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1328984
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1330530
[3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1333229
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux