Hi, I am writing about commit 1f81b6d22a5980955b01e08cf27fb745dc9b686f "usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb". This commit has been introduced in kernel 3.15-rc3, and subsequently backported to older kernels. In kernel 3.2.59 (used by Ubuntu 12.04 LTS), and newer kernels in this branch, it causes a regression with varying symptoms, if some triggering conditions are met. The symptoms appear when the USB device is plugged into the USB 3.0 port, and may include the device not being recognized, or the device being recognized after a long (18 minute) delay, accompanied with various errors being logged. If the device is plugged in during boot, system may be unable to boot. The symptoms, or lack thereof, are highly depended on USB 3.0 hardware, and on kernel version. So far, all observed cases involved USB 3.0 controllers from Asmedia ASM1042 family. Even with this controller, some USB devices do not trigger the regression, but a large number do, from a SD card reader, to an external HDD, to a RAID eclosure. For details, please see bug reports [1], [2] and [3]. The regression is not reproducible in kernel 3.16.0-rc3, but the code affected by the commit remained essentially the same (as in 3.15-rc3 and 3.2.59), which gives a reason to believe that the bug is masked by another change and the regression is avoided by coincidence. I would like to focus here on results of my debugging of this issue. I believe that I have narrowed down the cause of the problem to a specific line, and I would like to ask people knowledgeable with xhci code if my assumptions and conclusions are correct. Commit: 1f81b6d22a5980955b01e08cf27fb745dc9b686f File: drivers/usb/host/xhci-ring.c Function: xhci_find_new_dequeue_state The regression appears to be caused by state->new_cycle_state having wrong value (0 rather than 1) at the end of execution of xhci_find_new_dequeue_state() function. Inside this function, the value is toggled between 0 and 1 under various conditions. It seems that regression is caused by flipping the value of new_cycle_state one time too many. The commit in question refactored the body of this function, and among other changes, replaced lines if (ep_ring->first_seg == ep_ring->first_seg->next && state->new_deq_ptr < dev->eps[ep_index].stopped_trb) state->new_cycle_state ^= 0x1; with lines if (ep_ring->first_seg == ep_ring->first_seg->next && cur_td->last_trb < state->new_deq_ptr) state->new_cycle_state ^= 0x1; (These lines were also moved around, so the replacement is not very apparent in commit's diff.) My tests suggest that the regression results when value of new_cycle_state is toggled in this piece of code. Specifically, I suspect that the problem lies in second part of the condition, involving state->new_deq_ptr. It appears to me that cur_td->last_trb in the new code is equivalent to dev->eps[ep_index].stopped_trb in the old code (I observed it to be either equal, or less by 16 than new_deq_ptr, in old and new versions). The difference between versions is that comparison has been reversed (effectively from '<' to '>'). This negates this part of the expression and causes a flip of new_cycle_state to occur where it did not previously. Changing the direction of the comparison operator or commenting these lines out fixes the regression. The question is: was this reversal intentional, or is it a bug? I also tried to answer the question of why the regression does not occur in many circumstances. 1. In kernel 3.16 regression does not occur, because ep_ring seg list is not empty (i.e. first part of condition is not met). 2. In kernel 3.2.58 regression does not occur, because it has an old version of xhci_find_new_dequeue_state() code, and second part of condition is not met. 3. In affected kernels (3.2.59 and 3.2.60) regression does not occur with some USB devices, because in such cases xhci_find_new_dequeue_state() is not called. 4. In affected kernels some devices are recognized after 18 minutes, because at that point cur_td->last_trb is incremented by 16, becomes equal to state->new_deq_ptr, and second part of the expression becomes false. I do not fully understand why the above behavior shows so many differences depending on kernel version and hardware. Help will be greatly appreciated. Thank you! Maciej Puzio Links: [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1328984 [2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1330530 [3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1333229 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html