Re: Possible bug in xhci-ring.c

Julius Werner <jwerner@xxxxxxxxxxxx> · Mon, 7 Jul 2014 18:37:40 -0700

Thanks Maciej, that was exactly what I needed! Looks like the problem
in your situation is that last_trb is actually (chronologically)
before hw_dequeue, which the code doesn't expect. You enqueue a
single-TRB-TD at 2f45e080 (so last_trb points there as well), which
encounters a Stall Error. I would guess that the Event TRB for that
error still returns 2f45e080 in its TRB Pointer (which the old code
stored in the stopped_trb variable that I removed), but the hw_dequeue
value we read from the Endpoint Context is 2f45e090.

I don't see this behavior on a DesignWare and an Intel xHC that I have
here, so I guess your Asmedia host controller is simply quirky/broken.
I think it should really store the failed TRB in the TR Dequeue
Pointer of the Endpoint Context when encountering an error, and not
just advance to the next TRB on its own... but I have a hard time
finding a decisive rule for that in the XHCI spec right now (it makes
this clear for the Stop Endpoint Command, but not really for errors).
Sarah, Mathias, do you know if this is specified somewhere or is it
really left up to the implementation?

Nevertheless, we should try to accommodate for this somehow. I think
the only way to catch it is to already look for last_trb on first
search for hw_dequeue, and cycle the bit once more if we encounter it
first (to counteract the incorrect cycling later on). Unfortunately
that will make this whole thing yet more complicated, but it should
work. I'll try to submit a patch tomorrow, but you will need to test
it on that host controller (for both 3.2 and later).

> Some additional thoughts:
>
>> The piece of code you pointed out only affects single-segment
>> transfer rings. I think the kernel generally switched to using
>> multi-segment transfer rings for everything in commit 2fdcd47b69
>> "xHCI: Allocate 2 segments for transfer ring", which explains why
>> the problem doesn't affect kernels after 3.2
>
> Does this mean that in kernels after 3.2 the problematic code line
> (that toggles new_cycle_state) is never executed, i.e. dead code?

Well... yes, it's never executed, but I wouldn't call it dead code.
It's just coincidence that the kernel currently doesn't use
single-segment rings, there's no intrinsic reason for it and it might
change in the future. I think it's still a good idea to keep the case
working.

Also, really the only reason later kernels work for that controller is
because we don't handle the case where a TD wraps all the way around a
whole multi-segment ring back to the same segment. We should do the
extra cycle in that case as well (and a normal transfer would look
like that on your controller), so I think we should fix both of these
bug and make sure the cycle state is correct in every conceivable
setup.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html