On Mon, Jan 11, 2016 at 02:14:58PM +0200, Mathias Nyman wrote: > On 08.01.2016 17:19, Ron wrote: > >On Fri, Jan 08, 2016 at 04:10:09PM +0200, Mathias Nyman wrote: > >>On 07.01.2016 20:47, Ron wrote: > >>>On Fri, Jan 08, 2016 at 03:09:20AM +1030, Ron wrote: > >>>>On Fri, Jan 08, 2016 at 02:52:28AM +1030, Ron wrote: > >>>>>On Thu, Jan 07, 2016 at 05:38:09PM +0200, Mathias Nyman wrote: > >>>>>>Hi > >>>>>> > >>>>>>On 02.01.2016 08:32, Ron wrote: > >>>>>>> > >>>>>>>Hi, > >>>>>>> > >>>>>>>It appears the commit e210c422b6fdd2dc123bedc588f399aefd8bf9de > >>>>>>>"xhci: don't finish a TD if we get a short transfer event mid TD" > >>>>>>>is causing transfers larger than 16kB to be unreliable. > >>>>>>> > >>>>>>>If I limit transfers to be no larger than 16kB, then it also works as > >>>>>>>expected in an XHCI port with an unmodified build of Linus' current > >>>>>>>head (v4.4-rc7-76-g9c982e8), but transfers larger than that do not. > >>>>>>>I see an alternating cycle of a successful transfer, followed by two > >>>>>>>that will time out waiting in libusb (with a 5 second timeout set), > >>>>>>>before getting another successful transfer and the cycle repeating. > >>>>>>> > >>>>>>>I can run more tests and dig into this deeper if the reason for it > >>>>>>>isn't immediately obvious in hindsight. > >>>>>>> > >>>>>> > >>>>>>Thanks for the info, > >>>>>>I can't spot anything obvious, but my brain might still be in vacation mode. > >>>>>> > >>>>>>Could you reproduce it with the attached patch, it only adds extra debugging? > >>>>>> > >>>>>>We should either see no output, or the following sequence: > >>>>>> > >>>>>> 1. "mid bulk/intr SP, wait for last TRB event" > >>>>>> 2. "last trb has length set" > >>>>>> 3. "and last trb is SHORT_TX, OK" > >>>>> > >>>>> > >>>>>I guess one out of 3 ain't good ... all I see logged is: > >>>>> > >>>>> [ 60.015708] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event > >>>>> [ 65.017374] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event > >>>>> [ 70.455451] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event > >>>>> [ 75.456248] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event > >>>>> > >>>>>I'm passing 5 seconds to libusb as the requested timeout. > >>>> > >>>>And if I limit the maximum transfer size to 16kB (the above was with > >>>>64kB transfers), then I see nothing logged at all. > >>>> > >>>>So if that code does indeed look sane, perhaps the issue is really > >>>>in the code that's splitting large transfers doing something funny? > >>> > >>>So, tracing this a bit deeper, what I'm seeing is that in the case of > >>>a transfer of 16kB or less, there's only one TRB in the TD, so it's > >>>always the last_trb and the wait code never triggers (ie. basically > >>>the same code path as with this patch reverted). > >>> > >>>In the case of a larger transfer, there are N / 16kB TRBs in the TD, > >>>and in the case where it stalls, I get a short transfer on the first > >>>TRB, and then it just hangs waiting until it times out, reporting: > >>> > >>> xhci_hcd ep 0x81 - asked for 65536 bytes, 16382 bytes untransferred > >>> > >>>with event_trb == td->first_trb. > >>> > >>> > >>>What's still not clear to me yet is which assumption is broken. > >>>Clearly we aren't actually getting more TRB events after the short > >>>transfer. > >>> > >>>And the other thing which possibly makes this odd is that the FTDI will > >>>always return 2 bytes in a packet, even when there "no data" in it, > >>>since it prefixes each packet with 2 "modem status" bytes. Which means > >>>the 16382 untransferred above is it returning 2 status bytes and none > >>>of the data that we actually requested from it (which is a normal thing > >>>for it to do sometimes, and if we issue another read the data will > >>>eventually come - but I don't yet know if that's also contributing here). > >>> > >>> > >>>Sorry for the trickle of small facts, I'm still wrapping my head around > >>>everything I need to know to understand this properly. > >>> > >> > >>This is all good information. > >> > >>We mark all TRBs for incoming bulk transfers as "interrupt on short transfer" > >>and the last TRB of the TD as "interrupt on completion" > >> > >>If the we get a short packet event mid TD xhc should create two events, the first > >>one for the trb that was short, and a second one one the last trb, with the "interrupt on completion" > > > >There is one thing we need to be careful of here, with the FTDI at least. > >Once we get a short packet, we do actually need to end the transaction > >there and not append any further data to it. The reason for that is > >those modem status bytes. The only way to find them and filter them from > >the actual data you want is to split the received transfer back into > >wMaxPacketSize'd blocks and remove the first 2 bytes from each packet, > >since it prepends them to every packet. > > > >If a subsequent trb after the short was to add more data to that, we'd > >lose that framing and not be able to extract the data from it reliably > >(since we wouldn't know where the short in the middle ended). > > > >I'm not sure if the events we expect in this case already cover that, > >but it seemed worth mentioning in case they might return more data. > > > >>There was an errata added to the specs specifying in more detail what this second events status should be, > >>but the second event should be there anyways. controller will "fast forward" through the remains of the TD after > >>a short packet, generating the event. > > > >If I understand that part though, then we shouldn't actually get more > >data, just a final completion event for the whole TD ... > > > >>It's possible that something goes wrong in updating the dequeue pointer of either the endpoint ring or > >>event ring, leading to handling the wrong event, or wrong trb for that endpoint (perhaps handling the same "mid TD" TRB twice?) > >> > >>I have yet another patch. This one prints out the content of the short transfer event, and the event following a short transfer. > > > >Ok, here's a longer dump of what that outputs, which I'll annotate to > >show what's actually going on with the device at each point. > > > >The extra values in parens of the "asked for" lines are > >(td->last_trb - td->first_trb), and (event_trb - td->first_trb) > >which was how I tracked which trb it was stalling on earlier. > > > > > > > The logs shows that we never get the second event for the TD, > > The transfer ring and corresponding events: > EP TRB addr, event, comment > > d207cce0, 01(Success), Previous event was a SHORT > d207ccf0, 0d(SHORT ), only TRB of 1 long TD, returned > -new TD- > d207cd00, 0d(SHORT ), 1st TRB of 4 long TD. Previous event was SHORT. wait > d207cd10, , no event > d207cd20, , no event > d207cd30, , no event > -new TD- > d207cd40, 0d(SHORT), 1st TRB of 4 long TD. Previous event was short. wait > d207cd50, , no event > d207cd60, , no event > d207cd70, , no event > -new TD- > d207cd80, , no event > d207cd90, , no event > d207cda0, , no event > d207cdb0, 01(Success), Previous event was short > - new TD - > d207cdc0, 0d(SHORT), only TRB in 1 long TD > - new TD - > ... > > There are no events for the TDs last TRB at d207cd30 or d207cd70. > An event would have been logged as the previous events were short transfers (at d207cd00 and d207cd40) > > Looks like some controllers are not following the specs and we can't rely on the second event. > > So the patch should be reverted, and instead handle the controllers that work according to specs > as a special case. So we need to return the TD after the first SHORT event, and then make sure a possible > second event doesn't mess up anything. > > Just for reference, xhci specs 1.0 section 4.10.1.1 specify that: > > "If the Short Packet occurred while processing a Transfer TRB with only an ISP flag set, then > two events shall be generated for the transfer; one for the Transfer TRB that the short > Packet occurred on, and a second for the last TRB with the IOC flag set. > > Software shall not interpret an Short Packet Event as indicating that the TD that it is > associated with is “complete”, unless the TRB Pointer field of the Transfer Event references > the last TRB of the TD." > > ISP = interrupt on short packet ( we set for all bulk in TRBs) > IOC = interupt on completion ( we set for last TRB in TD) One last data point then, if it's looking like a controller snafu, the logs above were from testing on a bare metal host, and I'd first seen this bug on a different machine, also bare metal (both motherboards are Gigabyte, but different ages and one has an 'Intel chipset' on it, the other AMD) ... But I've also reproduced this in a QEMU/KVM VM using the nec-xhci controller that it provides. In that case the device was actually being passed through to the guest from an EHCI socket on the host machine. I haven't yet seen a machine where this does work as expected. Thanks! Ron -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html