Re: Help needed for EHCI problem: removing an active bulk-in QH

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 22 Oct 2015 17:14:41 -0400 (EDT)

[Resend with Hans's correct email address this time...]

On Thu, 22 Oct 2015, Alan Stern wrote:

> Hans and everyone else:
> 
> This continues the discussion of a problem originally posted to the 
> libusb-devel mailing list
> (see <http://marc.info/?l=libusb-devel&m=144423444825269&w=2> if 
> you're curious).
> 
> The EHCI controller in question is an AMD/ATI SB7x0/SB8x0/SB9x0, as 
> found on the RX780/RX790 motherboard.  I haven't seen this problem 
> occur with Intel hardware.
> 
> The problem arises when an active bulk-in QH is removed from the async
> schedule.  The current qTD is cancelled, and it is the last qTD on the
> QH's queue.  At the time the QH is removed from the async list, the
> overlay region shows that only a fraction of the qTD has been completed
> (maybe 4 KB transferred out of 16 KB total).
> 
> 10 ms later, four new qTDs are added to the QH and it gets added back
> to the async schedule.  Although I don't know this for certain, I
> believe the second of these qTDs is stored at the same address as the
> one that was cancelled.  That's what naturally would happen if the
> memory pool satisfies an allocation from the most recently freed area.
> 
> Anyway, a short time later, it sometimes happens that the controller
> gets stuck.  The Active bit in the QH's overlay region is clear, and
> the Current and Next qTD pointers both point to the second qTD in the
> queue, which obviously is why the controller is not making any forward
> progress.  The first qTD's Active bit is still set and its Bytes To
> Transfer is still set to 16 KB.  The second qTD's Active bit is off and
> its Bytes To Transfer is 0.  In spite of this, neither qTD's data
> buffer has been overwritten.
> 
> Although it's hard to tell exactly what went wrong, my guess is that
> after the QH was removed from the async schedule, the controller
> continued to process it until all 16 KB had been transferred.  (This
> would have taken no more than 0.5 ms.)  Then at some point, the QH
> overlay and the now-completed qTD were written back -- that would
> explain why the second qTD in the queue shows up as not Active and with
> no bytes remaining to transfer.
> 
> On the other hand, that qTD wasn't reused until 10 ms after the QH was
> removed from the schedule, and it was completely reinitialized before
> reuse.  The write-back must have occurred later than this; I have no
> idea why.  I also don't know why the write-back of the QH's overlay
> region didn't overwrite the Next qTD pointer.
> 
> 
> This is clearly a complicated problem.  It's possible that we're simply
> dealing with defective hardware, but I tend to doubt it.  It seems more
> likely that the problem is caused by improperly removing the active QH
> from the async schedule.  The driver does not follow the instructions
> given in section 4.8.2 of the EHCI spec, which says that software
> should not remove active QHs.
> 
> [In practice it's not feasible to wait for an active QH to become
> inactive before removing it, for several reasons.  For one, the QH may
> _never_ become inactive (if the endpoint NAKs indefinitely).  For
> another, the procedure given in the spec (deactivate the qTDs on the
> queue) is racy, since the controller can perform a new overlay or
> writeback at any time.]
> 
> In an attempt to cope with potential problems, the Linux EHCI driver
> goes through _two_ Interrupt on Async Advance (IAA) cycles after taking
> a QH off the async list before considering it to be fully gone from the
> schedule.  (I have observed situations where the QH overlay region was
> written back _after_ the first IAA interrupt.)  But it seems that this
> isn't enough.
> 
> As far as I can see, the only alternative is to stop the async schedule
> whenever an active QH has to be removed.  But that would impose a
> significant penalty on any other async transfers, so I really don't
> want to do it.
> 
> Hans, can you describe how the BSD EHCI driver handles this issue?  
> 
> Any ideas for fixing this or suggestions for additional debugging would 
> be welcome.
> 
> Alan Stern

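For anyone who wants to look for the same symptom, the stuck state
described above (overlay Active bit clear, Current and Next qTD
pointers equal) can be checked mechanically from a dump of the QH
overlay.  What follows is only a rough sketch and is not taken from
ehci-hcd: the struct, macro and function names are all hypothetical.
The bit positions are the ones I understand the EHCI spec to use for
the qTD token (Active = bit 7 of the status byte, Total Bytes to
Transfer = bits 30:16) and for link pointers (T-bit = bit 0, low five
bits not part of the address).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; bit layout assumed from the EHCI qTD definition. */
#define TOKEN_ACTIVE   (1u << 7)   /* status byte, Active bit */
#define PTR_TERMINATE  1u          /* T-bit of a link pointer */
#define PTR_ADDR_MASK  (~0x1fu)    /* qTDs are 32-byte aligned */

/* CPU-order copy of the three overlay words we care about. */
struct qh_overlay_snapshot {
	uint32_t current_qtd;   /* Current qTD pointer */
	uint32_t next_qtd;      /* Next qTD pointer */
	uint32_t token;         /* overlay token (status, bytes left) */
};

/* Extract the Total Bytes to Transfer field from a token word. */
static uint32_t token_bytes_left(uint32_t token)
{
	return (token >> 16) & 0x7fffu;
}

/*
 * True if the snapshot matches the stuck state: the overlay is not
 * Active, the Next pointer is valid, and it points at the same qTD
 * as the Current pointer, so the controller cannot advance.
 */
static bool qh_looks_stuck(const struct qh_overlay_snapshot *qh)
{
	assert(qh != NULL);

	if (qh->token & TOKEN_ACTIVE)
		return false;           /* still executing a qTD */
	if (qh->next_qtd & PTR_TERMINATE)
		return false;           /* queue simply empty */
	return (qh->current_qtd & PTR_ADDR_MASK) ==
	       (qh->next_qtd & PTR_ADDR_MASK);
}
```

A check like this would have to run on a snapshot of the overlay
converted to CPU byte order, and a positive result is only a hint:
it says nothing about how the QH got into that state.
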