Help needed for EHCI problem: removing an active bulk-in QH

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 22 Oct 2015 17:08:29 -0400 (EDT)

Hans and everyone else:

This continues the discussion of a problem originally posted to the 
libusb-devel mailing list
(see <http://marc.info/?l=libusb-devel&m=144423444825269&w=2> if 
you're curious).

The EHCI controller in question is an AMD/ATI SB7x0/SB8x0/SB9x0, as 
found on the RX780/RX790 motherboard.  I haven't seen this problem 
occur with Intel hardware.

The problem arises when an active bulk-in QH is removed from the async
schedule.  The current qTD is cancelled, and it is the last qTD on the
QH's queue.  At the time the QH is removed from the async list, the
overlay region shows that only a fraction of the qTD has been completed
(maybe 4 KB transferred out of 16 KB total).

10 ms later, four new qTDs are added to the QH and it gets added back
to the async schedule.  Although I don't know this for certain, I
believe the second of these qTDs is stored at the same address as the
one that was cancelled.  That's what naturally would happen if the
memory pool satisfies an allocation from the most recently freed area.

Anyway, a short time later, it sometimes happens that the controller
gets stuck.  The Active bit in the QH's overlay region is clear, and
the Current and Next qTD pointers both point to the second qTD in the
queue, which obviously is why the controller is not making any forward
progress.  The first qTD's Active bit is still set and its Bytes To
Transfer is still set to 16 KB.  The second qTD's Active bit is off and
its Bytes To Transfer is 0.  In spite of this, neither qTD's data
buffer has been overwritten.
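
For illustration, the stuck state just described could be recognized
with a check along the following lines.  This is only a sketch: the
struct and function names (qtd_model, qh_overlay_model, qh_looks_stuck)
are hypothetical simplifications, not the driver's real data
structures.  The bit positions follow the qTD layout in the EHCI spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per the EHCI qTD layout: bit 7 of the token is the Active bit, and
 * the low 5 bits of a qTD pointer hold the Terminate and reserved bits. */
#define QTD_STS_ACTIVE  (1u << 7)
#define QTD_ADDR(ptr)   ((ptr) & ~0x1fu)

/* Hypothetical, simplified view of one qTD. */
struct qtd_model {
    uint32_t dma;       /* bus address of this qTD */
    uint32_t hw_token;  /* status bits and byte count */
};

/* Hypothetical, simplified view of the QH overlay region. */
struct qh_overlay_model {
    uint32_t hw_current;   /* Current qTD pointer */
    uint32_t hw_qtd_next;  /* Next qTD pointer */
    uint32_t hw_token;     /* overlay copy of the token */
};

/* True when the overlay is inactive, Current and Next point at the same
 * qTD, and the first qTD on the queue is still Active and is not the
 * one the overlay points to. */
bool qh_looks_stuck(const struct qh_overlay_model *ov,
                    const struct qtd_model *first)
{
    bool overlay_idle = !(ov->hw_token & QTD_STS_ACTIVE);
    bool self_linked = QTD_ADDR(ov->hw_current) == QTD_ADDR(ov->hw_qtd_next);
    bool first_pending = (first->hw_token & QTD_STS_ACTIVE) &&
                         QTD_ADDR(ov->hw_current) != QTD_ADDR(first->dma);

    return overlay_idle && self_linked && first_pending;
}
```

A check like this only flags the symptom; it says nothing about how
the overlay got into that state.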

Although it's hard to tell exactly what went wrong, my guess is that
after the QH was removed from the async schedule, the controller
continued to process it until all 16 KB had been transferred.  (This
would have taken no more than 0.5 ms.)  Then at some point, the QH
overlay and the now-completed qTD were written back -- that would
explain why the second qTD in the queue shows up as not Active and with
no bytes remaining to transfer.
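
As a rough plausibility check on the 0.5 ms figure, the best-case
transfer time can be estimated as below.  The numbers are assumptions
not stated in the report: a high-speed endpoint with 512-byte packets,
an otherwise idle bus, and the theoretical ceiling of 13 bulk packets
per 125 us microframe.  The function name is hypothetical.

```c
#include <stdint.h>

/* Assumed values, not taken from the report above. */
#define HS_BULK_MAX_PACKET       512u  /* bytes per high-speed bulk packet */
#define HS_BULK_PKTS_PER_UFRAME  13u   /* theoretical ceiling per microframe */
#define UFRAME_US                125u  /* microframe length in microseconds */

/* Best-case time in microseconds to move `bytes` over one bulk
 * endpoint, rounded up to whole microframes. */
uint32_t bulk_best_case_us(uint32_t bytes)
{
    uint32_t packets = (bytes + HS_BULK_MAX_PACKET - 1) / HS_BULK_MAX_PACKET;
    uint32_t uframes = (packets + HS_BULK_PKTS_PER_UFRAME - 1) /
                       HS_BULK_PKTS_PER_UFRAME;

    return uframes * UFRAME_US;
}
```

Under those assumptions, 16 KB is 32 packets, which fit in three
microframes (375 us).  A real device will usually be slower than this
ceiling, so the result is a lower bound on the time rather than a
guarantee.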

On the other hand, that qTD wasn't reused until 10 ms after the QH was
removed from the schedule, and it was completely reinitialized before
reuse.  The write-back must have occurred later than this; I have no
idea why.  I also don't know why the write-back of the QH's overlay
region didn't overwrite the Next qTD pointer.

This is clearly a complicated problem.  It's possible that we're simply
dealing with defective hardware, but I tend to doubt it.  It seems more
likely that the problem is caused by improperly removing the active QH
from the async schedule.  The driver does not follow the instructions
given in section 4.8.2 of the EHCI spec, which says that software
should not remove active QHs.

[In practice it's not feasible to wait for an active QH to become
inactive before removing it, for several reasons.  For one, the QH may
_never_ become inactive (if the endpoint NAKs indefinitely).  For
another, the procedure given in the spec (deactivate the qTDs on the
queue) is racy, since the controller can perform a new overlay or
writeback at any time.]

In an attempt to cope with potential problems, the Linux EHCI driver
goes through _two_ Interrupt on Async Advance (IAA) cycles after taking
a QH off the async list before considering it to be fully gone from the
schedule.  (I have observed situations where the QH overlay region was
written back _after_ the first IAA interrupt.)  But it seems that this
isn't enough.
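
The two-cycle rule can be pictured as a small counter per unlinked QH.
The sketch below is a hypothetical model of that idea only; the names
(unlink_model, unlink_begin, unlink_on_iaa) are made up for
illustration and do not mirror the driver's actual code.

```c
#include <stdbool.h>

#define IAA_CYCLES_REQUIRED 2u  /* two cycles, per the description above */

/* Hypothetical per-QH bookkeeping for an unlink in progress. */
struct unlink_model {
    unsigned int iaa_seen;  /* IAA interrupts since the QH left the list */
};

/* Call when the QH is taken off the async list. */
void unlink_begin(struct unlink_model *u)
{
    u->iaa_seen = 0;
}

/* Call once per IAA interrupt.  Returns true once enough cycles have
 * passed for the QH to be treated as gone from the schedule. */
bool unlink_on_iaa(struct unlink_model *u)
{
    if (u->iaa_seen < IAA_CYCLES_REQUIRED)
        u->iaa_seen++;

    return u->iaa_seen >= IAA_CYCLES_REQUIRED;
}
```

With this model the first IAA interrupt after unlink_begin() returns
false and the second returns true.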

As far as I can see, the only alternative is to stop the async schedule
whenever an active QH has to be removed.  But that would impose a
significant penalty on any other async transfers, so I really don't
want to do it.

Hans, can you describe how the BSD EHCI driver handles this issue?  

Any ideas for fixing this or suggestions for additional debugging would 
be welcome.

Alan Stern
