Hans and everyone else:

This continues the discussion of a problem originally posted to the libusb-devel mailing list (see <http://marc.info/?l=libusb-devel&m=144423444825269&w=2> if you're curious). The EHCI controller in question is an AMD/ATI SB7x0/SB8x0/SB9x0, as found on the RX780/RX790 motherboard. I haven't seen this problem occur with Intel hardware.

The problem arises when an active bulk-in QH is removed from the async schedule. The current qTD is cancelled, and it is the last qTD on the QH's queue. At the time the QH is removed from the async list, the overlay region shows that only a fraction of the qTD has been completed (maybe 4 KB transferred out of 16 KB total).

10 ms later, four new qTDs are added to the QH and it gets added back to the async schedule. Although I don't know this for certain, I believe the second of these qTDs is stored at the same address as the one that was cancelled. That's what would naturally happen if the memory pool satisfies an allocation from the most recently freed area.

Anyway, a short time later, the controller sometimes gets stuck. The Active bit in the QH's overlay region is clear, and the Current and Next qTD pointers both point to the second qTD in the queue, which is obviously why the controller is not making any forward progress. The first qTD's Active bit is still set and its Bytes To Transfer is still 16 KB. The second qTD's Active bit is off and its Bytes To Transfer is 0. In spite of this, neither qTD's data buffer has been overwritten.

Although it's hard to tell exactly what went wrong, my guess is that after the QH was removed from the async schedule, the controller continued to process it until all 16 KB had been transferred. (This would have taken no more than 0.5 ms.) Then at some point, the QH overlay and the now-completed qTD were written back -- that would explain why the second qTD in the queue shows up as not Active and with no bytes remaining to transfer.
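As one possible debugging aid, the stuck state described above can be recognized mechanically from a snapshot of the QH and its queued qTDs. This is a hypothetical sketch, not code from the Linux EHCI driver. It assumes a debug routine has already copied the QH's Current qTD Pointer, overlay Next qTD Pointer and overlay token, plus each qTD's bus address and token, into the made-up `qh_snap` / `qtd_snap` structures in CPU byte order. Only the bit positions come from the EHCI spec: Active = bit 7, Halted = bit 6, Total Bytes to Transfer = bits 30:16, and pointer addresses in bits 31:5. The names `find_current`, `qh_looks_stuck`, `dump_token` and `demo_reported_hang` are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* qTD token bit layout from the EHCI spec (same layout in the QH overlay). */
#define TOKEN_ACTIVE     (1u << 7)                   /* Status: Active */
#define TOKEN_HALTED     (1u << 6)                   /* Status: Halted */
#define TOKEN_BYTES(tok) (((tok) >> 16) & 0x7fffu)  /* Total Bytes to Transfer */
#define PTR_ADDR(ptr)    ((ptr) & ~0x1fu)           /* drop T/reserved bits 4:0 */

/* Hypothetical snapshots, copied out of DMA memory by a debug routine
 * and already converted to CPU byte order. */
struct qtd_snap {
    uint32_t dma;    /* bus address of this qTD */
    uint32_t token;  /* qTD token */
};

struct qh_snap {
    uint32_t current_qtd;  /* QH Current qTD Pointer */
    uint32_t next_qtd;     /* overlay Next qTD Pointer */
    uint32_t token;        /* overlay token */
};

/* Index of the queued qTD that Current points at, or -1 if none matches. */
int find_current(const struct qh_snap *qh, const struct qtd_snap *q, size_t n)
{
    assert(qh != NULL && (q != NULL || n == 0));
    for (size_t i = 0; i < n; i++)
        if (PTR_ADDR(qh->current_qtd) == q[i].dma)
            return (int)i;
    return -1;
}

/* True if the QH matches the reported hang: overlay inactive, Current and
 * Next both at the same qTD, that qTD marked complete with 0 bytes left,
 * while an earlier qTD on the queue is still Active. */
bool qh_looks_stuck(const struct qh_snap *qh, const struct qtd_snap *q, size_t n)
{
    int cur = find_current(qh, q, n);

    if (cur <= 0)                 /* unknown qTD, or Current is at the head */
        return false;
    if (qh->token & TOKEN_ACTIVE)
        return false;
    if (PTR_ADDR(qh->next_qtd) != PTR_ADDR(qh->current_qtd))
        return false;
    if ((q[cur].token & TOKEN_ACTIVE) || TOKEN_BYTES(q[cur].token) != 0)
        return false;
    for (int i = 0; i < cur; i++)
        if (q[i].token & TOKEN_ACTIVE)
            return true;          /* controller is parked past a live qTD */
    return false;
}

void dump_token(const char *tag, uint32_t tok)
{
    printf("%s: active=%d halted=%d bytes=%u\n", tag,
           (tok & TOKEN_ACTIVE) != 0, (tok & TOKEN_HALTED) != 0,
           (unsigned)TOKEN_BYTES(tok));
}

/* The state from the report; addresses and the lengths of qTDs 2-3 are
 * made up for illustration. */
bool demo_reported_hang(void)
{
    const struct qtd_snap q[4] = {
        { 0x1000, (16384u << 16) | TOKEN_ACTIVE },  /* 16 KB, still Active */
        { 0x1040, 0 },                              /* done, 0 bytes left */
        { 0x1080, (16384u << 16) | TOKEN_ACTIVE },
        { 0x10c0, (16384u << 16) | TOKEN_ACTIVE },
    };
    const struct qh_snap qh = { 0x1040, 0x1040, 0 };  /* overlay inactive */

    dump_token("qTD 0", q[0].token);
    dump_token("qTD 1", q[1].token);
    return qh_looks_stuck(&qh, q, 4);
}
```

Calling demo_reported_hang() prints "qTD 0: active=1 halted=0 bytes=16384" and "qTD 1: active=0 halted=0 bytes=0", then returns true. A check like this only flags the symptom, for example from a watchdog that dumps the schedule. It says nothing about when or why the stale write-back happened.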
On the other hand, that qTD wasn't reused until 10 ms after the QH was removed from the schedule, and it was completely reinitialized before reuse. So the write-back must have occurred later than that; I have no idea why. I also don't know why the write-back of the QH's overlay region didn't overwrite the Next qTD pointer.

This is clearly a complicated problem. It's possible that we're simply dealing with defective hardware, but I tend to doubt it. It seems more likely that the problem is caused by improperly removing the active QH from the async schedule. The driver does not follow the instructions given in section 4.8.2 of the EHCI spec, which says that software should not remove active QHs.

[In practice it's not feasible to wait for an active QH to become inactive before removing it, for several reasons. For one, the QH may _never_ become inactive (if the endpoint NAKs indefinitely). For another, the procedure given in the spec (deactivate the qTDs on the queue) is racy, since the controller can perform a new overlay or write-back at any time.]

In an attempt to cope with potential problems, the Linux EHCI driver goes through _two_ Interrupt on Async Advance (IAA) cycles after taking a QH off the async list before considering it to be fully gone from the schedule. (I have observed situations where the QH overlay region was written back _after_ the first IAA interrupt.) But it seems that this isn't enough. As far as I can see, the only alternative is to stop the async schedule whenever an active QH has to be removed. But that would impose a significant penalty on any other async transfers, so I really don't want to do it.

Hans, can you describe how the BSD EHCI driver handles this issue? Any ideas for fixing this or suggestions for additional debugging would be welcome.

Alan Stern