Re: EHCI and short packets [was: Re: [Libusb-devel] USB 3.0]

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Tue, 27 Jul 2010 11:00:07 -0400 (EDT)

On Tue, 27 Jul 2010, Hans Petter Selasky wrote:

> > How often do the Intel controllers follow the wrong pointer?
> 
> The issue is 100% reproducible, and appeared to me like some kind of hardware 
> bug. I checked my e-mail archive today, but could not find the e-mails where I 
> debugged this issue. Sorry about that. If you want to find out more you will 
> have to set up a special test to check this out on various hardware yourself!
> 
> Instructions:
> 
> 1) Set up a TD chain like this:
> 
> QTD(0xc2f8e480) at 0x0178e480:
> next=0x0178e400<> altnext=0x00000001<T>
> status=0x40000000: toggle=0 bytes=0x4000 ioc=0 c_page=0x0
> cerr=0 pid=0 stat=ACTIVE
> buffer[0]=0x0de1e000
> buffer[1]=0x0de1f000
> buffer[2]=0x0de20000
> buffer[3]=0x0de21000
> buffer[4]=0x0de21000
> buffer_hi[0]=0x00000000
> buffer_hi[1]=0x00000000
> buffer_hi[2]=0x00000000
> buffer_hi[3]=0x00000000
> buffer_hi[4]=0x00000000

Without checking in detail, it looks like this wants to transfer 16 KB
of data.  Since the altnext field is set to 1, a short packet will
cause the controller to follow the "next" pointer.
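[Editorial sketch, not part of the original message.] The reading above can
be checked by decoding the dump by hand. The snippet below is a minimal
sketch that assumes the qTD token and link-pointer bit layout from the EHCI
specification; the function names are hypothetical and exist only for
illustration.

```python
# Sketch only: field positions follow the qTD token layout in the EHCI
# specification; the helper names are made up for this illustration.

def decode_qtd_token(token):
    """Split a 32-bit qTD token (the "status" DWORD in the dump) into fields."""
    return {
        "toggle": (token >> 31) & 0x1,     # data toggle, bit 31
        "bytes": (token >> 16) & 0x7FFF,   # total bytes to transfer, bits 30:16
        "ioc": (token >> 15) & 0x1,        # interrupt on complete, bit 15
        "c_page": (token >> 12) & 0x7,     # current page index, bits 14:12
        "cerr": (token >> 10) & 0x3,       # error counter, bits 11:10
        "pid": (token >> 8) & 0x3,         # PID code, bits 9:8
        "active": bool(token & 0x80),      # Active status bit, bit 7
    }

def link_is_valid(pointer):
    """Bit 0 of a qTD link pointer is the T (terminate) bit; T=1 means invalid."""
    return (pointer & 0x1) == 0

def link_after_short_packet(next_ptr, altnext_ptr):
    """On a short packet, use altnext if it is valid, otherwise fall back to next."""
    return altnext_ptr if link_is_valid(altnext_ptr) else next_ptr

fields = decode_qtd_token(0x40000080)      # token of the second qTD in the dump
print(hex(fields["bytes"]), fields["active"])
print(hex(link_after_short_packet(0x0178e400, 0x00000001)))
```

With the values from the first qTD (next=0x0178e400, altnext=0x00000001) the
T bit in altnext is set, so the fallback is the "next" pointer, and
bytes=0x4000 is 16384 bytes, i.e. 16 KB.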

> QTD(0xc2f8e400) at 0x0178e400:
> next=0x0178e380<> altnext=0x00000001<T>
> status=0x40000080: toggle=0 bytes=0x4000 ioc=0 c_page=0x0
> cerr=0 pid=0 stat=ACTIVE
> buffer[0]=0x0de22000
> buffer[1]=0x0de23000
> buffer[2]=0x0de24000
> buffer[3]=0x0de25000
> buffer[4]=0x0de25000
> buffer_hi[0]=0x00000000
> buffer_hi[1]=0x00000000
> buffer_hi[2]=0x00000000
> buffer_hi[3]=0x00000000
> buffer_hi[4]=0x00000000

This is much the same as the previous qTD.

> 2) Send from the USB gadget the following byte sequence in a HS BULK endpoint: 
> 512 + ZLP or 1024 + ZLP or 2048 + ZLP or 4096 + ZLP. Try also to replace ZLP 
> with a short packet. One of these cases should trigger the bug, that the EHCI 
> continues working on the next TD, though filling some crap into the bytes bits 
> of the status DWORD.

When you say "the next TD", do you mean the second TD above (at
0x0178e400) or rather the TD that follows it (at 0x0178e380)?

In each case, I would expect the controller to store N bytes in the
first 16-KB buffer (where N is 512, 1024, 2048, or 4096 respectively)  
and 0 bytes in the second buffer, and then to move on to the following
TD.  If you had set altnext to some other value, then the controller
would behave differently.

> NOTE: Usually the software will see an interrupt and check the TDs, and 
> then it will see a short packet and remove the TD-chain. If the software is 
> quick enough, no bug will trigger. If the software/interrupt handler gets 
> delayed, there is a chance that the EHCI can receive data into the next TD 
> pointed to by the next field.

But that's supposed to happen!
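[Editorial sketch, not part of the original message.] A toy model may help
show why data landing in the following TD is the expected outcome when
altnext has its T bit set. It assumes a 512-byte max packet size (HS bulk),
qTD capacities that are multiples of that size, and that a short packet or a
filled qTD both advance along "next". The function name and the packet sizes
in the example are hypothetical, not taken from any real trace.

```python
def receive_packets(capacities, packet_sizes, max_packet=512):
    """Return the number of bytes each qTD in the chain ends up holding.

    capacities   -- byte capacity of each qTD, in chain order
    packet_sizes -- sizes of the IN packets the device sends, in order
    Assumes capacities are multiples of max_packet, so a packet never
    straddles two qTDs.
    """
    received = [0] * len(capacities)
    current = 0
    for size in packet_sizes:
        if current >= len(capacities):
            break  # chain exhausted; remaining packets are not modeled
        received[current] += size
        short = size < max_packet              # includes a zero-length packet
        full = received[current] >= capacities[current]
        if short or full:
            # altnext has T=1 in the dump, so both cases follow "next"
            current += 1
    return received

# Two 16-KB qTDs; the device ends its data with a 100-byte short packet and
# then sends a separate 13-byte packet (for example a mass-storage CSW).
print(receive_packets([0x4000, 0x4000], [512, 512, 100, 13]))
```

In this model the short packet retires the first qTD and the later 13-byte
packet is stored in the second one, unless software unlinks the chain first.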

> 3) My conclusion: Avoid receiving more than 16K on any BULK IN endpoint per 
> EHCI IRQ. Chaining on BULK OUT endpoints does not have this kind of bug.

Are you sure this is really a bug?  It doesn't look that way to me.  

And if it is a bug, why do you limit yourself to 16 KB per interrupt?  
Wouldn't it make more sense to set the limit to one qTD per interrupt?  
(Note that a qTD may want to transfer less than 16 KB.)
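[Editorial sketch, not part of the original message.] The amount of data a
single qTD can cover is not fixed at 16 KB. A qTD has five buffer pointers,
each addressing one 4-KB page, and only the first may start at an offset
within its page. The helper below is a hypothetical illustration of that
arithmetic, not code from any driver.

```python
EHCI_PAGE_SIZE = 4096   # each qTD buffer pointer addresses one 4-KB page
EHCI_QTD_PAGES = 5      # a qTD carries five buffer pointers

def qtd_max_bytes(buffer0_addr):
    """Upper bound on the bytes one qTD can address, given where buffer[0] starts."""
    offset = buffer0_addr & (EHCI_PAGE_SIZE - 1)   # offset into the first page
    return EHCI_QTD_PAGES * EHCI_PAGE_SIZE - offset

print(qtd_max_bytes(0x0de1e000))   # page-aligned start, as in the dump
print(qtd_max_bytes(0x0de1e200))   # hypothetical start 512 bytes into a page
```

A driver is free to fill a qTD with less than this bound, which is why a
limit expressed in qTDs and a limit expressed in bytes are not the same
thing.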

> The issue was found on INTEL controllers at least. I don't have the exact 
> version. The original test was Mass Storage, where the MSC (SCSI + BOT-
> protocol) device was short-terminating the data stage and then the EHCI 
> sometimes got the CSW (command status wrapper block) as well into the remaining 
> part of the TD chain, and then I got a timeout when trying to read the TD a 
> second time.
> 
> I don't have any more information than this. Maybe something to investigate 
> for you Linux guys?

Certainly, if there really does turn out to be a problem.

Alan Stern
