Re: USB protocol help (STALL and NAKs)

Arvid Brodin <arvid.brodin@xxxxxxxx> · Fri, 25 Mar 2011 22:47:33 +0100

Alan Stern wrote:
> On Thu, 24 Mar 2011, Arvid Brodin wrote:
> 
>> Hi,
>>
>> I'm working on the isp1760 driver (mostly modifying the qtd queueing to get rid
>> of BUG() calls in interrupt context).
>>
>> I have a high-speed USB-stick, that (probably due to some protocol error of
>> mine) STALLs when I transfer a 15 MB file to it (after first transferring a few
>> MB successfully with repeated 512 B OUT, NYET, PINGs etc.). After the stall is
>> received, the host immediately sends a "Device request: Clear feature:
>> Endpoint halt" which succeeds. After that, the host continues with IN
>> transactions, but the device NAKs these indefinitely and the bus hangs (and I
> 
> Which bus hangs?

Sorry about my sloppy language. :) I'm not exactly sure what happens, but the
symptom is that cp (or sync) never returns and cannot be terminated with Ctrl-C.
The USB analyzer shows one NAKed IN transfer every 9 microseconds or
thereabouts. I also get this:

INFO: task scsi_eh_0:491 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
scsi_eh_0     D 9028ba06     0   491      2 0x00000000
Stack: (0x93ed7eb8 to 0x93ed8000)
7ea0:                                                       93ed6000 9028bd3c
7ec0: 90025178 9028bd3c 93ed7f1c 93df9630 93df962c 7fffffff 00000002 ffffe000
...
Call trace:
 [<9028bd3c>] schedule_timeout+0x14/0x170
 [<9028bc12>] wait_for_common+0xaa/0x10c
 [<9028bcec>] wait_for_completion+0xc/0x14
 [<901b674e>] command_abort+0x7a/0xa0
 [<9018a4aa>] scsi_try_to_abort_cmd+0x1e/0x22
 [<9018b976>] scsi_error_handler+0xda/0x29c
 [<9003d612>] kthread+0x5c/0x76
 [<9002d894>] do_exit+0x0/0x524

no locks held by scsi_eh_0/491.
INFO: task sync:514 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sync          D 9028ba06     0   514    485 0x80000000
Stack: (0x93f0fdd4 to 0x93f10000)
fdc0:                                              93f0e000 9028ba70 90025178
fde0: 9028ba70 93f0fe00 9036370c 93f0e000 00000002 90abb3e0 9005ab54 00000000
...
Call trace:
 [<9028ba70>] io_schedule+0x30/0x78
 [<9005ab88>] sync_page+0x34/0x40
 [<9028c04a>] __wait_on_bit+0x3a/0x74
 [<9005adde>] wait_on_page_bit+0x6a/0x78
 [<9005b42a>] filemap_fdatawait_range+0x62/0x144
 [<9005bb22>] filemap_fdatawait+0x2a/0x34
 [<9009579a>] sync_inodes_sb+0x10e/0x184
 [<90098b6c>] __sync_filesystem+0x34/0x6c
 [<90098bbe>] sync_one_sb+0x1a/0x20
 [<9008034a>] iterate_supers+0x52/0xa8
 [<90098b00>] sync_filesystems+0x14/0x20
 [<90098bd8>] sys_sync+0x14/0x30
 [<9001d132>] syscall_return+0x0/0x12

1 lock held by sync/514:
 #0:  (&type->s_umount_key#22){.+.+..}, at: [<9008033e>] iterate_supers+0x46/0xa8

This is repeated every 120 seconds. If I unplug the device, I get a lot of
block device errors and my prompt back.

> 
> The block layer is supposed to time out after 30 seconds, causing the 
> IN transfers to be unlinked and the USB stick to be reset.  Maybe your 
> bus problems prevent this from happening.
> 
>> When I try this stick on my desktop EHCI, I never get the STALL, but lots of
>> NAKs on IN bulk packets. The stick works fine here.
>>
>> a) Should there be some kind of limit/quench on bulk IN NAKs somehow, so that a
>>    (malicious/erroneous) device cannot hang the USB subsystem like this? The
>>    EHCI driver loads the NakCnt field with 4 (EHCI_TUNE_RL_HS), but when 4 NAKs
>>    have been detected and the HC returns the packet, I believe it's just reset
>>    and enqueued again?
> 
> The host controller is supposed to reload the NakCnt field only when
> the async schedule is restarted.  If the NakCnt fields in all the
> active endpoints remain 0 for two passes through the async schedule,
> the controller is supposed to detect that the schedule is empty and go
> to the async sleeping state, after which it restarts the async schedule
> about 10 us later.  See section 4.9 and 4.8.3 - 4.8.6 in the EHCI spec.
> 

I've read the EHCI spec (or large parts anyhow) but I'm having trouble
translating it to the ISP176x (which of course aren't EHCI controllers).
If I understand correctly, the EHCI has the complete schedule (queue heads
and all) in HW memory. That's not the case with the ISP176x. Instead, the
queue is managed by software, and only when a qtd is actually due for
transfer is its "head" ("ptd", Proprietary Transfer Descriptor) written
to the HW. The hardware then transfers the qtd and signals the SW with an
interrupt when done. If the device NAKs, the ISP176x retries until its
NakCnt is zero. It's then up to the SW to reload the NakCnt from RL and
re-schedule the ptd. It's also possible to set NakCnt and RL to zero to
make the controller retry forever without signalling (apparently used for
periodic transfers).
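
To make the description above concrete, here is a small user-space model of
the NakCnt/RL handshake as I understand it. It is only a sketch: the struct
and function names are made up for illustration and are not taken from the
isp1760 driver or the datasheet, and the real ptd has many more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two ptd fields discussed above. */
struct ptd_model {
	uint8_t rl;		/* reload value for nak_cnt */
	uint8_t nak_cnt;	/* NAKs left before the HW hands the ptd back */
};

/*
 * HW side: the device NAKed once.
 * Returns true if the HW now hands the ptd back to the SW (interrupt).
 */
static bool hw_on_nak(struct ptd_model *p)
{
	assert(p != NULL);
	if (p->rl == 0 && p->nak_cnt == 0)
		return false;	/* retry forever, never signal the SW */
	if (p->nak_cnt > 0)
		p->nak_cnt--;
	return p->nak_cnt == 0;
}

/* SW side: reload NakCnt from RL before writing the ptd to the HW again. */
static void sw_requeue(struct ptd_model *p)
{
	assert(p != NULL);
	p->nak_cnt = p->rl;
}

/* Count how often the SW has to re-queue while the device sends NAKs. */
static unsigned int count_requeues(struct ptd_model *p, unsigned int naks)
{
	unsigned int requeues = 0;

	for (unsigned int i = 0; i < naks; i++) {
		if (hw_on_nak(p)) {
			sw_requeue(p);
			requeues++;
		}
	}
	return requeues;
}
```

With RL = NakCnt = 4 the SW is involved once per four NAKs; with both set to
zero the SW is never involved at all, which is the "retry forever" case.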

The problem arises when the device NAKs indefinitely, so that the (my)
software never returns the urb to the usb core. I simply don't know how to
decide when to give up (and of course, if the HW retries forever, I have no
chance to give up)! There's no asynchronous schedule to restart, and even
if there were, section 4.8.4.1 of the EHCI spec has no "way out" as far as
I can see - it says to reload the NakCnt fields and try again on schedule
restart.
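
Given Alan's remark about the block layer timeout, one possible way out
might be that the driver itself never gives up, but checks on every NAK
hand-back whether the urb has been unlinked in the meantime (the
command_abort in the trace above appears to be waiting for exactly that).
A rough sketch, assuming RL is non-zero so the HW does hand the ptd back;
the struct, its fields and the function name are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-transfer state kept by the software queue. */
struct xfer_model {
	uint8_t rl;		/* reload value for nak_cnt */
	uint8_t nak_cnt;	/* 0 when the HW hands the ptd back */
	int unlink_status;	/* 0, or the negative errno requested by the
				 * (assumed) dequeue path, e.g. -ECONNRESET */
};

/*
 * Called when the HW returns a ptd because NakCnt reached zero.
 * Returns 0 if the transfer was re-queued, or the negative errno the urb
 * should be completed with.
 */
static int on_nak_exhausted(struct xfer_model *x)
{
	assert(x != NULL);
	assert(x->nak_cnt == 0);

	if (x->unlink_status != 0)
		return x->unlink_status;	/* stop retrying, give back */

	x->nak_cnt = x->rl;			/* reload and try again */
	return 0;
}
```

This only helps if the ptd is not set up for "retry forever", since in that
case the SW never gets a chance to look at the unlink request.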

>> b) My host controller driver returns urb->status = -EPIPE (-32) to the usb core
>>    after receiving the STALL packet. I'm guessing this is correct since usb core
>>    then sends the Clear Endpoint Halt command afterwards. Am I right in this?
> 
> Yes.
> 
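
For reference, the mapping in question looks roughly like the sketch below.
Only STALL -> -EPIPE is what my driver actually does as described above; the
enum and the other two mappings are my assumptions, based on the usual
Linux USB error code conventions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical summary of what the controller reported for a transfer. */
enum xact_result { XACT_OK, XACT_STALL, XACT_BABBLE, XACT_ERROR };

/* Translate a transaction result into the status stored in urb->status. */
static int xact_to_urb_status(enum xact_result r)
{
	switch (r) {
	case XACT_OK:
		return 0;
	case XACT_STALL:
		return -EPIPE;		/* endpoint halted */
	case XACT_BABBLE:
		return -EOVERFLOW;	/* assumed: device sent too much data */
	case XACT_ERROR:
		return -EPROTO;		/* assumed: generic transaction error */
	}
	assert(!"unknown transaction result");
	return -EPROTO;
}
```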
>> c) Clearly, continued bulk IN requests is not the right thing to do after this.
> 
> Why not?  It often _is_ the right thing to do.
> 
>>    Any ideas why this happens?
> 
> What what happens?  Why does the device send a STALL?  I'd have to see 
> a usbmon or bus analyzer trace to answer that.

I'll try to get a useable usbmon trace.

>> I'm pretty much out of ideas on this now.
>>    Alternatively, I've done something wrong to cause the STALL in the first
>>    place, which puts the host/device in some very unfortunate state - what could
>>    I have done to cause this? (Some problem with ping or toggle state?)
> 
> No.  Without more information, there's no way to know the cause.
> 
> Alan Stern
> 

Thanks,
Arvid Brodin
Enea Services Stockholm AB