Re: g_mass_storage bug ?

Felipe Balbi <balbi@xxxxxx> · Thu, 25 Sep 2014 16:06:59 -0500

Hi,

On Thu, Sep 25, 2014 at 05:50:59PM +0000, Paul Zimmerman wrote:
> > From: Felipe Balbi [mailto:balbi@xxxxxx]
> > Sent: Wednesday, September 24, 2014 7:41 PM
> > 
> > On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > > > data that _should_ have been sent previously.  The host was expecting
> > > > > > to receive the CSW at this point, so there was an overflow error.
> > > > > > That's what caused the host to perform a reset.
> > > > > >
> > > > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > > >
> > > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > > >  * transfer requests are still queued, or if the controller hardware
> > > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > > >
> > > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > >
> > > > alright fixed. Below you can see a combined diff which I'll still split
> > > > into patches so they can be applied properly.
> > >
> > > [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> > >
> > > Ick. That shouldn't be necessary. The Synopsys vendor driver works
> > 
> > you need to make sure that both there are no pending transfers and FIFO
> > is empty in case of IN endpoints. Databook itself says (sorry, forgot
> > the section and no access to the docs right now) that to figure out if
> > the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
> > better ideas ?
> 
> I guess I'm just afraid this may just be papering over the real cause.

Well, the API documentation, as pointed out by Alan, calls for:

380  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
381  * transfer requests are still queued, or if the controller hardware
382  * (usually a FIFO) still holds bytes that the host hasn't collected.

There's no other way to ensuring FIFO is empty with this core.

> > > fine with the mass-storage gadget (with stall=1) without doing anything
> > > like that.
> > >
> > > When did the dwc3 driver start showing this problem? Wouldn't it be
> > > better to find the change which caused this? Or has dwc3 always had
> > 
> > it's probably been like that since ever :-) I just figured that my
> > scripts always had stall=0 and ended up never testing the other way.
> > 
> > > this problem, but you never tested with stall=1 before so didn't see
> > > it?
> > 
> > yup. Note that it's not enough to checked for TRB completion because
> > there could still be data in the FIFO which the host hasn't moved yet,
> > unless it's 100% guaranteed that the core won't fire XferComplete until
> > every single bit of data has been moved by the host.
> 
> That is supposed to be 100% guaranteed. The IN XferInProgress and
> XferComplete events are only delivered after the host has ACKed the
> packet.
> 
> > But the way things are, I'd expect core to be firing transfer complete
> > as soon as all data has reached the FIFO (in case of TX).
> 
> Nope.
> 
> That's why I don't understand how this can happen for IN. AFAIK, a STALL
> is only sent in response to something the host sent in the CBW. At that
> point, there shouldn't be any IN transfers active.

you could race with the HW, right ? You just issued Start Transfer for
that IN transfer and control returns to the gadget driver which can, at
that very moment, issue a Set Stall. SW *must* make sure that there's
nothing pending yet. We haven't received XferComplete for that IN we
just started.

> BTW, about a year ago, I fixed a similar issue in our vendor driver.
> But I don't remember what the cause was, I will search the Git log to
> see if I can find the commit that fixed it.

ok, thanks.

-- 
balbi
Attachment:
signature.asc

Description: Digital signature