Wesley Cheng wrote: > > > On 5/3/2021 8:12 PM, Thinh Nguyen wrote: >> Hi Wesley, >> >> Wesley Cheng wrote: >>> >>> >>> On 5/3/2021 7:20 PM, Thinh Nguyen wrote: >>>> Hi, >>>> >>>> Wesley Cheng wrote: >>>>> If an error is received when issuing a start or update transfer >>>>> command, the error handler will stop all active requests (including >>>>> the current USB request), and call dwc3_gadget_giveback() to notify >>>>> function drivers of the requests which have been stopped. Avoid >>>>> having to cancel the current request which is trying to be queued, as >>>>> the function driver will handle the EP queue error accordingly. >>>>> Simply unmap the request as it was done before, and allow previously >>>>> started transfers to be cleaned up. >>>>> >>> >>> Hi Thinh, >>> >>>> >>>> It looks like you're still letting dwc3 stopping and cancelling all the >>>> active requests instead letting the function driver doing the dequeue. >>>> >>> >>> Yeah, main issue isn't due to the function driver doing dequeue, but >>> having cleanup (ie USB request free) if there is an error during >>> usb_ep_queue(). >>> >>> The function driver in question at the moment is the f_fs driver in AIO >>> mode. When async IO is enabled in the FFS driver, every time it queues >>> a packet, it will allocate a io_data struct beforehand. If the >>> usb_ep_queue() fails it will free this io_data memory. Problem is that, >>> since the DWC3 gadget calls the completion with -ECONNRESET, the FFS >>> driver will also schedule a work item (within io_data struct) to handle >>> the completion. So you end up with a flow like below >>> >>> allocate io_data (ffs) >>> --> usb_ep_queue() >>> --> __dwc3_gadget_kick_transfer() >>> --> dwc3_send_gadget_ep_cmd(EINVAL) >>> --> dwc3_gadget_ep_cleanup_cancelled_requests() >>> --> dwc3_gadget_giveback(ECONNRESET) >>> ffs completion callback >>> queue work item within io_data >>> --> usb_ep_queue returns EINVAL >>> ffs frees io_data >>> ... >>> >>> work scheduled >>> --> NULL pointer/memory fault as io_data is freed > > Hi Thinh, > >> >> sounds like a race issue. >> > > It'll always happen if usb_ep_queue() fails with an error. Sorry for not > clarifying, but the "..." represents executing in a different context > :). Anything above the "..." is in the same context. >>> >>>> BTW, what kinds of command and error do you see in your setup and for >>>> what type endpoint? I'm thinking of letting the function driver to >>>> dequeue the requests instead of letting dwc3 automatically >>>> ending/cancelling the queued requests. However, it's a bit tricky to do >>>> that if the error is -ETIMEDOUT since we're not sure if the controller >>>> had already cached the TRBs. >>>> >>> >>> Happens on bulk EPs so far, but I think it wouldn't matter as long as >>> its over the FFS interface. (and using async IO transfers) >> >> Do you know which command and error code? It's strange if >> UPDATE_TRANSFER command failed. >> > > Sorry for missing that part of the question. It is a no xfer resource > error on a start transfer command. So far this happens on low system > memory test cases, so there may be some sequences that were missed, > which led to this particular command error. > > Thanks > Wesley Cheng No xfer resource usually means that the driver attempted to send START_TRANSFER without waiting for END_TRANSFER command to complete. This may be a dwc3 driver issue. Did you check this? Thanks, Thinh > >>> >>>> This seems to add more complexity and I don't have a good solution to >>>> it. Since you're already cancelling all the active request anyway, what >>>> do you think of always letting dwc3_gadget_ep_queue() to go through with >>>> success, but report failure through request completion? >>>> >>> >>> We do have something similar as well downstream (returning success >>> always on dwc3_gadget_ep_queue()) and its been working for us also. >>> Problem is we don't test the ISOC path much, so this is the only type of >>> EP that might come into question... >>> >> >> It should be similiar with isoc. I can't think of a potential issue yet. >> >>> Coming up with a way to address the concerns you brought up was a bit >>> difficult as there were scenarios we needed to consider. next_request() >>> doesn't always have to be the request being queued (even if ep queue >>> triggered it). There was no easy way to determine if kick transfer was >>> due to ep queue, but even if there was, we'd need to remember the >>> previous point as well. >>> >> >> Yeah, there are a few pitfalls. I don't have a good solution to it if we >> want to return failure immediately and let the function driver handle >> the dequeue (if it wants to). >> >> Thanks, >> Thinh >> >