Felipe Balbi wrote: > > Hi, > > Thinh Nguyen <Thinh.Nguyen@xxxxxxxxxxxx> writes: >> Alan Stern wrote: >>> On Wed, May 05, 2021 at 03:57:03PM +0300, Felipe Balbi wrote: >>>> >>>> Hi, >>>> >>>> Wesley Cheng <wcheng@xxxxxxxxxxxxxx> writes: >>>>> On 5/3/2021 7:20 PM, Thinh Nguyen wrote: >>>>>> Hi, >>>>>> >>>>>> Wesley Cheng wrote: >>>>>>> If an error is received when issuing a start or update transfer >>>>>>> command, the error handler will stop all active requests (including >>>>>>> the current USB request), and call dwc3_gadget_giveback() to notify >>>>>>> function drivers of the requests which have been stopped. Avoid >>>>>>> having to cancel the current request which is trying to be queued, as >>>>>>> the function driver will handle the EP queue error accordingly. >>>>>>> Simply unmap the request as it was done before, and allow previously >>>>>>> started transfers to be cleaned up. >>>>>>> >>>>> >>>>> Hi Thinh, >>>>> >>>>>> >>>>>> It looks like you're still letting dwc3 stopping and cancelling all the >>>>>> active requests instead letting the function driver doing the dequeue. >>>>>> >>>>> >>>>> Yeah, main issue isn't due to the function driver doing dequeue, but >>>>> having cleanup (ie USB request free) if there is an error during >>>>> usb_ep_queue(). >>>>> >>>>> The function driver in question at the moment is the f_fs driver in AIO >>>>> mode. When async IO is enabled in the FFS driver, every time it queues >>>>> a packet, it will allocate a io_data struct beforehand. If the >>>>> usb_ep_queue() fails it will free this io_data memory. Problem is that, >>>>> since the DWC3 gadget calls the completion with -ECONNRESET, the FFS >>>>> driver will also schedule a work item (within io_data struct) to handle >>>>> the completion. So you end up with a flow like below >>>>> >>>>> allocate io_data (ffs) >>>>> --> usb_ep_queue() >>>>> --> __dwc3_gadget_kick_transfer() >>>>> --> dwc3_send_gadget_ep_cmd(EINVAL) >>>>> --> dwc3_gadget_ep_cleanup_cancelled_requests() >>>>> --> dwc3_gadget_giveback(ECONNRESET) >>>>> ffs completion callback >>>>> queue work item within io_data >>>>> --> usb_ep_queue returns EINVAL >>>>> ffs frees io_data >>>>> ... >>>>> >>>>> work scheduled >>>>> --> NULL pointer/memory fault as io_data is freed >>> >>> Am I reading this correctly? It looks like usb_ep_queue() returns a >>> -EINVAL error, but then the completion callback gets invoked with a >>> status of -ECONNRESET. >>> >>>> I have some vague memory of discussing (something like) this with Alan >>>> Stern long ago and the conclusion was that the gadget driver should >>>> handle cases such as this. >>> >>> Indeed, this predates the creation of the Gadget API; the same design >>> principle applies to the host-side API. It's a very simple idea: >>> >>> If an URB or usb_request submission succeeds, it is guaranteed >>> that the completion routine will be called when the transfer is >>> finished, cancelled, aborted, or whatever (note that this may >>> happen before the submission call returns). >>> >>> If an URB or usb_request submission fails, it is guaranteed that >>> the completion routine will not be called. >>> >>> So if dwc3 behaves as described above (usb_ep_queue() fails _and_ the >>> completion handler is called), this is a bug. >>> >>> Alan Stern >>> >> >> >> Hi Alan, >> >> Yes, it's a bug, no question about that. The concern here is how should >> we fix it. >> >> In dwc3, when the usb_ep_queue() is called, it does 3 main things: >> 1) Put the request in a pending list to be processed >> 2) Process any started but partially processed request and any >> outstanding request from the pending list and map them to TRBs >> 3) Send a command to the controller telling it to cache and process >> these new TRBs >> >> Currently, if step 3) fails, then usb_ep_queue() returns an error status >> and we stop the controller from processing TRBs and return any request >> related to those outstanding TRBs. This is a bug. >> >> My suggestion here is don't immediately return an error code and let the >> completion interrupt inform of a transfer failure. The reasons are: >> >> a) Step 3) happened when the request is already (arguably) queued. >> b) If the error from step 3) is due to command timed out, the controller >> may have partially cached/processed some of these TRBs. We can't simply >> give back the request immediately without stopping the transfer and fail >> all the in-flight request. >> c) It adds unnecessary complexity to the driver and there are a few >> pitfalls that we have to watch out for when handling giving back the >> request. >> d) Except the case where the device is disconnected or that the request >> is already in-flight, step 1) and 2) are always done after >> usb_ep_queue(). The controller driver already owns these requests and >> can be considered "queued". > > Oh, I see. Indeed this sounds like the best minimal fix for the -rc and > backporting to stables. We may still want to get back to this at some > point and, potentially, split kick_transfer() into two parts that can > make assumptions about the context where they can be called. Yeah, the driver may need some reorganization to keep the logic clear. > > Also, we may want to move the request to the pending list only if the > command succeeds. There should be no race condition as kick_transfer is > always called with interrupts disabled. > Hm... I don't see how this is better. If I understand correctly, that requires that there will be a command issued for every call to usb_ep_queue(). In other word, with this requirement, we're enforcing whether we can give the request to the controller and drop it otherwise. So, if we run out of TRBs, usb_ep_queue() will fail? Thanks, Thinh