On 10/6/23 15:53, Michael Grzeschik wrote: > On Fri, Oct 06, 2023 at 10:00:11AM -0700, Avichal Rakesh wrote: >> >> >> On 10/5/23 15:05, Michael Grzeschik wrote: >>> Hi Avichal, >>> >>> On Thu, Oct 05, 2023 at 11:30:32AM -0700, Avichal Rakesh wrote: >>>> On 10/5/23 03:14, Michael Grzeschik wrote: >>>>> On Thu, Oct 05, 2023 at 11:23:27AM +0300, Laurent Pinchart wrote: >>>>>> On Tue, Oct 03, 2023 at 01:09:06PM +0200, Michael Grzeschik wrote: >>>>>>> On Sat, Sep 30, 2023 at 11:48:18AM -0700, Avichal Rakesh wrote: >>>>>>> > We have been seeing two main stability issues that uvc gadget driver >>>>>>> > runs into when stopping streams: >>>>>>> > 1. Attempting to queue usb_requests to a disabled usb_ep >>>>>>> > 2. use-after-free issue for inflight usb_requests >>>>>>> > >>>>>>> > The three patches below fix the two issues above. Patch 1/3 fixes the >>>>>>> > first issue, and Patch 2/3 and 3/3 fix the second issue. >>>>>>> > >>>>>>> > Avichal Rakesh (3): >>>>>>> > usb: gadget: uvc: prevent use of disabled endpoint >>>>>>> > usb: gadget: uvc: Allocate uvc_requests one at a time >>>>>>> > usb: gadget: uvc: Fix use-after-free for inflight usb_requests >>>>>>> > >>>>>>> > drivers/usb/gadget/function/f_uvc.c | 11 +- >>>>>>> > drivers/usb/gadget/function/f_uvc.h | 2 +- >>>>>>> > drivers/usb/gadget/function/uvc.h | 6 +- >>>>>>> > drivers/usb/gadget/function/uvc_v4l2.c | 21 ++- >>>>>>> > drivers/usb/gadget/function/uvc_video.c | 189 +++++++++++++++++------- >>>>>>> > 5 files changed, 164 insertions(+), 65 deletions(-) >>>>>>> >>>>>>> These patches are not applying on gregkh/usb-testing since >>>>>>> Greg did take my patches first. I have already rebased them. >>>>>> >>>>>> I think they got merged too soon :-( We could fix things on top, but >>>>>> there's very little time to do so for v6.7. >>>>> >>>>> Agreed. I was jumping from one workaround to another one, since this >>>>> is not easy to fix in a proper way. And still after this long discussion >>>>> with Avichal I don't think we are there yet. >>>>> >>>>> >>>>> So far the first two patches from Avichal look legit. But the overall >>>>> Use-After-Free fix is yet to be done properly. >>>>> >>>>> The "abondoned" method he suggested is really bad to follow and will >>>>> add too much complexity and will be hard to debug. >>>>> >>>>> IMHO it should be possible to introduce two cleanup pathes. >>>>> >>>>> One path would be in the uvc_cleanup_requests that will cleanup the >>>>> requests that are actually not used in the controller and are registered >>>>> in the req_free list. >>>>> >>>>> The second path would be the complete functions that are being run >>>>> from the controller and will ensure that the cleanup will really free >>>>> the requests from the controller after they were consumed. >>>>> >>>>> What do you think? >>>> >>>> I am not sure I follow. Patch 3/3 does exactly what you say here. >>> >>> Yes, it was just to summ up what the latest state of the idea was, >>> so Laurent does not read the whole thread in detail. Sorry for not >>> being clear enough about that. >> >> Whoops! Sorry about the misunderstanding! >> >>> >>>> There are two cleanup paths: >>>> 1. uvcg_video_disable cleans up only the requests in req_free, and >>>> 2. complete handler cleans up the in-flight requests. >>>> >>>> The "abandoned" flag is simply to let the completion handler know >>>> which requests to clean up and which ones to re-queue back to >>>> the gadget driver. >>> >>> What I don't get is, why in the case of shutdown there needs to >>> be something re-queued back to the gadget driver. There should not >>> need to be any sort of barrier flag for the requests. Just the >>> complete handler running past a barrier where it knows that the >>> whole device is stopped. So every call on complete should then clean >>> that exact request it is touching currently. >>> >>> I don't know where the extra complexity comes from. >> >> A lot of this complexity comes from assuming a back to back >> STREAMOFF -> STREAMON sequence is possible where the gadget driver >> doesn't have the time to clean up all in-flight usb_requests. >> However, looking through the usb gadget APIs again, and it >> looks like usb_ep_disable enforces that all requests will >> be sent back to the gadget driver before it returns. > > Great! Uhh...apologies, I will have to take this back. I've been trying to use uvc->state as the condition for when completion handler should clean up usb_requests, and I cannot figure out a way to do so cleanly. The fundamental problem with using uvc->state is that it is not protected by any locks. So there is no real way to assert that its value has not changed between reading uvc->state and acting on it. Naively we can write something like the following in the completion handler: void uvc_video_complete(...) { if (uvc->state != UVC_EVENT_STREAMING) { usb_ep_free_request(....); } else { // handle usb_request normally } } But without any locks, there are no guarantees that uvc->state didn't mutate immediately after the if condition was checked, and the complete handler is handling a request that it should've freed instead or vice-versa. This argument would hold for any logic we guard with uvc->state, making uvc->state effectively useless as a check for freeing memory. We can work around it by either 1. Locking uvc->state with some driver level lock to ensure that we can trust the value of uvc->state at least for a little while, or 2. Using some other barrier condition that is protected by another lock If we go with (1), we'd have to add a lock around every and every write to uvc->state, which isn't terrible, but would require more testing to ensure that it doesn't create any new deadlocks. For (2), with the realization that usb_ep_disable flushes all requests, we can add a barrier in uvc_video, protected by req_lock. That should simplify the logic a little bit and will hopefully be easier to reason about. I could of course be missing a simpler solution here, and am happy to be wrong. So please let me know if you have any other ideas on how to guarantee such a check. > >> So you're right: >> With Patch 1/3 in place, I think we can just guard on uvc->state >> alone, because control requests are blocked until usb_ep_disable >> is finished anyway. I'll upload v4 with the "is_abandoned" >> flag removed and the checks simplified once I've verified the >> fix locally. >> >> That should also remove any bookkeeping issues that may have >> triggered the stack below. > > I am currious if we should handle -ECONNRESET and -ESHUTDOWN in more > detail in the completion handler and make sure that the request will not > be added into the req_free list from there. > > Will review your v4 then. Appreciate the reviews, thank you! > >>>> The other "complications" are around making sure we can trust >>>> the values in an inherently racey situation. The reasoning >>>> can admittedly be difficult to follow at a glance, which incidentally >>>> is why I went with a simple to prove timed wait in the past >>>> (https://lore.kernel.org/20230912041910.726442-3-arakesh@xxxxxxxxxx). >>>> >>>> I am not suggesting we go back to a timed wait, but please do look >>>> at the patch and let me know which parts don't make sense, or are >>>> difficult to understand. We can add more documentation about our >>>> assumptions there, or if you have a way to do this that you >>>> think is simpler to reason about, then please let me know and I'll >>>> be more than happy to use that! >>> >>> I really try to spin my head around the idea of the is_abondoned flag >>> you are using. Unfortunatly for now I am out to debug the issues I see >>> with your series. >>> >>> So I did try these patches you send. Yes the deadlock error is gone with >>> v3. But the linked list is still running into cases where >>> dwc3_gadget_giveback(complete) is touching requests that are already >>> freed. >>> >>> [ 61.408715] ------------[ cut here ]------------ >>> [ 61.413897] kernel BUG at lib/list_debug.c:56! >>> ... >>> [ 61.590762] Call trace: >>> [ 61.596890] __list_del_entry_valid+0xb8/0xe8 >>> [ 61.603408] dwc3_gadget_giveback+0x3c/0x1b0 >>> [ 61.607594] dwc3_remove_requests.part.0+0xcc/0x100 >>> [ 61.612948] __dwc3_gadget_ep_disable+0xbc/0x1b8 >>> [ 61.621019] dwc3_gadget_ep_disable+0x48/0x100 >>> [ 61.627925] usb_ep_disable+0x3c/0x138 >>> [ 61.638230] uvc_function_setup_continue+0x3c/0x60 >>> [ 61.645040] uvc_v4l2_streamoff+0x5c/0x80 >>> [ 61.659812] v4l_streamoff+0x40/0x60 >>> [ 61.668950] __video_do_ioctl+0x344/0x420 >>> [ 61.679548] video_usercopy+0x1d0/0x788 >>> [ 61.685677] video_ioctl2+0x40/0x70 >>> [ 61.697439] v4l2_ioctl+0x68/0xa0 >>> [ 61.709200] __arm64_sys_ioctl+0x304/0xda0 >>> [ 61.720768] invoke_syscall.constprop.0+0x70/0x130 Just to confirm: this stack was with all 3 patches applied? Want to make sure this won't happen with v4. Regards, Avi.