Hej, On 2022-11-26 10:10:48 +0100, Sven Peter wrote: > Hi Thinh, > > I've run into a race between dwc3_set_mode and __dwc3_set_mode accessing > dwc->desired_dr_role: It's an incredibly tight race that's hard to hit since > role switch events need to come in just after each other. It's reproducible > with an Apple M1 connected to a device that very quickly switches > roles when shutting down (which happens to be another M1). This sometimes > triggers a device->host->device switch sequence which is fast enough to hit this > race: > > CPU A > dwc3_set_mode(DWC3_GCTL_PRTCAP_HOST) // first role switch event > spin_lock_irqsave(&dwc->lock, flags); > dwc->desired_dr_role = mode; // DWC3_GCTL_PRTCAP_HOST > spin_unlock_irqrestore(&dwc->lock, flags); > queue_work(system_freezable_wq, &dwc->drd_work); // true, schedules __dwc3_set_mode > > CPU B > __dwc3_set_mode > // .... > spin_lock_irqsave(&dwc->lock, flags); > dwc3_set_prtcap(dwc, dwc->desired_dr_role); // DWC3_GCTL_PRTCAP_HOST > spin_unlock_irqrestore(&dwc->lock, flags); > > CPU A > dwc3_set_mode(DWC3_GCTL_PRTCAP_DEVICE) // second role switch event > spin_lock_irqsave(&dwc->lock, flags); > dwc->desired_dr_role = mode; // DWC3_GCTL_PRTCAP_DEVICE > spin_unlock_irqrestore(&dwc->lock, flags); > > CPU B (continues running __dwc3_set_mode) > switch (dwc->desired_dr_role) { // DWC3_GCTL_PRTCAP_DEVICE > case DWC3_GCTL_PRTCAP_HOST: > // not executed since desired_dr_role is DWC3_GCTL_PRTCAP_DEVICE now > break; > > CPU A (continues running dwc3_set_mode) > queue_work(system_freezable_wq, &dwc->drd_work); // __dwc3_set_mode is still running > > CPU B (continues running __dwc3_set_mode) > case DWC3_GCTL_PRTCAP_DEVICE: > // .... > ret = dwc3_gadget_init(dwc); > > > We then have DWC3_GCTL.DWC3_GCTL_PRTCAPDIR = DWC3_GCTL_PRTCAP_HOST and > dwc->current_dr_role = DWC3_GCTL_PRTCAP_HOST but initialized the controller in > device mode when calling dwc3_gadget_init. This obviously doesn't work and is > not easy to recover from. > > Unfortunately we can't just lock dwc3->mutex since dwc3_set_mode may be called > from an extcon interrupt in atomic context (which is probably the reason for > deferring the mode switch to a workqueue). > > Otherwise I can only think of creating a single-threaded work queue and > allocating a new work_struct together with desired_role inside dwc3_set_mode > and putting that onto the queue, i.e. something like: > > struct dwc3_set_mode_work { > struct dwc3 *dwc; > u32 desired_dr_role; > struct work_struct work; > }; > > void dwc3_set_mode(struct dwc3 *dwc, u32 mode) > { > struct dwc3_set_mode_work *work = kzalloc(sizeof(*work), GFP_ATOMIC); > > INIT_WORK(&work->work, __dwc3_set_mode); > work->dwc = dwc; > work->desired_dr_role = mode; > queue_work(dwc->drd_work_queue, &work->work); > } > > That way all role switch events will be executed in order and we can't race > desired_dr_role anymore. > I'm not very happy with that solution but can't think of anything else. > > Any thoughts or ideas? I can otherwise prepare a patch. your alternate solution to operate on a stack copy of dwc->desired_dr_role in __dwc3_set_mode() avoids this race and seems to work fine. Tested with 2 Apple silicon devices. Trivial patch left to the sender since I did just the trivial typing and testing. thanks Janne