On Sat, Apr 13, 2024 at 01:08:41PM +0800, Sam Sun wrote:
> On Sat, Apr 13, 2024 at 2:11 AM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Sat, Apr 13, 2024 at 12:26:07AM +0800, Sam Sun wrote:
> > > On Fri, Apr 12, 2024 at 10:40 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > I suspect the usb_hub_to_struct_hub() call is racing with the
> > > > spinlock-protected region in hub_disconnect() (in hub.c).
> > > >
> > > > > If there is any other thing I could help, please let me know.
> > > >
> > > > Try the patch below. It should eliminate that race, which hopefully
> > > > will fix the problem.
> >
> > > I applied this patch and tried to execute several times, no more
> > > kernel core dump in my environment. I think this bug is fixed by the
> > > patch. But I do have one more question about it. Since it is a data
> > > race bug, it has reproducibility issues originally. How can I confirm
> > > if a racy bug is fixed by test? This kind of bug might still have a
> > > race window but is harder to trigger. Just curious, not for this
> > > patch. I think this patch eliminates the racy window.
> >
> > If you don't know what is racing, then testing cannot prove that a race
> > is eliminated. However, if you do know where a race occurs then it's
> > easy to see how mutual exclusion can prevent the race from happening.
> >
> > In this case the bug might have had a different cause, something other
> > than a race between usb_hub_to_struct_hub() and hub_disconnect(). If
> > that's so then testing this patch would not be a definite proof that the
> > bug is gone. But if that race _is_ the cause of the bug then this patch
> > will fix it -- you can see that just by reading the code with no need
> > for testing.
> >
> > Besides, the patch is needed in any case because that race certainly
> > _can_ occur. And maybe not only on this pathway.
> >
>
> Thanks for explaining! I will check the related code next time.
>
> > May I add your "Reported-and-tested-by:" to the patch?
>
> Sure, thanks for your help!

Actually, I've got a completely different patch which I think will fix
the problem you encountered. Instead of using mutual exclusion to avoid
the race, it prevents the two routines from being called at the same
time so the race can't occur in the first place. It also should
guarantee that usb_hub_to_struct_hub() doesn't return NULL when
disable_store() calls it.

Can you try the patch below, instead of (not along with) the first
patch? Thanks.

Alan Stern



Index: usb-devel/drivers/usb/core/hub.c
===================================================================
--- usb-devel.orig/drivers/usb/core/hub.c
+++ usb-devel/drivers/usb/core/hub.c
@@ -1788,16 +1788,15 @@ static void hub_disconnect(struct usb_in
 
 	mutex_lock(&usb_port_peer_mutex);
 
+	for (port1 = hdev->maxchild; port1 > 0; --port1)
+		usb_hub_remove_port_device(hub, port1);
+
 	/* Avoid races with recursively_mark_NOTATTACHED() */
 	spin_lock_irq(&device_state_lock);
-	port1 = hdev->maxchild;
 	hdev->maxchild = 0;
 	usb_set_intfdata(intf, NULL);
 	spin_unlock_irq(&device_state_lock);
 
-	for (; port1 > 0; --port1)
-		usb_hub_remove_port_device(hub, port1);
-
 	mutex_unlock(&usb_port_peer_mutex);
 
 	if (hub->hdev->speed == USB_SPEED_HIGH)
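
For illustration only, the reordering can be modelled in a small
userspace sketch. Everything below is hypothetical (struct model_hub,
struct model_intf, and the two disconnect_*_order() helpers are
stand-ins, not kernel code), and it is single-threaded, so it only
checks an ordering invariant rather than reproducing the race. It
assumes that once a port is removed its attribute callbacks can no
longer run. The invariant is: while any port attribute is still
callable, the hub lookup must not be NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PORTS 4

/* Hypothetical stand-ins for the kernel objects; not the real structures. */
struct model_hub {
	int maxchild;
	bool port_registered[MAX_PORTS + 1];	/* 1-based, like port1 */
};

struct model_intf {
	struct model_hub *intfdata;	/* what the hub lookup would return */
};

static void model_init(struct model_intf *intf, struct model_hub *hub)
{
	hub->maxchild = MAX_PORTS;
	hub->port_registered[0] = false;
	for (int port1 = 1; port1 <= MAX_PORTS; port1++)
		hub->port_registered[port1] = true;
	intf->intfdata = hub;
}

/* True if no still-registered port could observe a NULL hub pointer. */
static bool invariant_holds(const struct model_intf *intf,
			    const struct model_hub *hub)
{
	for (int port1 = 1; port1 <= MAX_PORTS; port1++)
		if (hub->port_registered[port1] && intf->intfdata == NULL)
			return false;
	return true;
}

/* Old ordering: clear the pointer first, then remove the ports.
 * Returns the number of steps at which the invariant was violated. */
static int disconnect_old_order(struct model_intf *intf, struct model_hub *hub)
{
	int violations = 0;
	int port1 = hub->maxchild;

	hub->maxchild = 0;
	intf->intfdata = NULL;
	violations += !invariant_holds(intf, hub);

	for (; port1 > 0; --port1) {
		hub->port_registered[port1] = false;
		violations += !invariant_holds(intf, hub);
	}
	return violations;
}

/* New ordering: remove the ports first, then clear the pointer. */
static int disconnect_new_order(struct model_intf *intf, struct model_hub *hub)
{
	int violations = 0;

	for (int port1 = hub->maxchild; port1 > 0; --port1) {
		/* A callback still running here would see a valid hub. */
		assert(intf->intfdata != NULL);
		hub->port_registered[port1] = false;
		violations += !invariant_holds(intf, hub);
	}

	hub->maxchild = 0;
	intf->intfdata = NULL;
	violations += !invariant_holds(intf, hub);
	return violations;
}
```

In this model the old ordering leaves a window in which ports are still
registered but the lookup already returns NULL, while the new ordering
never does. Whether the real code behaves the same way depends on the
assumption stated above about port removal.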