Re: [Linux kernel bug] general protection fault in disable_store

Sam Sun <samsun1006219@xxxxxxxxx> · Sat, 13 Apr 2024 00:26:07 +0800

On Fri, Apr 12, 2024 at 10:40 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Apr 12, 2024 at 09:08:12PM +0800, Sam Sun wrote:
> > Sorry for the mistake I made when debugging this bug. Now I have more
> > information about it. Disassembly of function disable_store() in the
> > latest upstream kernel is listed below.
> > ```
> > Dump of assembler code for function disable_store:
> >    ...
> >    0xffffffff86e907eb <+187>:   lea    -0x8(%r14),%r12
> >    0xffffffff86e907ef <+191>:   mov    (%rbx),%rax
> >    0xffffffff86e907f2 <+194>:   mov    %rax,0x20(%rsp)
> >    0xffffffff86e907f7 <+199>:   lea    -0xa8(%rax),%rdi
> >    0xffffffff86e907fe <+206>:   mov    %rdi,0x18(%rsp)
> > >    0xffffffff86e90803 <+211>:   call   0xffffffff86e20220 <usb_hub_to_struct_hub>
> >    0xffffffff86e90808 <+216>:   mov    %rax,%rbx
> >    0xffffffff86e9080b <+219>:   shr    $0x3,%rax
> >    0xffffffff86e9080f <+223>:   movabs $0xdffffc0000000000,%rcx
> >    0xffffffff86e90819 <+233>:   cmpb   $0x0,(%rax,%rcx,1)
> >    0xffffffff86e9081d <+237>:   je     0xffffffff86e90827 <disable_store+247>
> >    0xffffffff86e9081f <+239>:   mov    %rbx,%rdi
> > >    0xffffffff86e90822 <+242>:   call   0xffffffff81eeb0b0 <__asan_report_load8_noabort>
> >    0xffffffff86e90827 <+247>:   lea    0x60(%rsp),%rsi
> >    ...
> > ```
> > > The cmpb at disable_store()<+233> is generated by KASAN to check the
> > > shadow memory status. If the shadow byte equals 0, the 8-byte load is
> > > valid and the KASAN check passes. However, this time rax is 0, so the
> > > check itself triggers a general protection fault first, since
> > > 0xdffffc0000000000 is not a valid address. rax holds the return value
> > > of usb_hub_to_struct_hub(), which in this case is NULL.
> >
> > > In usb_hub_to_struct_hub(), I checked hdev and its members, and they
> > > are not NULL. Is it possible that usb_deauthorize_device() sets
> > > hdev->actconfig->interface[0]->dev.driver_data to NULL? I cannot
> > > confirm that, since every time I set a breakpoint in the code it
> > > crashes differently.
>
> I suspect the usb_hub_to_struct_hub() call is racing with the
> spinlock-protected region in hub_disconnect() (in hub.c).
>
> > > If there is anything else I can help with, please let me know.
>
> Try the patch below.  It should eliminate that race, which hopefully
> will fix the problem.
>
> Alan Stern
>
>
>
> Index: usb-devel/drivers/usb/core/hub.c
> ===================================================================
> --- usb-devel.orig/drivers/usb/core/hub.c
> +++ usb-devel/drivers/usb/core/hub.c
> @@ -72,6 +72,9 @@
>   * change to USB_STATE_NOTATTACHED even when the semaphore isn't held. */
>  static DEFINE_SPINLOCK(device_state_lock);
>
> +/* Protect hdev->maxchild and hub's intfdata */
> +static DEFINE_SPINLOCK(hub_state_lock);
> +
>  /* workqueue to process hub events */
>  static struct workqueue_struct *hub_wq;
>  static void hub_event(struct work_struct *work);
> @@ -152,9 +155,13 @@ static inline char *portspeed(struct usb
>  /* Note that hdev or one of its children must be locked! */
>  struct usb_hub *usb_hub_to_struct_hub(struct usb_device *hdev)
>  {
> -       if (!hdev || !hdev->actconfig || !hdev->maxchild)
> -               return NULL;
> -       return usb_get_intfdata(hdev->actconfig->interface[0]);
> +       struct usb_hub *hub = NULL;
> +
> +       spin_lock_irq(&hub_state_lock);
> +       if (hdev && hdev->actconfig && hdev->maxchild)
> +               hub = usb_get_intfdata(hdev->actconfig->interface[0]);
> +       spin_unlock_irq(&hub_state_lock);
> +       return hub;
>  }
>
>  int usb_device_supports_lpm(struct usb_device *udev)
> @@ -1714,7 +1721,9 @@ static int hub_configure(struct usb_hub
>                         break;
>                 }
>         }
> +       spin_lock_irq(&hub_state_lock);
>         hdev->maxchild = i;
> +       spin_unlock_irq(&hub_state_lock);
>         for (i = 0; i < hdev->maxchild; i++) {
>                 struct usb_port *port_dev = hub->ports[i];
>
> @@ -1790,9 +1799,11 @@ static void hub_disconnect(struct usb_in
>
>         /* Avoid races with recursively_mark_NOTATTACHED() */
>         spin_lock_irq(&device_state_lock);
> +       spin_lock(&hub_state_lock);
>         port1 = hdev->maxchild;
>         hdev->maxchild = 0;
>         usb_set_intfdata(intf, NULL);
> +       spin_unlock(&hub_state_lock);
>         spin_unlock_irq(&device_state_lock);
>
>         for (; port1 > 0; --port1)
>

I applied this patch and re-ran my test several times; the kernel no
longer crashes in my environment. I think this bug is fixed by the
patch. But I do have one more question. Since this is a data race, it
was not reliably reproducible to begin with. How can I confirm by
testing that a race bug is fixed? A bug like this might still have a
race window that is just harder to trigger. Just curious, not about
this patch; I think this patch eliminates the race window.

Best,
Yue