Re: [PATCH v3] usb: gadget: u_ether: Use __netif_rx() in rx_callback()

Hubert Wiśniewski <hubert.wisniewski.25632@xxxxxxxxx> · Thu, 26 Sep 2024 21:39:07 +0200

On Tue, 2024-09-17 at 16:36 +0200, Sebastian Andrzej Siewior wrote:
> On 2024-09-17 15:56:33 [+0200], Hubert Wiśniewski wrote:
> > netif_rx() now disables bottom halves, which causes the USB gadget to be
> > unable to receive frames if the interface is not brought up quickly enough
> > after being created by the driver (a bug confirmed on AM3352 SoC).
> > 
> > Replacing netif_rx() with __netif_rx() restores the old behavior and fixes
> > the bug. This can be done since rx_callback() is called from the interrupt
> > context.
> > 
> > Fixes: baebdf48c360 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
> > Cc: stable@xxxxxxxxxxxxxxx
> > Signed-off-by: Hubert Wiśniewski <hubert.wisniewski.25632@xxxxxxxxx>
> 
> Now that I see v3, my v2 question stands.
> This supposed only to disable BH if invoked from from non-interrupt.
> hardirq and softirq should be good. A backtrace would be nice and further
> explanation how this becomes a problem. Also lockdep should complain at
> some point.
> 
> Sebastian

This lockup seems to occur only on AM335x SoCs. If the interface is brought
up immediately after it is created:

# modprobe g_ether; ip link set usb0 up

it works fine. But if there is some delay:

# modprobe g_ether; sleep 5; ip link set usb0 up

the interface is unable to receive any frames from the USB host (but the
outgoing frames are sent correctly).
This becomes a problem when the g_ether module is loaded early in the boot
process or it is built in, and the interface is configured later (e.g. by
ifupdown).
The same thing happens when using configfs to create and configure an
Ethernet gadget.

I have built the kernel with CONFIG_PROVE_LOCKING=y and got the following
messages upon bringing the interface up:

Backtrace (timestamps stripped):

 WARNING: CPU: 0 PID: 86 at kernel/softirq.c:362 __local_bh_enable_ip+0x118/0x198
 CPU: 0 UID: 0 PID: 86 Comm: ip Not tainted 6.11.0 #3
 Hardware name: Generic AM33XX (Flattened Device Tree)
 Call trace:
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x68/0x88
  dump_stack_lvl from __warn+0x70/0x1ac
  __warn from warn_slowpath_fmt+0x12c/0x1c4
  warn_slowpath_fmt from __local_bh_enable_ip+0x118/0x198
  __local_bh_enable_ip from netif_rx+0x118/0x208
  netif_rx from rx_complete+0x140/0x27c
  rx_complete from musb_g_giveback+0xf0/0x1d8
  musb_g_giveback from musb_ep_restart_resume_work+0x8/0x10
  musb_ep_restart_resume_work from musb_gadget_queue+0x1c8/0x498
  musb_gadget_queue from usb_ep_queue+0x38/0x134
  usb_ep_queue from rx_submit+0xdc/0x1cc
  rx_submit from rx_fill+0x7c/0xa0
  rx_fill from eth_start+0x30/0x54
  eth_start from eth_open+0x40/0x80
  eth_open from __dev_open+0xe0/0x174
  __dev_open from __dev_change_flags+0x160/0x1d0
  __dev_change_flags from dev_change_flags+0x1c/0x58
  dev_change_flags from devinet_ioctl+0x6d0/0x884
  devinet_ioctl from inet_ioctl+0x1ac/0x2bc
  inet_ioctl from sock_ioctl+0x2c0/0x3b8
  sock_ioctl from sys_ioctl+0x194/0xf88
  sys_ioctl from ret_fast_syscall+0x0/0x1c
 Exception stack(0xe1485fa8 to 0xe1485ff0)
 5fa0:                   000b3630 00000001 00000003 00008914 be802cd8 be802c90
 5fc0: 000b3630 00000001 00000003 00000036 ffffffff ffffffff ffffffff be802ed4
 5fe0: 000c74cc be802c80 00028000 b6e7d14c
 irq event stamp: 2581
 hardirqs last  enabled at (2579): [<c13e10cc>] _raw_spin_unlock_irqrestore+0x44/0x48
 hardirqs last disabled at (2580): [<c13e0e94>] _raw_spin_lock_irqsave+0x64/0x68
 softirqs last  enabled at (2558): [<c0e727b0>] __dev_change_flags+0x7c/0x1d0
 softirqs last disabled at (2581): [<c0e68dd8>] netif_rx+0xa4/0x208

Lockdep:

 ================================
 WARNING: inconsistent lock state
 6.11.0 #3 Tainted: G        W
 --------------------------------
 inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
 ip/86 [HC0[0]:SC0[0]:HE1:SE1] takes:
 c3e54050 (&musb->lock){?.-.}-{2:2}, at: musb_g_giveback+0xf8/0x1d8
 {IN-HARDIRQ-W} state was registered at:
   lock_acquire+0x108/0x358
   _raw_spin_lock_irqsave+0x4c/0x68
   dsps_interrupt+0x28/0x278
   __handle_irq_event_percpu+0xa0/0x2f4
   handle_irq_event_percpu+0xc/0x40
   handle_irq_event+0x38/0xcc
   handle_level_irq+0xb4/0x13c
   handle_irq_desc+0x1c/0x2c
   generic_handle_arch_irq+0x2c/0x64
   call_with_stack+0x18/0x20
   __irq_svc+0x9c/0xbc
   console_flush_all+0x25c/0x5c0
   console_unlock+0x80/0x114
   vprintk_emit+0x248/0x354
   dev_vprintk_emit+0x110/0x144
   dev_printk_emit+0x28/0x50
   __dev_printk+0x74/0x90
   _dev_warn+0x3c/0x68
   _regulator_get+0x1f8/0x31c
   sdhci_omap_regulator_get_caps+0x8/0x84
   sdhci_omap_probe+0x208/0x788
   platform_probe+0x58/0xb8
   really_probe+0xc4/0x28c
   __driver_probe_device+0x84/0x194
   driver_probe_device+0x30/0xc8
   __device_attach_driver+0xa4/0xe0
   bus_for_each_drv+0x80/0xd0
   __device_attach_async_helper+0xa8/0xdc
   async_run_entry_fn+0x20/0xb4
   process_scheduled_works+0x254/0x6e8
   worker_thread+0x13c/0x340
   kthread+0xf4/0x114
   ret_from_fork+0x14/0x24
 irq event stamp: 2649
 hardirqs last  enabled at (2649): [<c13e10cc>] _raw_spin_unlock_irqrestore+0x44/0x48
 hardirqs last disabled at (2648): [<c13e0e94>] _raw_spin_lock_irqsave+0x64/0x68
 softirqs last  enabled at (2644): [<c014b410>] handle_softirqs+0x260/0x45c
 softirqs last disabled at (2613): [<c1385f28>] call_with_stack+0x18/0x20

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&musb->lock);
   <Interrupt>
     lock(&musb->lock);

  *** DEADLOCK ***

 1 lock held by ip/86:
  #0: c1d60ca4 (rtnl_mutex){+.+.}-{3:3}, at: devinet_ioctl+0xc4/0x884

 stack backtrace:
 CPU: 0 UID: 0 PID: 86 Comm: ip Tainted: G        W          6.11.0 #3
 Tainted: [W]=WARN
 Hardware name: Generic AM33XX (Flattened Device Tree)
 Call trace:
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x68/0x88
  dump_stack_lvl from mark_lock.part.17+0x3cc/0x440
  mark_lock.part.17 from __lock_acquire+0x44c/0x22b0
  __lock_acquire from lock_acquire+0x108/0x358
  lock_acquire from _raw_spin_lock+0x38/0x48
  _raw_spin_lock from musb_g_giveback+0xf8/0x1d8
  musb_g_giveback from musb_ep_restart_resume_work+0x8/0x10
  musb_ep_restart_resume_work from musb_gadget_queue+0x1c8/0x498
  musb_gadget_queue from usb_ep_queue+0x38/0x134
  usb_ep_queue from rx_submit+0xdc/0x1cc
  rx_submit from rx_fill+0x7c/0xa0
  rx_fill from eth_start+0x30/0x54
  eth_start from eth_open+0x40/0x80
  eth_open from __dev_open+0xe0/0x174
  __dev_open from __dev_change_flags+0x160/0x1d0
  __dev_change_flags from dev_change_flags+0x1c/0x58
  dev_change_flags from devinet_ioctl+0x6d0/0x884
  devinet_ioctl from inet_ioctl+0x1ac/0x2bc
  inet_ioctl from sock_ioctl+0x2c0/0x3b8
  sock_ioctl from sys_ioctl+0x194/0xf88
  sys_ioctl from ret_fast_syscall+0x0/0x1c
 Exception stack(0xe1485fa8 to 0xe1485ff0)
 5fa0:                   000b3630 00000001 00000003 00008914 be802cd8 be802c90
 5fc0: 000b3630 00000001 00000003 00000036 ffffffff ffffffff ffffffff be802ed4
 5fe0: 000c74cc be802c80 00028000 b6e7d14c

Surprisingly, with the patch I suggested, there is still a backtrace
generated, but without lockdep warning:

 WARNING: CPU: 0 PID: 87 at net/core/dev.c:5207 __netif_rx+0xd4/0x210
 CPU: 0 UID: 0 PID: 87 Comm: ip Not tainted 6.11.0-dirty #4
 Hardware name: Generic AM33XX (Flattened Device Tree)
 Call trace:
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x68/0x88
  dump_stack_lvl from __warn+0x70/0x1ac
  __warn from warn_slowpath_fmt+0x12c/0x1c4
  warn_slowpath_fmt from __netif_rx+0xd4/0x210
  __netif_rx from rx_complete+0x140/0x27c
  rx_complete from musb_g_giveback+0xf0/0x1d8
  musb_g_giveback from musb_ep_restart_resume_work+0x8/0x10
  musb_ep_restart_resume_work from musb_gadget_queue+0x1c8/0x498
  musb_gadget_queue from usb_ep_queue+0x38/0x134
  usb_ep_queue from rx_submit+0xdc/0x1cc
  rx_submit from rx_fill+0x7c/0xa0
  rx_fill from eth_start+0x30/0x54
  eth_start from eth_open+0x40/0x80
  eth_open from __dev_open+0xe0/0x174
  __dev_open from __dev_change_flags+0x160/0x1d0
  __dev_change_flags from dev_change_flags+0x1c/0x58
  dev_change_flags from devinet_ioctl+0x6d0/0x884
  devinet_ioctl from inet_ioctl+0x1ac/0x2bc
  inet_ioctl from sock_ioctl+0x2c0/0x3b8
  sock_ioctl from sys_ioctl+0x194/0xf88
  sys_ioctl from ret_fast_syscall+0x0/0x1c
 Exception stack(0xe148dfa8 to 0xe148dff0)
 dfa0:                   000b3630 00000001 00000003 00008914 be8c8cd8 be8c8c90
 dfc0: 000b3630 00000001 00000003 00000036 ffffffff ffffffff ffffffff be8c8ed4
 dfe0: 000c74cc be8c8c80 00028000 b6ed614c
 irq event stamp: 2542
 hardirqs last  enabled at (2541): [<c13e10cc>] _raw_spin_unlock_irqrestore+0x44/0x48
 hardirqs last disabled at (2542): [<c13e0e94>] _raw_spin_lock_irqsave+0x64/0x68
 softirqs last  enabled at (2520): [<c0e727b0>] __dev_change_flags+0x7c/0x1d0
 softirqs last disabled at (2518): [<c0e7236c>] dev_set_rx_mode+0x0/0x40

I'm a bit at loss here. The deadlock seems to be unrelated to netif_rx()
(which is not being called in the interrupt context after all), yet
replacing it with __netif_rx() fixes the lockup (though a warning is still
generated, which suggests that the patch does not completely fix the
issue).

-- 
Hubert Wiśniewski <hubert.wisniewski.25632@xxxxxxxxx>