Re: INFO: task hung in unregister_netdevice_notifier (3)

Cong Wang <xiyou.wangcong@xxxxxxxxx> · Mon, 15 Jul 2019 11:58:57 -0700

On Mon, Jul 15, 2019 at 10:23 AM Oliver Hartkopp <socketcan@xxxxxxxxxxxx> wrote:
>
> Hello all,
>
> On 14.07.19 06:07, syzbot wrote:
> > syzbot has found a reproducer for the following crash on:
>
> the internal users of the CAN networking subsystem like CAN_BCM and
> CAN_RAW hold a number of CAN identifier subscriptions ('filters') for
> CAN netdevices (only type ARPHRD_CAN) in their socket data structures.
>
> The per-socket netdevice notifier is used to manage the ad-hoc removal
> of these filters at netdevice removal time.
>
> What I can see in the console output at
>
> https://syzkaller.appspot.com/x/log.txt?x=10e45f0fa00000
>
> seems to be a race between an unknown register_netdevice_notifier() call
> ("A") and the unregister_netdevice_notifier() ("B") likely invoked by
> bcm_release() ("C"):
>
> [ 1047.294207][ T1049]  schedule+0xa8/0x270
> [ 1047.318401][ T1049]  rwsem_down_write_slowpath+0x70a/0xf70
> [ 1047.324114][ T1049]  ? downgrade_write+0x3c0/0x3c0
> [ 1047.438644][ T1049]  ? mark_held_locks+0xf0/0xf0
> [ 1047.443483][ T1049]  ? lock_acquire+0x190/0x410
> [ 1047.448191][ T1049]  ? unregister_netdevice_notifier+0x7e/0x390
> [ 1047.547227][ T1049]  down_write+0x13c/0x150
> [ 1047.579535][ T1049]  ? down_write+0x13c/0x150
> [ 1047.584106][ T1049]  ? __down_timeout+0x2d0/0x2d0
> [ 1047.635356][ T1049]  ? mark_held_locks+0xf0/0xf0
> [ 1047.640721][ T1049]  unregister_netdevice_notifier+0x7e/0x390  <- "B"
> [ 1047.646667][ T1049]  ? __sock_release+0x89/0x280
> [ 1047.709126][ T1049]  ? register_netdevice_notifier+0x630/0x630 <- "A"
> [ 1047.715203][ T1049]  ? __kasan_check_write+0x14/0x20
> [ 1047.775138][ T1049]  bcm_release+0x93/0x5e0                    <- "C"
> [ 1047.795337][ T1049]  __sock_release+0xce/0x280
> [ 1047.829016][ T1049]  sock_close+0x1e/0x30
>
> The question to me is now:
>
> Is the problem located in an (un)register_netdevice_notifier race OR is
> it generally a bad idea to call unregister_netdevice_notifier() in a
> sock release?

To me it doesn't look like anything wrong in CAN. If you look at a few
more reports from syzbot for the same bug, it actually indicates
something else is holding the pernet_ops_rwsem which caused this
hung task.

When NMI is kicked, it shows nf_ct_iterate_cleanup() was getting
stuck:

NMI backtrace for cpu 0
CPU: 0 PID: 1044 Comm: khungtaskd Not tainted 5.2.0-rc5+ #42
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
 nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
 trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
 check_hung_uninterruptible_tasks kernel/hung_task.c:205 [inline]
 watchdog+0x9b7/0xec0 kernel/hung_task.c:289
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 6877 Comm: kworker/u4:1 Not tainted 5.2.0-rc5+ #42
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Workqueue: netns cleanup_net
RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x50 kernel/kcov.c:95
Code: 89 25 e4 bc 15 09 41 bc f4 ff ff ff e8 6d 04 ea ff 48 c7 05 ce
bc 15 09 00 00 00 00 e9 a4 e9 ff ff 90 90 90 90 90 90 90 90 90 <55> 48
89 e5 48 8b 75 08 65 48 8b 04 25 c0 fd 01 00 65 8b 15 f0 3a
RSP: 0018:ffff888088fe7ac0 EFLAGS: 00000046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff817637c6
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: ffff888088fe7af8 R08: ffff888096406340 R09: fffffbfff118bda9
R10: fffffbfff118bda8 R11: ffffffff88c5ed43 R12: ffffffff85b86d11
R13: ffff888096406340 R14: 0000000000000001 R15: dffffc0000000000
FS:  0000000000000000(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 0000000099e59000 CR4: 00000000001406e0
Call Trace:
 __local_bh_enable_ip+0x11a/0x270 kernel/softirq.c:171
 local_bh_enable include/linux/bottom_half.h:32 [inline]
 get_next_corpse net/netfilter/nf_conntrack_core.c:2015 [inline]
 nf_ct_iterate_cleanup+0x217/0x4e0 net/netfilter/nf_conntrack_core.c:2038
 nf_conntrack_cleanup_net_list+0x7a/0x240 net/netfilter/nf_conntrack_core.c:2225
 nf_conntrack_pernet_exit+0x159/0x1a0
net/netfilter/nf_conntrack_standalone.c:1151
 ops_exit_list.isra.0+0xfc/0x150 net/core/net_namespace.c:168
 cleanup_net+0x4e2/0xa40 net/core/net_namespace.c:575
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352