On Mon, Jul 15, 2019 at 10:23 AM Oliver Hartkopp <socketcan@xxxxxxxxxxxx> wrote: > > Hello all, > > On 14.07.19 06:07, syzbot wrote: > > syzbot has found a reproducer for the following crash on: > > the internal users of the CAN networking subsystem like CAN_BCM and > CAN_RAW hold a number of CAN identifier subscriptions ('filters') for > CAN netdevices (only type ARPHRD_CAN) in their socket data structures. > > The per-socket netdevice notifier is used to manage the ad-hoc removal > of these filters at netdevice removal time. > > What I can see in the console output at > > https://syzkaller.appspot.com/x/log.txt?x=10e45f0fa00000 > > seems to be a race between an unknown register_netdevice_notifier() call > ("A") and the unregister_netdevice_notifier() ("B") likely invoked by > bcm_release() ("C"): > > [ 1047.294207][ T1049] schedule+0xa8/0x270 > [ 1047.318401][ T1049] rwsem_down_write_slowpath+0x70a/0xf70 > [ 1047.324114][ T1049] ? downgrade_write+0x3c0/0x3c0 > [ 1047.438644][ T1049] ? mark_held_locks+0xf0/0xf0 > [ 1047.443483][ T1049] ? lock_acquire+0x190/0x410 > [ 1047.448191][ T1049] ? unregister_netdevice_notifier+0x7e/0x390 > [ 1047.547227][ T1049] down_write+0x13c/0x150 > [ 1047.579535][ T1049] ? down_write+0x13c/0x150 > [ 1047.584106][ T1049] ? __down_timeout+0x2d0/0x2d0 > [ 1047.635356][ T1049] ? mark_held_locks+0xf0/0xf0 > [ 1047.640721][ T1049] unregister_netdevice_notifier+0x7e/0x390 <- "B" > [ 1047.646667][ T1049] ? __sock_release+0x89/0x280 > [ 1047.709126][ T1049] ? register_netdevice_notifier+0x630/0x630 <- "A" > [ 1047.715203][ T1049] ? __kasan_check_write+0x14/0x20 > [ 1047.775138][ T1049] bcm_release+0x93/0x5e0 <- "C" > [ 1047.795337][ T1049] __sock_release+0xce/0x280 > [ 1047.829016][ T1049] sock_close+0x1e/0x30 > > The question to me is now: > > Is the problem located in an (un)register_netdevice_notifier race OR is > it generally a bad idea to call unregister_netdevice_notifier() in a > sock release? To me it doesn't look like anything wrong in CAN. If you look at a few more reports from syzbot for the same bug, it actually indicates something else is holding the pernet_ops_rwsem which caused this hung task. When NMI is kicked, it shows nf_ct_iterate_cleanup() was getting stuck: NMI backtrace for cpu 0 CPU: 0 PID: 1044 Comm: khungtaskd Not tainted 5.2.0-rc5+ #42 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101 nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38 trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline] check_hung_uninterruptible_tasks kernel/hung_task.c:205 [inline] watchdog+0x9b7/0xec0 kernel/hung_task.c:289 kthread+0x354/0x420 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Sending NMI from CPU 0 to CPUs 1: NMI backtrace for cpu 1 CPU: 1 PID: 6877 Comm: kworker/u4:1 Not tainted 5.2.0-rc5+ #42 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x50 kernel/kcov.c:95 Code: 89 25 e4 bc 15 09 41 bc f4 ff ff ff e8 6d 04 ea ff 48 c7 05 ce bc 15 09 00 00 00 00 e9 a4 e9 ff ff 90 90 90 90 90 90 90 90 90 <55> 48 89 e5 48 8b 75 08 65 48 8b 04 25 c0 fd 01 00 65 8b 15 f0 3a RSP: 0018:ffff888088fe7ac0 EFLAGS: 00000046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff817637c6 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: ffff888088fe7af8 R08: ffff888096406340 R09: fffffbfff118bda9 R10: fffffbfff118bda8 R11: ffffffff88c5ed43 R12: ffffffff85b86d11 R13: ffff888096406340 R14: 0000000000000001 R15: dffffc0000000000 FS: 0000000000000000(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffff600400 CR3: 0000000099e59000 CR4: 00000000001406e0 Call Trace: __local_bh_enable_ip+0x11a/0x270 kernel/softirq.c:171 local_bh_enable include/linux/bottom_half.h:32 [inline] get_next_corpse net/netfilter/nf_conntrack_core.c:2015 [inline] nf_ct_iterate_cleanup+0x217/0x4e0 net/netfilter/nf_conntrack_core.c:2038 nf_conntrack_cleanup_net_list+0x7a/0x240 net/netfilter/nf_conntrack_core.c:2225 nf_conntrack_pernet_exit+0x159/0x1a0 net/netfilter/nf_conntrack_standalone.c:1151 ops_exit_list.isra.0+0xfc/0x150 net/core/net_namespace.c:168 cleanup_net+0x4e2/0xa40 net/core/net_namespace.c:575 process_one_work+0x989/0x1790 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x354/0x420 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352