Re: [syzbot] [rdma?] WARNING in ib_uverbs_release_dev

Leon Romanovsky <leon@xxxxxxxxxx> · Wed, 19 Jun 2024 20:48:02 +0300

On Wed, Jun 19, 2024 at 10:16:20PM +0800, Zhu Yanjun wrote:
> 在 2024/6/19 17:15, Leon Romanovsky 写道:
> > On Tue, Jun 18, 2024 at 11:37:18PM -0700, syzbot wrote:
> > > Hello,
> > > 
> > > syzbot found the following issue on:
> > > 
> > > HEAD commit:    2ccbdf43d5e7 Merge tag 'for-linus' of git://git.kernel.org..
> > > git tree:       upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=179e93fe980000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=fa0ce06dcc735711
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=19ec7595e3aa1a45f623
> > > compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > > 
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > > 
> > > Downloadable assets:
> > > disk image: https://storage.googleapis.com/syzbot-assets/27e64d7472ce/disk-2ccbdf43.raw.xz
> > > vmlinux: https://storage.googleapis.com/syzbot-assets/e1c494bb5c9c/vmlinux-2ccbdf43.xz
> > > kernel image: https://storage.googleapis.com/syzbot-assets/752498985a5e/bzImage-2ccbdf43.xz
> > > 
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+19ec7595e3aa1a45f623@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > 
> > > smc: removing ib device syz0
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 0 PID: 51 at kernel/rcu/srcutree.c:653 cleanup_srcu_struct+0x404/0x4d0 kernel/rcu/srcutree.c:653
> > > Modules linked in:
> > > CPU: 0 PID: 51 Comm: kworker/u8:3 Not tainted 6.10.0-rc3-syzkaller-00044-g2ccbdf43d5e7 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024
> > > Workqueue: ib-unreg-wq ib_unregister_work
> > > RIP: 0010:cleanup_srcu_struct+0x404/0x4d0 kernel/rcu/srcutree.c:653
> > > Code: 12 80 00 48 c7 03 00 00 00 00 48 83 c4 48 5b 41 5c 41 5d 41 5e 41 5f 5d e9 14 67 34 0a 90 0f 0b 90 eb e7 90 0f 0b 90 eb e1 90 <0f> 0b 90 eb db 90 0f 0b 90 eb 0a 90 0f 0b 90 eb 04 90 0f 0b 90 48
> > > RSP: 0018:ffffc90000bb7970 EFLAGS: 00010202
> > > RAX: 0000000000000001 RBX: ffff88802a1bc980 RCX: 0000000000000002
> > > RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffe8ffffd74c58
> > > RBP: 0000000000000001 R08: ffffe8ffffd74c5f R09: 1ffffd1ffffae98b
> > > R10: dffffc0000000000 R11: fffff91ffffae98c R12: dffffc0000000000
> > > R13: ffff88802285b5f0 R14: ffff88802285b000 R15: ffff88802a1bc800
> > > FS:  0000000000000000(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007fa3852cae10 CR3: 000000000e132000 CR4: 0000000000350ef0
> > > Call Trace:
> > >   <TASK>
> > >   ib_uverbs_release_dev+0x4e/0x80 drivers/infiniband/core/uverbs_main.c:136
> > >   device_release+0x9b/0x1c0
> > >   kobject_cleanup lib/kobject.c:689 [inline]
> > >   kobject_release lib/kobject.c:720 [inline]
> > >   kref_put include/linux/kref.h:65 [inline]
> > >   kobject_put+0x231/0x480 lib/kobject.c:737
> > >   remove_client_context+0xb9/0x1e0 drivers/infiniband/core/device.c:776
> > >   disable_device+0x13b/0x360 drivers/infiniband/core/device.c:1282
> > >   __ib_unregister_device+0x6d/0x170 drivers/infiniband/core/device.c:1475
> > >   ib_unregister_work+0x19/0x30 drivers/infiniband/core/device.c:1586
> > >   process_one_work kernel/workqueue.c:3231 [inline]
> > >   process_scheduled_works+0xa2e/0x1830 kernel/workqueue.c:3312
> > >   worker_thread+0x86d/0xd70 kernel/workqueue.c:3393
> > >   kthread+0x2f2/0x390 kernel/kthread.c:389
> > >   ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
> > >   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
> > >   </TASK>
> > 
> > I see that this is caused by call to ib_unregister_device_queued() as a
> > response to NETDEV_UNREGISTER event, but we don't flush anything before.
> > How can we be sure that ib_device is not used anymore?
> 
> Hi, Leon
> 
> This is the console output:
> 
> https://syzkaller.appspot.com/x/log.txt?x=179e93fe980000
> 
> From the above link, it seems that other devices or subsystems failed
> firstly, then caused this call trace to appear. When other problem occurred,
> the whole kernel system was in mess state.So it is not weird that some
> problems occurred.

Which devices/subsystems failed? I grepped the log and don't see
anything suspicious, before first "------------[ cut here ]------------"
sentence.

> 
> To be simple, the root cause is not in RDMA subsystem.
> 
> I will continue to delve into this problem.
> 
> Zhu Yanjun
> > 
> > Thanks
>