> -----Original Message----- > From: Guoqing Jiang <guoqing.jiang@xxxxxxxxx> > Sent: Friday, 28 July 2023 04:29 > To: Jason Gunthorpe <jgg@xxxxxxxx>; Bernard Metzler <BMT@xxxxxxxxxxxxxx> > Cc: leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx > Subject: [EXTERNAL] Re: [PATCH 0/5] Fix potential issues for siw > > > > On 7/28/23 09:16, Guoqing Jiang wrote: > > > > > > On 7/28/23 01:29, Jason Gunthorpe wrote: > >> On Thu, Jul 27, 2023 at 05:17:40PM +0000, Bernard Metzler wrote: > >>> > >>>> -----Original Message----- > >>>> From: Guoqing Jiang <guoqing.jiang@xxxxxxxxx> > >>>> Sent: Thursday, 27 July 2023 16:04 > >>>> To: Bernard Metzler <BMT@xxxxxxxxxxxxxx>; jgg@xxxxxxxx; > >>>> leon@xxxxxxxxxx > >>>> Cc: linux-rdma@xxxxxxxxxxxxxxx > >>>> Subject: [EXTERNAL] [PATCH 0/5] Fix potential issues for siw > >>>> > >>>> Hi, > >>>> > >>>> Several issues appeared if we rmmod siw module after failed to insert > >>>> the module (with manual change like below). > >>>> > >>>> --- a/drivers/infiniband/sw/siw/siw_main.c > >>>> +++ b/drivers/infiniband/sw/siw/siw_main.c > >>>> @@ -577,6 +577,7 @@ static __init int siw_init_module(void) > >>>> if (rv) > >>>> goto out_error; > >>>> > >>>> + goto out_error; > >>>> rdma_link_register(&siw_link_ops); > >>>> > >>>> Basically, these issues are double free, use before initalization or > >>>> null pointer dereference. For more details, pls review the individual > >>>> patch. > >>>> > >>>> Thanks, > >>>> Guoqing > >>> Hi Guoqing, > >>> > >>> very good catch, thank you. I was under the wrong assumption a > >>> module is not loaded if the init_module() returns a value. > >> I think that is actually true, isn't it? I'm confused? > > > > Yes, you are right. Since rv is still 0, so the module appears in the > > kernel. Not sure if some tool could inject err like this. Feel free to > > ignore this. > > The below trace happened if I run a stress test with load siw module and > unload siw in a loop, which should be fixed by patch 5, so I think we > need to apply it, what do you think? I see. makes sense -- you are removing the module while some tx threads did not get a chance to init their state. Why don't we better cleanly initialize the tasks state before spawning it? Your extra task parameter 'running' smells like it could lead to a similar issue - at module removal, the task may have not get a chance to set 'running' true, so it is not stopped and finally left alone. Working on an according patch... Thanks, Bernard. > > [ 414.537961] BUG: spinlock bad magic on CPU#0, modprobe/3722 > [ 414.537965] lock: 0xffff9d847bc380e8, .magic: 00000000, .owner: > <none>/-1, .owner_cpu: 0 > [ 414.537969] CPU: 0 PID: 3722 Comm: modprobe Tainted: G OE > 6.5.0-rc3+ #16 > [ 414.537971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014 > [ 414.537973] Call Trace: > [ 414.537973] <TASK> > [ 414.537975] dump_stack_lvl+0x77/0xd0 > [ 414.537979] dump_stack+0x10/0x20 > [ 414.537981] spin_bug+0xa5/0xd0 > [ 414.537984] do_raw_spin_lock+0x90/0xd0 > [ 414.537985] _raw_spin_lock_irqsave+0x56/0x80 > [ 414.537988] ? __wake_up_common_lock+0x63/0xd0 > [ 414.537990] __wake_up_common_lock+0x63/0xd0 > [ 414.537992] __wake_up+0x13/0x30 > [ 414.537994] siw_stop_tx_thread+0x49/0x70 [siw] > [ 414.538000] siw_exit_module+0x30/0x620 [siw] > [ 414.538006] __do_sys_delete_module.constprop.0+0x18f/0x300 > [ 414.538008] ? syscall_enter_from_user_mode+0x21/0x70 > [ 414.538010] ? __this_cpu_preempt_check+0x13/0x20 > [ 414.538012] ? lockdep_hardirqs_on+0x86/0x120 > [ 414.538014] __x64_sys_delete_module+0x12/0x20 > [ 414.538016] do_syscall_64+0x5c/0x90 > [ 414.538019] ? do_syscall_64+0x69/0x90 > [ 414.538020] ? __this_cpu_preempt_check+0x13/0x20 > [ 414.538022] ? lockdep_hardirqs_on+0x86/0x120 > [ 414.538024] ? syscall_exit_to_user_mode+0x37/0x50 > [ 414.538025] ? do_syscall_64+0x69/0x90 > [ 414.538026] ? syscall_exit_to_user_mode+0x37/0x50 > [ 414.538027] ? do_syscall_64+0x69/0x90 > [ 414.538029] ? syscall_exit_to_user_mode+0x37/0x50 > [ 414.538030] ? do_syscall_64+0x69/0x90 > [ 414.538032] ? irqentry_exit_to_user_mode+0x25/0x30 > [ 414.538033] ? irqentry_exit+0x77/0xb0 > [ 414.538034] ? exc_page_fault+0xae/0x240 > [ 414.538036] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 > [ 414.538038] RIP: 0033:0x7f177eb26c9b > > Thanks, > Guoqing