RE: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> -----Original Message-----
> From: Cheng Xu <chengyou@xxxxxxxxxxxxxxxxx>
> Sent: Thursday, 5 May 2022 10:38
> To: Luis Chamberlain <mcgrof@xxxxxxxxxx>
> Cc: Bernard Metzler <BMT@xxxxxxxxxxxxxx>; Bart Van Assche
> <bvanassche@xxxxxxx>; linux-block@xxxxxxxxxxxxxxx; linux-
> rdma@xxxxxxxxxxxxxxx; Pankaj Raghav <pankydev8@xxxxxxxxx>; Pankaj Raghav
> <p.raghav@xxxxxxxxxxx>
> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
> warning while testing blktests srp/002 v5.17-rc7
> 
> 
> 
> On 5/5/22 4:40 AM, Luis Chamberlain wrote:
> > On Thu, Apr 21, 2022 at 10:02:47AM +0800, Cheng Xu wrote:
> >>
> >>
> >> On 4/19/22 11:53 PM, Bernard Metzler wrote:
> >>>> -----Original Message-----
> >>>> From: Cheng Xu <chengyou@xxxxxxxxxxxxxxxxx>
> >>>> Sent: Monday, 18 April 2022 10:29
> >>>> To: Luis Chamberlain <mcgrof@xxxxxxxxxx>; Bernard Metzler
> >>>> <BMT@xxxxxxxxxxxxxx>; Bart Van Assche <bvanassche@xxxxxxx>
> >>>> Cc: linux-block@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; Pankaj
> Raghav
> >>>> <pankydev8@xxxxxxxxx>; Pankaj Raghav <p.raghav@xxxxxxxxxxx>
> >>>> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
> >>>> warning while testing blktests srp/002 v5.17-rc7
> >>>>
> >>>>
> >>>>
> >>>> On 4/15/22 7:31 AM, Luis Chamberlain wrote:
> >>>>
> >>>> <...>
> >>>>
> >>>>> [  195.218783] ------------[ cut here ]------------
> >>>>> [  195.221242] WARNING: CPU: 7 PID: 201 at
> >>>> drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130
> [siw]
> >>>>> [  195.222838] Modules linked in: ib_srp(E) scsi_transport_srp(E)
> >>>> target_core_pscsi(E) target_core_file(E) ib_srpt(E)
> target_core_iblock(E)
> >>>> target_core_mod(E) rdma_cm(E) iw_cm(E) ib_cm(E) scsi_debug(E)
> siw(E)
> >>>> null_blk(E) ib_umad(E) ib_uverbs(E) sd_mod(E) sg(E)
> dm_service_time(E)
> >>>> scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) dm_multipath(E)
> ib_core(E)
> >>>> dm_mod(E) nvme_fabrics(E) kvm_intel(E) kvm(E) irqbypass(E)
> >>>> crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E)
> crypto_simd(E)
> >>>> cryptd(E) joydev(E) evdev(E) serio_raw(E) cirrus(E)
> drm_shmem_helper(E)
> >>>> drm_kms_helper(E) virtio_balloon(E) cec(E) i6300esb(E) button(E)
> drm(E)
> >>>> configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E)
> mbcache(E)
> >>>> jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E)
> zstd_compress(E)
> >>>> libcrc32c(E) crc32c_generic(E) virtio_net(E) net_failover(E)
> failover(E)
> >>>> virtio_blk(E) ata_generic(E) uhci_hcd(E) ehci_hcd(E)
> crc32_pclmul(E)
> >>>> crc32c_intel(E) ata_piix(E) psmouse(E) nvme(E) libata(E)
> virtio_pci(E)
> >>>>> [  195.222986]  virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E)
> >>>> usbcore(E) virtio(E) usb_common(E) scsi_mod(E) nvme_core(E)
> i2c_piix4(E)
> >>>> virtio_ring(E) t10_pi(E) scsi_common(E) [last unloaded: null_blk]
> >>>>> [  195.241036] sd 3:0:0:1: [sdd] Attached SCSI disn
> >>>>> [  195.241188] CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded
> >>>> Tainted: G            E     5.17.0-rc7 #1
> >>>>> [  195.246053] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996),
> >>>> BIOS 1.15.0-1 04/01/2014
> >>>>> [  195.249123] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
> >>>>> [  195.251274] RIP: 0010:siw_cep_put+0x125/0x130 [siw]
> >>>>> [  195.253548] Code: bb c0 e8 ae 74 0f d7 48 89 ef 5d 41 5c 41 5d
> e9 b1 d6 ef
> >>>> d6 5d be 03 00 00 00 41 5c 41 5d e9 22 b7 0c d7 0f 0b e9 f3 fe ff
> ff <0f> 0b e9 1c
> >>>> ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 48 8d 6f 20 53
> >>>>> [  195.258982] RSP: 0018:ffffbc53404ebc98 EFLAGS: 00010286
> >>>>> [  195.261018] RAX: 0000000000000001 RBX: 0000000000000000 RCX:
> >>>> 0000000000000000
> >>>>> [  195.263569] RDX: 0000000000000001 RSI: 0000000000000246 RDI:
> >>>> ffffa03d1102a924
> >>>>> [  195.266151] RBP: ffffa03d1102a900 R08: ffffa03d1102a920 R09:
> >>>> ffffbc53404ebc50
> >>>>> [  195.269150] R10: ffffffff98a060e0 R11: 0000000000000000 R12:
> >>>> ffffa03cc4297000
> >>>>> [  195.272744] R13: ffffa03d2a48aea0 R14: ffffa03d2a48ae78 R15:
> >>>> ffffa03cc427ad58
> >>>>> [  195.275575] FS:  0000000000000000(0000)
> GS:ffffa03df7c80000(0000)
> >>>> knlGS:0000000000000000
> >>>>> [  195.278932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [  195.280963] CR2: 00005590bc2e4fe8 CR3: 000000008500a004 CR4:
> >>>> 0000000000770ee0
> >>>>> [  195.282803] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>>> 0000000000000000
> >>>>> [  195.284650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>>> 0000000000000400
> >>>>> [  195.286522] PKRU: 55555554
> >>>>> [  195.287998] Call Trace:
> >>>>> [  195.289210]  <TASK>
> >>>>> [  195.290969]  siw_reject+0xac/0x180 [siw]
> >>>>> [  195.292679]  iw_cm_reject+0x68/0xc0 [iw_cm]
> >>>>> [  195.294136]  cm_work_handler+0x59d/0xe20 [iw_cm]
> >>>>> [  195.295588]  process_one_work+0x1e2/0x3b0
> >>>>> [  195.298338]  worker_thread+0x50/0x3a0
> >>>>> [  195.300330]  ? rescuer_thread+0x390/0x390
> >>>>> [  195.302269]  kthread+0xe5/0x110
> >>>>> [  195.304062]  ? kthread_complete_and_exit+0x20/0x20
> >>>>> [  195.307612]  ret_from_fork+0x1f/0x30
> >>>>> [  195.309585]  </TASK>
> >>>>> [  195.310674] ---[ end trace 0000000000000000 ]---
> >>>>> [  195.313290] scsi host4: ib_srp: REJ received
> >>>>> [  195.313293] scsi host4:   REJ reason 0xffffff98
> >>>>> [  195.315433] scsi host4: ib_srp: Connection 0/8 to 172.17.8.113
> failed
> >>>>> [  195.472718] ib_srp:srp_parse_in: ib_srp: 172.17.8.113 ->
> 172.17.8.113:0
> >>>>> [  195.472739] ib_srp:srp_parse_in: ib_srp: 172.17.8.113:5555 ->
> >>>> 172.17.8.113:5555
> >>>>> [  195.472807] ib_srp:srp_parse_in: ib_srp:
> [fe80::5054:ff:fe5b:90dc%3] ->
> >>>> [fe80::5054:ff:fe5b:90dc]:0/202442865%3
> >>>>>> [0] INVALID URI REMOVED
> >>>> 3A__github.com_mcgrof_kdevops&d=DwIGaQ&c=jf_iaSHvJObTbx-
> >>>> siA1ZOg&r=2TaYXQ0T-
> >>>> r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=7dWDVPFaNFXoRqokXmPFFy
> >>>> XkVL2yItLNzYUDfM4ULTg&s=1ezv_qa-
> >>>> ujLTftm7OxJ5xNZuoKrc70DJPBDccqZokbY&e=
> >>>>>    >    Luis
> >>>>
> >>>> Hi, Bernard
> >>>>
> >>>> I reproduced this issue, and it looks like a condition race between
> >>>> 'cm_work_handler' and 'siw_cm_work_handler'.
> >>>>
> >>>> ----------------------------------------------------------------
> >>>>     Thread0:                         Thread1:
> >>>>     siw_cm_work_handler              cm_work_handler
> >>>> ----------------------------------------------------------------
> >>>> step0:
> >>>> siw_cm_upcall with
> >>>> IW_CM_EVENT_CONNECT_REQUEST
> >>>>
> >>>>                                ===> cm_conn_req_handler
> >>>>                                       ...
> >>>>                                         cm_id->cm_handler (failed)
> >>>>                                         iw_cm_reject
> >>>>                                              siw_reject
> >>>>
> >>>> *step1*:
> >>>> detach cep with listen_cep
> >>>> ----------------------------------------------------------------
> >>>>
> >>>> When siw_reject is called in cm_work_handler, the related cep may
> have
> >>>> not been detached with its listen_cep, through the two steps are
> very
> >>>> close.
> >>>>
> >>>> I think one simple way to fix this issue is keep step1 under
> >>>> siw_cep_set_inuse's protection, and this will make siw_reject will
> be
> >>>> pending util siw_cm_work_handler release the lock:
> >>>>
> >>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> >>>> b/drivers/infiniband/sw/siw/siw_cm.c
> >>>> index 7acdd3c3a599..f033b6da1e9f 100644
> >>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
> >>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> >>>> @@ -968,13 +968,15 @@ static void siw_accept_newconn(struct siw_cep
> >>>> *cep)
> >>>>
> >>>>                    siw_cep_set_inuse(new_cep);
> >>>>                    rv = siw_proc_mpareq(new_cep);
> >>>> -               siw_cep_set_free(new_cep);
> >>>>
> >>>>                    if (rv != -EAGAIN) {
> >>>>                            siw_cep_put(cep);
> >>>>                            new_cep->listen_cep = NULL;
> >>>> +                       siw_cep_set_free(new_cep);
> >>>>                            if (rv)
> >>>>                                    goto error;
> >>>> +               } else {
> >>>> +                       siw_cep_set_free(new_cep);
> >>>>                    }
> >>>>            }
> >>>>            return;
> >>>>
> >>>> Thanks,
> >>>> Cheng Xu
> >>>
> >>> Hi Cheng, many thanks for looking into it! Unfortunately
> >>> I am out next 12 days until May. I will immediately look into
> >>> it when back. Your explanation sounds reasonable, but I'd
> >>> like to fully understand.
> >> I'd like to send a patch to fix this. When you back, you can review
> this
> >> issue and the patch.
> >>
> >> Was it fixing the issue for you?
> >>
> >> Sure, With this change, the WARN in dmesg does not appear any more in
> >> my tests.
> >>
> >> Thanks,
> >> Cheng Xu
> >
> > *poke*
> >
> > Would be good to get a fix merged. And if a patch is posted does this
> > need to go to stable?
> >
> >    Luis
> 
> The patch has been accepted and merged to for-rc, see:
> 
> INVALID URI REMOVED
> 3A__lore.kernel.org_all_d528d83466c44687f3872eadcb8c184528b2e2d4.1650526
> 554.git.chengyou-40linux.alibaba.com_T_&d=DwICaQ&c=jf_iaSHvJObTbx-
> siA1ZOg&r=2TaYXQ0T-r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=gj2AyKoOM_k9fYF-
> _XQ4HcYw_viOIwl6lDNPHqp7L1y2OiVRWvZkTFGFHSSZInor&s=P_HaXIXt9mBbCeBNBLsWe
> RTz5hvnUGUvObzs8lowzCM&e=
> 
> I think this patch need not be merged back to stable, because the issue
> is not a functional problem, but only produce a WARN in dmesg.
> 
> Thanks,
> Cheng Xu

I agree. It does not fix a memory leak or some such.




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux