> -----Original Message----- > From: Cheng Xu <chengyou@xxxxxxxxxxxxxxxxx> > Sent: Thursday, 5 May 2022 10:38 > To: Luis Chamberlain <mcgrof@xxxxxxxxxx> > Cc: Bernard Metzler <BMT@xxxxxxxxxxxxxx>; Bart Van Assche > <bvanassche@xxxxxxx>; linux-block@xxxxxxxxxxxxxxx; linux- > rdma@xxxxxxxxxxxxxxx; Pankaj Raghav <pankydev8@xxxxxxxxx>; Pankaj Raghav > <p.raghav@xxxxxxxxxxx> > Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel > warning while testing blktests srp/002 v5.17-rc7 > > > > On 5/5/22 4:40 AM, Luis Chamberlain wrote: > > On Thu, Apr 21, 2022 at 10:02:47AM +0800, Cheng Xu wrote: > >> > >> > >> On 4/19/22 11:53 PM, Bernard Metzler wrote: > >>>> -----Original Message----- > >>>> From: Cheng Xu <chengyou@xxxxxxxxxxxxxxxxx> > >>>> Sent: Monday, 18 April 2022 10:29 > >>>> To: Luis Chamberlain <mcgrof@xxxxxxxxxx>; Bernard Metzler > >>>> <BMT@xxxxxxxxxxxxxx>; Bart Van Assche <bvanassche@xxxxxxx> > >>>> Cc: linux-block@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; Pankaj > Raghav > >>>> <pankydev8@xxxxxxxxx>; Pankaj Raghav <p.raghav@xxxxxxxxxxx> > >>>> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel > >>>> warning while testing blktests srp/002 v5.17-rc7 > >>>> > >>>> > >>>> > >>>> On 4/15/22 7:31 AM, Luis Chamberlain wrote: > >>>> > >>>> <...> > >>>> > >>>>> [ 195.218783] ------------[ cut here ]------------ > >>>>> [ 195.221242] WARNING: CPU: 7 PID: 201 at > >>>> drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130 > [siw] > >>>>> [ 195.222838] Modules linked in: ib_srp(E) scsi_transport_srp(E) > >>>> target_core_pscsi(E) target_core_file(E) ib_srpt(E) > target_core_iblock(E) > >>>> target_core_mod(E) rdma_cm(E) iw_cm(E) ib_cm(E) scsi_debug(E) > siw(E) > >>>> null_blk(E) ib_umad(E) ib_uverbs(E) sd_mod(E) sg(E) > dm_service_time(E) > >>>> scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) dm_multipath(E) > ib_core(E) > >>>> dm_mod(E) nvme_fabrics(E) kvm_intel(E) kvm(E) irqbypass(E) > >>>> crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) > crypto_simd(E) > >>>> cryptd(E) joydev(E) evdev(E) serio_raw(E) cirrus(E) > drm_shmem_helper(E) > >>>> drm_kms_helper(E) virtio_balloon(E) cec(E) i6300esb(E) button(E) > drm(E) > >>>> configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) > mbcache(E) > >>>> jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) > zstd_compress(E) > >>>> libcrc32c(E) crc32c_generic(E) virtio_net(E) net_failover(E) > failover(E) > >>>> virtio_blk(E) ata_generic(E) uhci_hcd(E) ehci_hcd(E) > crc32_pclmul(E) > >>>> crc32c_intel(E) ata_piix(E) psmouse(E) nvme(E) libata(E) > virtio_pci(E) > >>>>> [ 195.222986] virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E) > >>>> usbcore(E) virtio(E) usb_common(E) scsi_mod(E) nvme_core(E) > i2c_piix4(E) > >>>> virtio_ring(E) t10_pi(E) scsi_common(E) [last unloaded: null_blk] > >>>>> [ 195.241036] sd 3:0:0:1: [sdd] Attached SCSI disn > >>>>> [ 195.241188] CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded > >>>> Tainted: G E 5.17.0-rc7 #1 > >>>>> [ 195.246053] Hardware name: QEMU Standard PC (i440FX + PIIX, > 1996), > >>>> BIOS 1.15.0-1 04/01/2014 > >>>>> [ 195.249123] Workqueue: iw_cm_wq cm_work_handler [iw_cm] > >>>>> [ 195.251274] RIP: 0010:siw_cep_put+0x125/0x130 [siw] > >>>>> [ 195.253548] Code: bb c0 e8 ae 74 0f d7 48 89 ef 5d 41 5c 41 5d > e9 b1 d6 ef > >>>> d6 5d be 03 00 00 00 41 5c 41 5d e9 22 b7 0c d7 0f 0b e9 f3 fe ff > ff <0f> 0b e9 1c > >>>> ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 48 8d 6f 20 53 > >>>>> [ 195.258982] RSP: 0018:ffffbc53404ebc98 EFLAGS: 00010286 > >>>>> [ 195.261018] RAX: 0000000000000001 RBX: 0000000000000000 RCX: > >>>> 0000000000000000 > >>>>> [ 195.263569] RDX: 0000000000000001 RSI: 0000000000000246 RDI: > >>>> ffffa03d1102a924 > >>>>> [ 195.266151] RBP: ffffa03d1102a900 R08: ffffa03d1102a920 R09: > >>>> ffffbc53404ebc50 > >>>>> [ 195.269150] R10: ffffffff98a060e0 R11: 0000000000000000 R12: > >>>> ffffa03cc4297000 > >>>>> [ 195.272744] R13: ffffa03d2a48aea0 R14: ffffa03d2a48ae78 R15: > >>>> ffffa03cc427ad58 > >>>>> [ 195.275575] FS: 0000000000000000(0000) > GS:ffffa03df7c80000(0000) > >>>> knlGS:0000000000000000 > >>>>> [ 195.278932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>>>> [ 195.280963] CR2: 00005590bc2e4fe8 CR3: 000000008500a004 CR4: > >>>> 0000000000770ee0 > >>>>> [ 195.282803] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > >>>> 0000000000000000 > >>>>> [ 195.284650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > >>>> 0000000000000400 > >>>>> [ 195.286522] PKRU: 55555554 > >>>>> [ 195.287998] Call Trace: > >>>>> [ 195.289210] <TASK> > >>>>> [ 195.290969] siw_reject+0xac/0x180 [siw] > >>>>> [ 195.292679] iw_cm_reject+0x68/0xc0 [iw_cm] > >>>>> [ 195.294136] cm_work_handler+0x59d/0xe20 [iw_cm] > >>>>> [ 195.295588] process_one_work+0x1e2/0x3b0 > >>>>> [ 195.298338] worker_thread+0x50/0x3a0 > >>>>> [ 195.300330] ? rescuer_thread+0x390/0x390 > >>>>> [ 195.302269] kthread+0xe5/0x110 > >>>>> [ 195.304062] ? kthread_complete_and_exit+0x20/0x20 > >>>>> [ 195.307612] ret_from_fork+0x1f/0x30 > >>>>> [ 195.309585] </TASK> > >>>>> [ 195.310674] ---[ end trace 0000000000000000 ]--- > >>>>> [ 195.313290] scsi host4: ib_srp: REJ received > >>>>> [ 195.313293] scsi host4: REJ reason 0xffffff98 > >>>>> [ 195.315433] scsi host4: ib_srp: Connection 0/8 to 172.17.8.113 > failed > >>>>> [ 195.472718] ib_srp:srp_parse_in: ib_srp: 172.17.8.113 -> > 172.17.8.113:0 > >>>>> [ 195.472739] ib_srp:srp_parse_in: ib_srp: 172.17.8.113:5555 -> > >>>> 172.17.8.113:5555 > >>>>> [ 195.472807] ib_srp:srp_parse_in: ib_srp: > [fe80::5054:ff:fe5b:90dc%3] -> > >>>> [fe80::5054:ff:fe5b:90dc]:0/202442865%3 > >>>>>> [0] INVALID URI REMOVED > >>>> 3A__github.com_mcgrof_kdevops&d=DwIGaQ&c=jf_iaSHvJObTbx- > >>>> siA1ZOg&r=2TaYXQ0T- > >>>> r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=7dWDVPFaNFXoRqokXmPFFy > >>>> XkVL2yItLNzYUDfM4ULTg&s=1ezv_qa- > >>>> ujLTftm7OxJ5xNZuoKrc70DJPBDccqZokbY&e= > >>>>> > Luis > >>>> > >>>> Hi, Bernard > >>>> > >>>> I reproduced this issue, and it looks like a condition race between > >>>> 'cm_work_handler' and 'siw_cm_work_handler'. > >>>> > >>>> ---------------------------------------------------------------- > >>>> Thread0: Thread1: > >>>> siw_cm_work_handler cm_work_handler > >>>> ---------------------------------------------------------------- > >>>> step0: > >>>> siw_cm_upcall with > >>>> IW_CM_EVENT_CONNECT_REQUEST > >>>> > >>>> ===> cm_conn_req_handler > >>>> ... > >>>> cm_id->cm_handler (failed) > >>>> iw_cm_reject > >>>> siw_reject > >>>> > >>>> *step1*: > >>>> detach cep with listen_cep > >>>> ---------------------------------------------------------------- > >>>> > >>>> When siw_reject is called in cm_work_handler, the related cep may > have > >>>> not been detached with its listen_cep, through the two steps are > very > >>>> close. > >>>> > >>>> I think one simple way to fix this issue is keep step1 under > >>>> siw_cep_set_inuse's protection, and this will make siw_reject will > be > >>>> pending util siw_cm_work_handler release the lock: > >>>> > >>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c > >>>> b/drivers/infiniband/sw/siw/siw_cm.c > >>>> index 7acdd3c3a599..f033b6da1e9f 100644 > >>>> --- a/drivers/infiniband/sw/siw/siw_cm.c > >>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c > >>>> @@ -968,13 +968,15 @@ static void siw_accept_newconn(struct siw_cep > >>>> *cep) > >>>> > >>>> siw_cep_set_inuse(new_cep); > >>>> rv = siw_proc_mpareq(new_cep); > >>>> - siw_cep_set_free(new_cep); > >>>> > >>>> if (rv != -EAGAIN) { > >>>> siw_cep_put(cep); > >>>> new_cep->listen_cep = NULL; > >>>> + siw_cep_set_free(new_cep); > >>>> if (rv) > >>>> goto error; > >>>> + } else { > >>>> + siw_cep_set_free(new_cep); > >>>> } > >>>> } > >>>> return; > >>>> > >>>> Thanks, > >>>> Cheng Xu > >>> > >>> Hi Cheng, many thanks for looking into it! Unfortunately > >>> I am out next 12 days until May. I will immediately look into > >>> it when back. Your explanation sounds reasonable, but I'd > >>> like to fully understand. > >> I'd like to send a patch to fix this. When you back, you can review > this > >> issue and the patch. > >> > >> Was it fixing the issue for you? > >> > >> Sure, With this change, the WARN in dmesg does not appear any more in > >> my tests. > >> > >> Thanks, > >> Cheng Xu > > > > *poke* > > > > Would be good to get a fix merged. And if a patch is posted does this > > need to go to stable? > > > > Luis > > The patch has been accepted and merged to for-rc, see: > > INVALID URI REMOVED > 3A__lore.kernel.org_all_d528d83466c44687f3872eadcb8c184528b2e2d4.1650526 > 554.git.chengyou-40linux.alibaba.com_T_&d=DwICaQ&c=jf_iaSHvJObTbx- > siA1ZOg&r=2TaYXQ0T-r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=gj2AyKoOM_k9fYF- > _XQ4HcYw_viOIwl6lDNPHqp7L1y2OiVRWvZkTFGFHSSZInor&s=P_HaXIXt9mBbCeBNBLsWe > RTz5hvnUGUvObzs8lowzCM&e= > > I think this patch need not be merged back to stable, because the issue > is not a functional problem, but only produce a WARN in dmesg. > > Thanks, > Cheng Xu I agree. It does not fix a memory leak or some such.