> On Jan 30, 2018, at 11:27 AM, Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx> wrote:
> 
>> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
>> Sent: Tuesday, January 30, 2018 12:01 AM
>> 
>>> On Jan 29, 2018, at 9:35 AM, Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx> wrote:
>>> 
>>>> From: linux-rdma-owner@xxxxxxxxxxxxxxx [mailto:linux-rdma-owner@xxxxxxxxxxxxxxx] On Behalf Of Kalderon, Michal
>>>> Sent: Wednesday, January 24, 2018 3:10 PM
>>>> To: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> Cc: linux-rdma@xxxxxxxxxxxxxxx
>>>> Subject: RE: Kernel panic over qedr with
>>>> 
>>>>> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
>>>>> Sent: Tuesday, January 23, 2018 7:14 PM
>>>>> 
>>>>>> On Jan 23, 2018, at 9:05 AM, Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>> Hi Chuck,
>>>>>> 
>>>>>> I'm hitting issues with NFS over qedr (either iWARP or RoCE).
>>>>>> 
>>>>>> I get a kernel panic during mount. It looks like it started with 4.15-rcX.
>>>>>> Looking at the qedr code, it looks like the wr we get may be corrupted.
>>>>>> Below are the stack trace and logs with RPC debug enabled...
>>>>>> 
>>>>>> Can you please advise on how to proceed with debugging?
>>>>> 
>>>>> I don't see anything distinctive in the log output.
>>>>> I would start with a git bisect if your reproducer hits the oops every time.
>>>> 
>>>> On it, thanks.
>>> 
>>> Hi Chuck,
>>> 
>>> Git bisect identified this commit as the one introducing the oops:
>>> 
>>> 01bb35c89d90abe6fd1c0be001f84bbdfa7fa7d1 is the first bad commit
>>> commit 01bb35c89d90abe6fd1c0be001f84bbdfa7fa7d1
>>> Author: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> Date:   Fri Oct 20 10:48:36 2017 -0400
>>> 
>>>     xprtrdma: RPC completion should wait for Send completion
>>> 
>>> I can't see the issue in the patch, but I am lacking some knowledge of the implementation.
>>> Not sure why or whether it helps, but we see the oops consistently with iWARP, and intermittently with RoCE.
>>> Can you please take a look?
>> 
>> I'm guessing here, as object code generated from my kernel source does not seem to align with your RIP (__qedr_post_send+0x85 lands in the middle of an instruction). But you are failing somewhere in qedr_can_post_send, probably qed_chain_get_elem_left_u32.
>> 
>> Can you tell me what the bad WR looks like? Each sendctx is initialized in rpcrdma_sendctx_create, including the Send WR, which is re-used. During operation, xprtrdma in this release overwrites only the send_flags and num_sge fields.
>> 
>> The only thing I can think of is that something is overwriting the sendctx. It could be that the SMP acquire/release logic is not stopping rpcrdma_sendctx_get_locked from returning a sendctx that is still in use. Or the rl_flags field is getting corrupted because I over-optimized the set_bit and clear_bit calls.
> 
> Hi Chuck,
> 
> I found the issue in rpcrdma, but I'm not sure what the proper fix is.
> The problem was there before this patch, but adding an additional field to rpcrdma_sendctx exposed the bug.
> 
> In rpcrdma_sendctx_create you allocate sizeof(*sc) + ri_max_send_sges * sizeof(struct ib_sge).
> ri_max_send_sges is equal to 1 (max_sge - RPCRDMA_MIN_SEND_SGES); our device supports a max_sge of 4.
> In rpcrdma_prepare_msg_sges you access sge[1] -> this corrupted the memory of the following sendctx.
> 
> Specifically, it wrote the length value to the next ptr of the send_wr. In our code we check whether wr->next != NULL and, if so, process it as a wr, leading to the NULL ptr dereference: __qedr_post_send was in qedr_can_post_send, in the line that accesses wr->
> 
> The reason this patch exposed the bug: the size of the struct before the patch was 64 bytes; sizeof(struct ib_sge) is 16, and the OS padded the allocation to 96 bytes overall, so accessing sge[1] fell in the padding area. Once the req ptr was added, the struct size increased to 72 bytes, and the padding was no longer sufficient for sge[1].

This sounds quite plausible, I'll have a look.
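To make sure we're reading the mechanism the same way, here is a minimal userspace sketch of the pattern (struct layouts abbreviated from xprt_rdma.h; the concrete sizes here are illustrative, not the kernel's):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Abbreviated stand-ins for the kernel structures. */
struct ib_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

struct ib_send_wr {
	struct ib_send_wr *next;	/* first field: this is what got hit */
	/* ... */
};

struct rpcrdma_sendctx {
	struct ib_send_wr sc_wr;
	/* ... other fields; 01bb35c89d90 added the sc_req pointer here */
	struct ib_sge sc_sges[];	/* ri_max_send_sges elements */
};

int main(void)
{
	unsigned int max_sge = 4;			/* device limit */
	unsigned int ri_max_send_sges = max_sge - 3;	/* RPCRDMA_MIN_SEND_SGES is 3, so 1 */
	struct rpcrdma_sendctx *sc;

	/* rpcrdma_sendctx_create() sizes the buffer for one SGE only ... */
	sc = calloc(1, sizeof(*sc) + ri_max_send_sges * sizeof(struct ib_sge));
	if (!sc)
		return 1;

	/* ... but rpcrdma_prepare_msg_sges() fills sge[0] (transport header)
	 * and sge[1] (RPC message) for a two-SGE Send.  sge[1] is past the
	 * end of the allocation; whether that lands in allocator padding or
	 * in the neighbouring object depends on how the request was rounded. */
	sc->sc_sges[0].length = 28;
	sc->sc_sges[1].length = 164;	/* out-of-bounds write */

	printf("allocated %zu bytes, but sge[1] ends at offset %zu\n",
	       sizeof(*sc) + ri_max_send_sges * sizeof(struct ib_sge),
	       sizeof(*sc) + 2 * sizeof(struct ib_sge));
	free(sc);
	return 0;
}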
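The slab arithmetic also works out, assuming the 64- and 72-byte struct sizes you measured and the 16-byte struct ib_sge (both requests round up to the 96-byte kmalloc slab):

  before 01bb35c89d90:  request = 64 + 1 * 16 = 80 bytes
                        sge[0] = bytes 64..79  (inside the request)
                        sge[1] = bytes 80..95  (past the request, but still
                                 inside the 96-byte slab object: padding)

  after  01bb35c89d90:  request = 72 + 1 * 16 = 88 bytes
                        sge[0] = bytes 72..87
                        sge[1] = bytes 88..103: bytes 96..103 spill into the
                                 next slab object, and sge[1].length / .lkey
                                 land on its first eight bytes, which for an
                                 adjacent sendctx is exactly sc_wr.next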
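And that matches the crash signature: the qedr post path walks the WR chain, so a stray non-NULL next looks like one more WR to process. A self-contained paraphrase of the shape of that loop (post_one_wr and post_send_chain are stand-in names, not the driver's):

/* Minimal stand-ins so the shape of the loop is self-contained;
 * the real definitions live in include/rdma/ib_verbs.h. */
struct ib_send_wr { struct ib_send_wr *next; int num_sge; };
struct ib_qp { int unused; };

/* Stand-in for qedr's per-WR worker (__qedr_post_send in the driver),
 * which calls qedr_can_post_send() and reads fields of *wr: with a
 * corrupted wr pointer, that read is the fault at offset 0x40. */
static int post_one_wr(struct ib_qp *ibqp, struct ib_send_wr *wr,
		       struct ib_send_wr **bad_wr)
{
	(void)ibqp;
	if (wr->num_sge < 0) {		/* placeholder validity check */
		*bad_wr = wr;
		return -1;
	}
	return 0;
}

/* Paraphrased shape of qedr_post_send(): every non-NULL next is
 * treated as another WR. */
int post_send_chain(struct ib_qp *ibqp, struct ib_send_wr *wr,
		    struct ib_send_wr **bad_wr)
{
	int rc = 0;

	while (wr) {
		rc = post_one_wr(ibqp, wr, bad_wr);
		if (rc)
			break;
		wr = wr->next;	/* follows the corrupted pointer */
	}
	return rc;
}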
> I guess a device that supports more than 4 SGEs is not exposed to this.
> By just increasing the sge array size I was able to work around the problem, but I'm not sure what the proper fix should be with regard to calculating how many sges should be allocated, and why ri_max_send_sges is set to (max_sge - RPCRDMA_MIN_SEND_SGES).
> 
> Thanks,
> Michal
> 
>> 
>>> Thanks,
>>> Michal
>>> 
>>>>> 
>>>>>> Thanks,
>>>>>> Michal
>>>>>> 
>>>>>> [ 782.951762] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
>>>>>> [ 782.952286] IP: __qedr_post_send+0x85/0x127c [qedr]
>>>>>> [ 782.952797] PGD 0 P4D 0
>>>>>> [ 782.953293] Oops: 0000 [#1] SMP PTI
>>>>>> [ 782.953781] Modules linked in: qedr(E) qede qed rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bonding rpcrdma ib_isert iscsi_target_mod ib_iser ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd iTCO_wdt gpio_ich iTCO_vendor_support ipmi_si pcspkr sg lpc_ich i2c_i801 ipmi_devintf ipmi_msghandler ioatdma i7core_edac shpchp
>>>>>> [ 782.957637] acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom ata_generic pata_acpi mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb ata_piix ptp libata pps_core e1000 crc32c_intel dca i2c_algo_bit i2c_core
>>>>>> [ 782.959572] CPU: 7 PID: 3422 Comm: mount.nfs Tainted: G E 4.15.0-rc9+ #2
>>>>>> [ 782.960232] Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.00.0059.082320111421 08/23/2011
>>>>>> [ 782.960901] RIP: 0010:__qedr_post_send+0x85/0x127c [qedr]
>>>>>> [ 782.961575] RSP: 0018:ffffab3d8259b6f8 EFLAGS: 00010082
>>>>>> [ 782.962244] RAX: 0000000000000025 RBX: 0000000000000028 RCX: 0000000000000006
>>>>>> [ 782.962907] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff9b275fd968f0
>>>>>> [ 782.963573] RBP: ffffab3d8259b780 R08: 00000000000004cc R09: 0000000000000000
>>>>>> [ 782.964242] R10: ffff9b2753218540 R11: ffff9b25d9d67400 R12: ffff9b25d60aa000
>>>>>> [ 782.964894] R13: 0000000000000028 R14: 0000000000000000 R15: ffff9b259400e400
>>>>>> [ 782.965557] FS: 00007f91dad3a880(0000) GS:ffff9b275fd80000(0000) knlGS:0000000000000000
>>>>>> [ 782.966235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [ 782.966906] CR2: 0000000000000040 CR3: 0000000393dbc003 CR4: 00000000000206e0
>>>>>> [ 782.967593] Call Trace:
>>>>>> [ 782.968271] qedr_post_send+0x12e/0x180 [qedr]
>>>>>> [ 782.968950] rpcrdma_ep_post+0x83/0xf0 [rpcrdma]
>>>>>> [ 782.969637] xprt_rdma_send_request+0x84/0xe0 [rpcrdma]
>>>>>> [ 782.970349] xprt_transmit+0x6c/0x370 [sunrpc]
>>>>>> [ 782.971057] ? call_decode+0x820/0x820 [sunrpc]
>>>>>> [ 782.971759] ? call_decode+0x820/0x820 [sunrpc]
>>>>>> [ 782.972464] call_transmit+0x194/0x280 [sunrpc]
>>>>>> [ 782.973174] __rpc_execute+0x7e/0x3f0 [sunrpc]
>>>>>> [ 782.973867] rpc_run_task+0x106/0x150 [sunrpc]
>>>>>> [ 782.974577] nfs4_proc_setclientid+0x213/0x380 [nfsv4]
>>>>>> [ 782.975279] nfs40_discover_server_trunking+0x80/0xe0 [nfsv4]
>>>>>> [ 782.975982] nfs4_discover_server_trunking+0x78/0x2b0 [nfsv4]
>>>>>> [ 782.976688] nfs4_init_client+0x11b/0x260 [nfsv4]
>>>>>> [ 782.977402] ? __rpc_init_priority_wait_queue+0x83/0xb0 [sunrpc]
>>>>>> [ 782.978129] ? nfs4_alloc_client+0x15f/0x200 [nfsv4]
>>>>>> [ 782.978853] ? nfs_get_client+0x2c1/0x360 [nfs]
>>>>>> [ 782.979583] ? pcpu_alloc_area+0xc0/0x130
>>>>>> [ 782.980317] nfs4_set_client+0x9d/0xe0 [nfsv4]
>>>>>> 
>>>>>> RPC logs:
>>>>>> [ 782.948548] RPC: 209 xmit complete
>>>>>> [ 782.948549] RPC: 209 sleep_on(queue "xprt_pending" time 4295450237)
>>>>>> [ 782.948550] RPC: 209 added to queue 000000006f382a4e "xprt_pending"
>>>>>> [ 782.948550] RPC: 209 setting alarm for 60000 ms
>>>>>> [ 782.948552] RPC: 209 sync task going to sleep
>>>>>> [ 782.948808] RPC: rpcrdma_wc_receive: rep 0000000067b47231 opcode 'recv', length 52: success
>>>>>> [ 782.948811] RPC: rpcrdma_reply_handler: incoming rep 0000000067b47231
>>>>>> [ 782.948814] RPC: rpcrdma_reply_handler: reply 0000000067b47231 completes request 0000000096e058c2 (xid 0x4e366e3c)
>>>>>> [ 782.948869] RPC: rpcrdma_inline_fixup: srcp 0x00000000a5e2c17f len 24 hdrlen 24
>>>>>> [ 782.948871] RPC: wake_up_first(0000000004dd7309 "xprt_sending")
>>>>>> [ 782.948874] RPC: 209 xid 4e366e3c complete (24 bytes received)
>>>>>> [ 782.948875] RPC: 209 __rpc_wake_up_task (now 4295450237)
>>>>>> [ 782.948876] RPC: 209 disabling timer
>>>>>> [ 782.948877] RPC: 209 removed from queue 000000006f382a4e "xprt_pending"
>>>>>> [ 782.948881] RPC: __rpc_wake_up_task done
>>>>>> [ 782.948916] RPC: 209 sync task resuming
>>>>>> [ 782.948918] RPC: 209 call_status (status 24)
>>>>>> [ 782.948919] RPC: 209 call_decode (status 24)
>>>>>> [ 782.948921] RPC: 209 validating NULL cred 000000000941bc29
>>>>>> [ 782.948923] RPC: 209 using AUTH_NULL cred 000000000941bc29 to unwrap rpc data
>>>>>> [ 782.948925] RPC: 209 call_decode result 0
>>>>>> [ 782.948926] RPC: 209 return 0, status 0
>>>>>> [ 782.948927] RPC: 209 release task
>>>>>> [ 782.948929] RPC: xprt_rdma_free: called on 0x0000000067b47231
>>>>>> [ 782.948963] RPC: 209 release request 00000000088290f9
>>>>>> [ 782.948964] RPC: wake_up_first(0000000066fcc180 "xprt_backlog")
>>>>>> [ 782.948966] RPC: rpc_release_client(00000000d13a4ef8)
>>>>>> [ 782.948968] RPC: 209 freeing task
>>>>>> [ 782.949065] svc: svc_destroy(NFSv4 callback, 2)
>>>>>> [ 782.949069] RPC: new task initialized, procpid 3422
>>>>>> [ 782.949070] RPC: allocated task 00000000407c4b9f
>>>>>> [ 782.949071] RPC: 210 __rpc_execute flags=0x5280
>>>>>> [ 782.949072] RPC: 210 call_start nfs4 proc SETCLIENTID (sync)
>>>>>> [ 782.949073] RPC: 210 call_reserve (status 0)
>>>>>> [ 782.949074] RPC: 210 reserved req 00000000088290f9 xid 4f366e3c
>>>>>> [ 782.949076] RPC: 210 call_reserveresult (status 0)
>>>>>> [ 782.949077] RPC: 210 call_refresh (status 0)
>>>>>> [ 782.949078] RPC: 210 refreshing UNIX cred 000000005a6be29a
>>>>>> [ 782.949079] RPC: 210 call_refreshresult (status 0)
>>>>>> [ 782.949080] RPC: 210 call_allocate (status 0)
>>>>>> [ 782.949081] RPC: 210 xprt_rdma_allocate: send size = 1456, recv size = 276, req = 00000000f056c1ce
>>>>>> [ 782.949082] RPC: 210 call_bind (status 0)
>>>>>> [ 782.949083] RPC: 210 call_connect xprt 00000000e40824fb is connected
>>>>>> [ 782.949083] RPC: 210 call_transmit (status 0)
>>>>>> [ 782.949084] RPC: 210 xprt_prepare_transmit
>>>>>> [ 782.949085] RPC: 210 xprt_cwnd_limited cong = 0 cwnd = 8192
>>>>>> [ 782.949085] RPC: 210 rpc_xdr_encode (status 0)
>>>>>> [ 782.949086] RPC: 210 marshaling UNIX cred 000000005a6be29a
>>>>>> [ 782.949088] RPC: 210 using AUTH_UNIX cred 000000005a6be29a to wrap rpc data
>>>>>> [ 782.949090] RPC: 210 xprt_transmit(192)
>>>>>> [ 782.949091] RPC: 210 rpcrdma_marshal_req: inline/inline: hdrlen 28 rpclen
>>>>>> [ 782.949093] RPC: rpcrdma_ep_post: posting 2 s/g entries
>>>>> 
>>>>> --
>>>>> Chuck Lever
>> 
>> --
>> Chuck Lever

--
Chuck Lever