On Fri, 2018-07-27 at 09:21 -0400, Laurence Oberman wrote:
> On Fri, 2018-07-27 at 08:05 -0400, Laurence Oberman wrote:
> > On Thu, 2018-07-26 at 16:02 -0400, Laurence Oberman wrote:
> > > On Thu, 2018-07-26 at 10:28 -0400, Don Dutile wrote:
> > > > On 07/26/2018 08:48 AM, Laurence Oberman wrote:
> > > > > Hello
> > > > >
> > > > > https://www.spinics.net/lists/linux-rdma/msg51334.html
> > > > >
> > > > > A RHEL 7.5 kernel with backports from upstream is hitting this.
> > > > > Chuck reported it, and Sagi and Max responded, but it's not clear
> > > > > if we ever fixed this.
> > > >
> > > > RHEL-7.5 data point:
> > > > -- drivers/infiniband/* -r is backported to v4.14,
> > > > i.e., includes the patch(es) mentioned in the above thread.
> > > >
> > > > Laurence:
> > > > Please test with the 7.6 kernel & report back.
> > > > If that passes, RH can bisect the bug fix between v4.14 & v4.16
> > > > (the 7.6 update point for its rdma kernel core) and backport it to
> > > > 7.5-zstream. Note: you'll have to update the rdma-core pkg to the
> > > > 7.6 version as well.
> > > > All functional & bug fix patches to mlx* (ib & enet) are in as well
> > > > (same kernel references).
> > > >
> > > > -dd
> > > >
> > > > > In this case we land up in a panic, not just messages, although
> > > > > the messages were logged over and over for a long time until we
> > > > > finally panicked.
> > > > > > > > > > crash> log | grep "memreg failure: memor" | wc -l > > > > > 2414 > > > > > > > > > > crash> log > > > > > [1635578.012721] connection16:0: detected conn error (1011) > > > > > [1635587.050688] mlx5_0:dump_cqe:262:(pid 93128): dump error > > > > > cqe > > > > > [1635587.089686] 00000000 00000000 00000000 00000000 > > > > > [1635587.123989] 00000000 00000000 00000000 00000000 > > > > > [1635587.157494] 00000000 00000000 00000000 00000000 > > > > > [1635587.190968] 00000000 08007806 250002ad ba6115d3 > > > > > > > > > > [1635587.224331] iser: iser_err_comp: memreg failure: memory > > > > > management > > > > > operation error (6) vend_err 78 > > > > > [1635587.278876] connection15:0: detected conn error (1011) > > > > > [1635590.986286] mlx5_1:dump_cqe:262:(pid 0): dump error cqe > > > > > [1635591.021891] 00000000 00000000 00000000 00000000 > > > > > [1635591.053944] 00000000 00000000 00000000 00000000 > > > > > > > > > > [1657077.997960] BUG: unable to handle kernel NULL pointer > > > > > dereference > > > > > at 0000000000000010 > > > > > [1657077.997967] IP: [<ffffffffc08a541e>] > > > > > iscsi_verify_itt+0x1e/0x110 > > > > > [libiscsi] > > > > > [1657077.997970] PGD 80000098de387067 PUD b8d9ffa067 PMD 0 > > > > > [1657077.997971] Oops: 0000 [#1] SMP > > > > > [1657077.998009] Modules linked in: oracleasm(O) nfsv3 > > > > > rpcsec_gss_krb5 > > > > > nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma > > > > > ib_isert > > > > > iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi > > > > > ib_srpt > > > > > target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib > > > > > rdma_ucm > > > > > ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core > > > > > vfat > > > > > fat > > > > > xfs sb_edac edac_core intel_powerclamp coretemp intel_rapl > > > > > iosf_mbi > > > > > kvm_intel kvm irqbypass iTCO_wdt crc32_pclmul ipmi_ssif > > > > > iTCO_vendor_support ghash_clmulni_intel aesni_intel lrw > > > > > gf128mul > > > > > 
ipmi_si glue_helper ablk_helper cryptd sg hpwdt hpilo pcspkr > > > > > ipmi_devintf ioatdma dm_multipath i2c_i801 lpc_ich shpchp dca > > > > > wmi > > > > > ipmi_msghandler pcc_cpufreq acpi_power_meter nfsd binfmt_misc > > > > > auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache > > > > > jbd2 > > > > > sd_mod crc_t10dif crct10dif_generic > > > > > [1657077.998020] i2c_algo_bit drm_kms_helper syscopyarea > > > > > sysfillrect > > > > > sysimgblt fb_sys_fops ttm bnx2x mlx5_core crct10dif_pclmul > > > > > mdio > > > > > tg3(OE) > > > > > devlink libcrc32c crct10dif_common drm hpsa(OE) ptp i2c_core > > > > > crc32c_intel scsi_transport_sas pps_core dm_mirror > > > > > dm_region_hash > > > > > dm_log dm_mod > > > > > [1657077.998023] CPU: 20 PID: 41538 Comm: sh Tainted: > > > > > G OE - > > > > > ----------- 3.10.0-693.34.1.el7_bz1582551.x86_64 #1 > > > > > [1657077.998024] Hardware name: HP ProLiant DL380 > > > > > Gen9/ProLiant > > > > > DL380 > > > > > Gen9, BIOS P89 05/21/2018 > > > > > [1657077.998025] task: ffff88587ce38fd0 ti: ffff884dd0af0000 > > > > > task.ti: > > > > > ffff884dd0af0000 > > > > > [1657077.998029] RIP: > > > > > 0010:[<ffffffffc08a541e>] [<ffffffffc08a541e>] > > > > > iscsi_verify_itt+0x1e/0x110 [libiscsi] > > > > > [1657077.998030] RSP: 0000:ffff88beff403d78 EFLAGS: 00010286 > > > > > [1657077.998031] RAX: 000000000000004c RBX: 00000000b0000036 > > > > > RCX: > > > > > 0000000000000002 > > > > > [1657077.998032] RDX: 00000000000000cc RSI: 00000000b0000036 > > > > > RDI: > > > > > 0000000000000000 > > > > > [1657077.998033] RBP: ffff88beff403da0 R08: 0000000040032a20 > > > > > R09: > > > > > ffff8896e4eaf91c > > > > > [1657077.998034] R10: 0000000000000000 R11: 00007ffff7763ca0 > > > > > R12: > > > > > 0000000000000000 > > > > > [1657077.998035] R13: ffff8896e4eaf9e4 R14: ffff8896e4eaf900 > > > > > R15: > > > > > 0000000000000000 > > > > > [1657077.998036] FS: 00007ffff7fe6740(0000) > > > > > GS:ffff88beff400000(0000) > > > > > 
knlGS:0000000000000000 > > > > > [1657077.998038] CS: 0010 DS: 0000 ES: 0000 CR0: > > > > > 0000000080050033 > > > > > [1657077.998039] CR2: 0000000000000010 CR3: 000000ad92eba000 > > > > > CR4: > > > > > 00000000003607e0 > > > > > [1657077.998040] DR0: 0000000000000000 DR1: 0000000000000000 > > > > > DR2: > > > > > 0000000000000000 > > > > > [1657077.998041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 > > > > > DR7: > > > > > 0000000000000400 > > > > > [1657077.998042] Call Trace: > > > > > [1657077.998044] <IRQ> > > > > > [1657077.998046] [<ffffffffc08a5527>] > > > > > iscsi_itt_to_ctask+0x17/0x80 > > > > > [libiscsi] > > > > > [1657077.998050] [<ffffffffc05eefea>] > > > > > iser_task_rsp+0xca/0x360 > > > > > [ib_iser] > > > > > [1657077.998061] [<ffffffffc0587fbb>] > > > > > __ib_process_cq+0x6b/0xe0 > > > > > [ib_core] > > > > > [1657077.998066] [<ffffffffc0588122>] > > > > > ib_poll_handler+0x22/0x80 > > > > > [ib_core] > > > > > [1657077.998070] [<ffffffff81358507>] > > > > > irq_poll_softirq+0xc7/0x100 > > > > > [1657077.998076] [<ffffffff81095195>] > > > > > __do_softirq+0xf5/0x280 > > > > > [1657077.998081] [<ffffffff816c4e8c>] call_softirq+0x1c/0x30 > > > > > [1657077.998086] [<ffffffff8102d435>] do_softirq+0x65/0xa0 > > > > > [1657077.998088] [<ffffffff81095515>] irq_exit+0x105/0x110 > > > > > [1657077.998091] [<ffffffff816c61d6>] do_IRQ+0x56/0xf0 > > > > > [1657077.998098] [<ffffffff816b837c>] > > > > > common_interrupt+0x17c/0x17c > > > > > [1657077.998099] <EOI> > > > > > [1657077.998113] Code: ff ff ff eb a9 41 be 95 ff ff ff eb a1 > > > > > 0f > > > > > 1f > > > > > 44 > > > > > 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 89 f3 48 83 ec 10 > > > > > c7 > > > > > 45 > > > > > d8 00 > > > > > 00 00 00 <4c> 8b 6f 10 65 48 8b 04 25 28 00 00 00 48 89 45 e0 > > > > > 31 > > > > > c0 > > > > > 83 > > > > > fe > > > > > [1657077.998116] RIP [<ffffffffc08a541e>] > > > > > iscsi_verify_itt+0x1e/0x110 > > > > > [libiscsi] > > > > > [1657077.998116] RSP 
<ffff88beff403d78> > > > > > [1657077.998117] CR2: 0000000000000010 > > > > > crash> > > > > > > > > > > crash> bt > > > > > PID: 41538 TASK: ffff88587ce38fd0 CPU: 20 COMMAND: "sh" > > > > > #0 [ffff88beff403a18] machine_kexec at ffffffff8105ddeb > > > > > #1 [ffff88beff403a78] __crash_kexec at ffffffff81109902 > > > > > #2 [ffff88beff403b48] crash_kexec at ffffffff811099f0 > > > > > #3 [ffff88beff403b60] oops_end at ffffffff816b97a8 > > > > > #4 [ffff88beff403b88] no_context at ffffffff816a8c96 > > > > > #5 [ffff88beff403bd8] __bad_area_nosemaphore at > > > > > ffffffff816a8d2c > > > > > #6 [ffff88beff403c20] bad_area_nosemaphore at > > > > > ffffffff816a8e96 > > > > > #7 [ffff88beff403c30] __do_page_fault at ffffffff816bc6be > > > > > #8 [ffff88beff403c90] do_page_fault at ffffffff816bc865 > > > > > #9 [ffff88beff403cc0] page_fault at ffffffff816b8788 > > > > > [exception RIP: iscsi_verify_itt+30] > > > > > RIP: ffffffffc08a541e RSP: ffff88beff403d78 RFLAGS: > > > > > 00010286 > > > > > RAX: 000000000000004c RBX: 00000000b0000036 RCX: > > > > > 0000000000000002 > > > > > RDX: 00000000000000cc RSI: 00000000b0000036 RDI: > > > > > 0000000000000000 > > > > > RBP: ffff88beff403da0 R8: 0000000040032a20 R9: > > > > > ffff8896e4eaf91c > > > > > R10: 0000000000000000 R11: 00007ffff7763ca0 R12: > > > > > 0000000000000000 > > > > > R13: ffff8896e4eaf9e4 R14: ffff8896e4eaf900 R15: > > > > > 0000000000000000 > > > > > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 > > > > > #10 [ffff88beff403da8] iscsi_itt_to_ctask at ffffffffc08a5527 > > > > > [libiscsi] > > > > > #11 [ffff88beff403dc8] iser_task_rsp at ffffffffc05eefea > > > > > [ib_iser] > > > > > #12 [ffff88beff403e10] __ib_process_cq at ffffffffc0587fbb > > > > > [ib_core] > > > > > #13 [ffff88beff403e50] ib_poll_handler at ffffffffc0588122 > > > > > [ib_core] > > > > > #14 [ffff88beff403e80] irq_poll_softirq at ffffffff81358507 > > > > > #15 [ffff88beff403eb8] __do_softirq at ffffffff81095195 > > > > > #16 
> > > > > #16 [ffff88beff403f28] call_softirq at ffffffff816c4e8c
> > > > > #17 [ffff88beff403f40] do_softirq at ffffffff8102d435
> > > > > #18 [ffff88beff403f60] irq_exit at ffffffff81095515
> > > > > #19 [ffff88beff403f78] do_IRQ at ffffffff816c61d6
> > > > > --- <IRQ stack> ---
> > > > > #20 [ffff884dd0af3f58] ret_from_intr at ffffffff816b837c
> > > > >     RIP: 000000000041b866  RSP: 00007fffffffea28  RFLAGS: 00000206
> > > > >     RAX: 0000000000000000  RBX: 00007fffffffef53  RCX: 00000000006f1a70
> > > > >     RDX: 00000000006f1a70  RSI: 00000000006f1a90  RDI: 0000000000000000
> > > > >     RBP: 0000000000000002  R8:  0000000000000001  R9:  0000000000000020
> > > > >     R10: 0000000000000003  R11: 00007ffff7763ca0  R12: ffff88beff4061e8
> > > > >     R13: 00000000ffffffff  R14: 0000000000000000  R15: 0000000000000063
> > > > >     ORIG_RAX: ffffffffffffffbb  CS: 0033  SS: 002b
> > > > >
> > > > > crash> ps -p 41538
> > > > > PID: 0      TASK: ffffffff81a0e480  CPU: 0   COMMAND: "swapper/0"
> > > > > PID: 1      TASK: ffff88012e4c8000  CPU: 7   COMMAND: "systemd"
> > > > > PID: 2345   TASK: ffff885ef5eb8fd0  CPU: 14  COMMAND: "zabbix_agentd"
> > > > > PID: 2349   TASK: ffff885efcbcaf70  CPU: 1   COMMAND: "zabbix_agentd"
> > > > > PID: 41538  TASK: ffff88587ce38fd0  CPU: 20  COMMAND: "sh"
> > > >
> > > > Don
> > >
> > > I misspoke about the kernel version, it's 7.4:
> > > 3.10.0-693.34.1.el7_bz1582551.x86_64
> > > It's the one we added the missing iscsi patches to, but the base is
> > > 7.4, so I will test with 7.5.
> >
> > Don, I had another look at this.
> >
> > It's not the SG_GAPS issue causing a memory registration error that I
> > reported and we fixed in 7.5 from upstream.
> >
> > Which commit in 7.5 did we pull in to fix this from upstream?
> >
> > I think this is different and not yet fixed??
> > > > [14556.614551] iser: iser_err_comp: memreg failure: memory > > management > > operation error (6) vend_err 78 > > [14556.666134] connection1:0: detected conn error (1011) > > [14562.678414] mlx5_1:dump_cqe:262:(pid 0): dump error cqe > > [14562.678529] mlx5_0:dump_cqe:262:(pid 0): dump error cqe > > [14562.678530] 00000000 00000000 00000000 00000000 > > [14562.678531] 00000000 00000000 00000000 00000000 > > [14562.678531] 00000000 00000000 00000000 00000000 > > [14562.678532] 00000000 08007806 25000344 34681cd2 > > [14562.678535] iser: iser_err_comp: memreg failure: memory > > management > > operation error (6) vend_err 78 > > [14562.678544] connection1:0: detected conn error (1011) > > [14562.679098] BUG: unable to handle kernel NULL pointer > > dereference > > at > > 0000000000000010 > > [14562.679105] IP: [<ffffffffc088141e>] iscsi_verify_itt+0x1e/0x110 > > [libiscsi] > > [14562.679106] PGD 0 > > [14562.679107] Oops: 0000 [#1] SMP > > [14562.679134] Modules linked in: ip6table_filter ip6_tables > > iptable_filter sctp_diag sctp tcp_diag udp_diag inet_diag unix_diag > > af_packet_diag netlink_diag bnx2i cnic uio ip_vs nf_conntrack > > oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3 rpcsec_gss_krb5 > > nfsv4 > > dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert > > iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt > > target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib > > rdma_ucm > > ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core xfs > > vfat > > fat sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi > > kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel > > aesni_intel > > lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt > > iTCO_vendor_support ipmi_ssif pcspkr ipmi_si dm_multipath ioatdma > > lpc_ich i2c_i801 sg hpilo > > [14562.679152] hpwdt dca ipmi_devintf ipmi_msghandler pcc_cpufreq > > shpchp wmi acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl > > lockd > > grace 
sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif > > crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea > > sysfillrect > > sysimgblt fb_sys_fops ttm bnx2x mlx5_core devlink mdio tg3(OE) > > libcrc32c drm crct10dif_pclmul hpsa(OE) crct10dif_common ptp > > i2c_core > > crc32c_intel scsi_transport_sas pps_core dm_mirror dm_region_hash > > dm_log dm_mod > > [14562.679154] CPU: 9 PID: 0 Comm: swapper/9 Tainted: > > P OE - > > ----------- 3.10.0-693.22.1.el7.x86_64 #1 > > [14562.679155] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 > > Gen9, BIOS P89 05/21/2018 > > [14562.679156] task: ffff8860aefaaf70 ti: ffff8860ae440000 task.ti: > > ffff8860ae440000 > > [14562.679158] RIP: 0010:[<ffffffffc088141e>] [<ffffffffc088141e>] > > iscsi_verify_itt+0x1e/0x110 [libiscsi] > > [14562.679159] RSP: 0018:ffff88beff2c3d78 EFLAGS: 00010286 > > [14562.679160] RAX: 000000000000004c RBX: 00000000d0000041 RCX: > > 0000000000000002 > > [14562.679161] RDX: 00000000000000cc RSI: 00000000d0000041 RDI: > > 0000000000000000 > > [14562.679161] RBP: ffff88beff2c3da0 R08: 0000000040001038 R09: > > ffff88ae496fe01c > > [14562.679162] R10: 0000000000000000 R11: 7fffffffffffffff R12: > > 0000000000000000 > > [14562.679162] R13: ffff88ae496fe0e4 R14: ffff88ae496fe000 R15: > > 0000000000000000 > > [14562.679163] FS: 0000000000000000(0000) > > GS:ffff88beff2c0000(0000) > > knlGS:0000000000000000 > > [14562.679164] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [14562.679164] CR2: 0000000000000010 CR3: 000000beede48000 CR4: > > 00000000003607e0 > > [14562.679165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > 0000000000000000 > > [14562.679166] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > 0000000000000400 > > [14562.679166] Call Trace: > > [14562.679168] <IRQ> > > [14562.679170] [<ffffffffc0881527>] iscsi_itt_to_ctask+0x17/0x80 > > [libiscsi] > > [14562.679173] [<ffffffffc069ffea>] iser_task_rsp+0xca/0x360 > > [ib_iser] > > [14562.679181] [<ffffffffc0924fbb>] 
__ib_process_cq+0x6b/0xe0 > > [ib_core] > > Starts with the memreg failures > crash> log | grep "iser: iser_err_comp: memreg failure" | wc -l > 1237 > > Then the panic > > [14556.614551] iser: iser_err_comp: memreg failure: memory management > operation error (6) vend_err 78 > [14556.666134] connection1:0: detected conn error (1011) > [14562.678414] mlx5_1:dump_cqe:262:(pid 0): dump error cqe > [14562.678529] mlx5_0:dump_cqe:262:(pid 0): dump error cqe > [14562.678530] 00000000 00000000 00000000 00000000 > [14562.678531] 00000000 00000000 00000000 00000000 > [14562.678531] 00000000 00000000 00000000 00000000 > [14562.678532] 00000000 08007806 25000344 34681cd2 > [14562.678535] iser: iser_err_comp: memreg failure: memory management > operation error (6) vend_err 78 > [14562.678544] connection1:0: detected conn error (1011) > > [14562.679098] BUG: unable to handle kernel NULL pointer dereference > at > 0000000000000010 > [14562.679105] IP: [<ffffffffc088141e>] iscsi_verify_itt+0x1e/0x110 > [libiscsi] > > crash> bt > PID: 0 TASK: ffff8860aefaaf70 CPU: 9 COMMAND: "swapper/9" > #0 [ffff88beff2c3a18] machine_kexec at ffffffff8105d77b > #1 [ffff88beff2c3a78] __crash_kexec at ffffffff81108732 > #2 [ffff88beff2c3b48] crash_kexec at ffffffff81108820 > #3 [ffff88beff2c3b60] oops_end at ffffffff816b8778 > #4 [ffff88beff2c3b88] no_context at ffffffff816a7c7a > #5 [ffff88beff2c3bd8] __bad_area_nosemaphore at ffffffff816a7d10 > #6 [ffff88beff2c3c20] bad_area_nosemaphore at ffffffff816a7e7a > #7 [ffff88beff2c3c30] __do_page_fault at ffffffff816bb68e > #8 [ffff88beff2c3c90] do_page_fault at ffffffff816bb835 > #9 [ffff88beff2c3cc0] page_fault at ffffffff816b7768 > [exception RIP: iscsi_verify_itt+30] > RIP: ffffffffc088141e RSP: ffff88beff2c3d78 RFLAGS: 00010286 > RAX: 000000000000004c RBX: 00000000d0000041 RCX: > 0000000000000002 > RDX: 00000000000000cc RSI: 00000000d0000041 RDI: > 0000000000000000 > RBP: ffff88beff2c3da0 R8: 0000000040001038 R9: > ffff88ae496fe01c > R10: 
0000000000000000 R11: 7fffffffffffffff R12: > 0000000000000000 > R13: ffff88ae496fe0e4 R14: ffff88ae496fe000 R15: > 0000000000000000 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #10 [ffff88beff2c3da8] iscsi_itt_to_ctask at ffffffffc0881527 > [libiscsi] > #11 [ffff88beff2c3dc8] iser_task_rsp at ffffffffc069ffea [ib_iser] > #12 [ffff88beff2c3e10] __ib_process_cq at ffffffffc0924fbb [ib_core] > #13 [ffff88beff2c3e50] ib_poll_handler at ffffffffc0925122 [ib_core] > #14 [ffff88beff2c3e80] irq_poll_softirq at ffffffff813572b7 > #15 [ffff88beff2c3eb8] __do_softirq at ffffffff81094035 > #16 [ffff88beff2c3f28] call_softirq at ffffffff816c3afc > #17 [ffff88beff2c3f40] do_softirq at ffffffff8102d435 > #18 [ffff88beff2c3f60] irq_exit at ffffffff810943b5 > #19 [ffff88beff2c3f78] do_IRQ at ffffffff816c4d96 > --- <IRQ stack> --- > #20 [ffff8860ae443db8] ret_from_intr at ffffffff816b7362 > [exception RIP: cpuidle_enter_state+87] > RIP: ffffffff81530b07 RSP: ffff8860ae443e60 RFLAGS: 00000202 > RAX: 00000d3e7d729de6 RBX: ffff8860ae443e40 RCX: > 0000000000000018 > RDX: 0000000225c17d03 RSI: ffff8860ae443fd8 RDI: > 00000d3e7d729de6 > RBP: ffff8860ae443e88 R8: 000000000000016c R9: > 000000000000001c > R10: 0000000000000043 R11: 7fffffffffffffff R12: > 0000000000000009 > R13: ffff88beff2d39a0 R14: ffffffff810b77e5 R15: > ffff8860ae443de0 > ORIG_RAX: ffffffffffffff5d CS: 0010 SS: 0018 > #21 [ffff8860ae443e90] cpuidle_idle_call at ffffffff81530c5e > #22 [ffff8860ae443ed0] arch_cpu_idle at ffffffff81034f8e > #23 [ffff8860ae443ee0] cpu_startup_entry at ffffffff810eb6da > #24 [ffff8860ae443f28] start_secondary at ffffffff81052222 > > crash> dis -l iscsi_verify_itt+30 > /usr/src/debug/kernel-3.10.0-693.22.1.el7/linux-3.10.0- > 693.22.1.el7.x86_64/drivers/scsi/libiscsi.c: 1292 > 0xffffffffc088141e > <iscsi_verify_itt+30>: mov 0x10(%rdi),%r13 > crash> > > > So fails here > > int iscsi_verify_itt(struct iscsi_conn *conn, itt_t itt) > { > struct iscsi_session *session = conn->session; **** 
conn->session is invalid
>
> rdi had the struct iscsi_conn
>
> 0xffffffffc0881400 <iscsi_verify_itt>:       nopl   0x0(%rax,%rax,1) [FTRACE NOP]
> 0xffffffffc0881405 <iscsi_verify_itt+5>:     push   %rbp
> 0xffffffffc0881406 <iscsi_verify_itt+6>:     mov    %rsp,%rbp
> 0xffffffffc0881409 <iscsi_verify_itt+9>:     push   %r13
> 0xffffffffc088140b <iscsi_verify_itt+11>:    push   %r12
> 0xffffffffc088140d <iscsi_verify_itt+13>:    mov    %rdi,%r12
> 0xffffffffc0881410 <iscsi_verify_itt+16>:    push   %rbx
> 0xffffffffc0881411 <iscsi_verify_itt+17>:    mov    %esi,%ebx
> 0xffffffffc0881413 <iscsi_verify_itt+19>:    sub    $0x10,%rsp
> 0xffffffffc0881417 <iscsi_verify_itt+23>:    movl   $0x0,-0x28(%rbp)
> 0xffffffffc088141e <iscsi_verify_itt+30>:    mov    0x10(%rdi),%r13
>
>     RIP: ffffffffc088141e  RSP: ffff88beff2c3d78  RFLAGS: 00010286
>     RAX: 000000000000004c  RBX: 00000000d0000041  RCX: 0000000000000002
>     RDX: 00000000000000cc  RSI: 00000000d0000041  RDI: 0000000000000000
>     RBP: ffff88beff2c3da0  R8:  0000000040001038  R9:  ffff88ae496fe01c
>     R10: 0000000000000000  R11: 7fffffffffffffff  R12: 0000000000000000
>     R13: ffff88ae496fe0e4  R14: ffff88ae496fe000  R15: 0000000000000000
>
> Both RDI and R12 are NULL; offsetting by 0x10 gives the bad address.
>
> So we have a race somehow that trashes the conn pointer under load.
>
> The load clearly is seeing resource issues and repeatedly failing the
> memory registration.

So, as I expected, the memreg issues are gone on 7.5, which was rebased
against upstream. We are now hitting this, and I am unable to reproduce it
in-house after multiple attempts.

Aug  7 06:47:30 xxxxxxx kernel: WARNING: CPU: 20 PID: 36881 at lib/list_debug.c:36 __list_add+0x8a/0xc0
Aug  7 06:47:30 xxxxxxx kernel: list_add double add: new=ffff9f01523b92c8, prev=ffff9f01523b92c8, next=ffff9f69e4216d88.
Aug 7 06:47:30 xxxxxxx kernel: Modules linked in: bnx2i cnic uio ip_vs nf_conntrack ip6table_filter ip6_tables iptable_filter tcp_diag udp_diag inet_diag unix_diag af_packet_diag n etlink_diag oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi sc si_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core vfat fat xfs sb_edac intel_p owerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt aesni_intel ipmi_ssif lrw iTCO_vendor_support gf128mul glue_helper ablk_helper i oatdma cryptd ipmi_si pcspkr joydev ipmi_devintf hpwdt i2c_i801 hpilo sg lpc_ich wmi dca ipmi_msghandler Aug 7 06:47:30 xxxxxxx kernel: acpi_power_meter pcc_cpufreq shpchp dm_multipath binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10di f crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core ttm mlxfw drm bnx2x tg3 devlink mdio crct10dif_pclmul libcrc32c crct10dif_common hpsa ptp i2c_core crc32c_intel scsi_transport_sas pps_core dm_mirror dm_region_hash dm_log dm_mod Aug 7 06:47:30 xxxxxxx kernel: CPU: 20 PID: 36881 Comm: sh Tainted: P W OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 Aug 7 06:47:30 xxxxxxx kernel: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 05/21/2018 Aug 7 06:47:30 xxxxxxx kernel: Call Trace: Aug 7 06:47:30 xxxxxxx kernel: <IRQ> [<ffffffffa650e84e>] dump_stack+0x19/0x1b Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e91e18>] __warn+0xd8/0x100 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e91e9f>] warn_slowpath_fmt+0x5f/0x80 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6168d8a>] __list_add+0x8a/0xc0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc0ac3c75>] ipoib_start_xmit+0x485/0x6d0 [ib_ipoib] Aug 7 06:47:30 xxxxxxx 
kernel: [<ffffffffa63ec226>] dev_hard_start_xmit+0x246/0x3b0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6417aba>] sch_direct_xmit+0x11a/0x250 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ef111>] __dev_queue_xmit+0x4a1/0x660 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ef2e0>] dev_queue_xmit+0x10/0x20 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63fad1d>] neigh_resolve_output+0x11d/0x220 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa60db10a>] ? selinux_ipv4_postroute+0x1a/0x20 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa643820c>] ip_finish_output+0x2ac/0x7a0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6438a03>] ip_output+0x73/0xe0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6437f60>] ? __ip_append_data.isra.50+0xa50/0xa50 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa64365f7>] ip_local_out_sk+0x37/0x40 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6436963>] ip_queue_xmit+0x143/0x3a0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6450844>] tcp_transmit_skb+0x4e4/0x9e0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa64528bf>] tcp_send_ack+0x11f/0x170 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6445735>] tcp_send_dupack+0x25/0xd0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa644ce86>] tcp_validate_incoming+0x186/0x2d0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa644d18d>] tcp_rcv_established+0x1bd/0x770 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6457e6a>] tcp_v4_do_rcv+0x10a/0x350 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa64595fc>] tcp_v4_rcv+0x78c/0x990 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc0feafc6>] ? ip_vs_remote_request4+0x16/0x20 [ip_vs] Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa643272d>] ip_local_deliver_finish+0xbd/0x200 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6432a19>] ip_local_deliver+0x59/0xd0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6432670>] ? ip_rcv_finish+0x370/0x370 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6432390>] ip_rcv_finish+0x90/0x370 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6432d49>] ip_rcv+0x2b9/0x410 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6123411>] ? 
blk_complete_request+0x21/0x30 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ecab9>] __netif_receive_skb_core+0x729/0xa20 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ecdc8>] __netif_receive_skb+0x18/0x60 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ece50>] netif_receive_skb_internal+0x40/0xc0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63eda78>] napi_gro_receive+0xd8/0x100 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc0983183>] mlx5i_handle_rx_cqe+0x2a3/0x460 [mlx5_core] Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc09826f8>] mlx5e_poll_rx_cq+0xc8/0x8b0 [mlx5_core] Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc0983909>] mlx5e_napi_poll+0x99/0x280 [mlx5_core] Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ed46f>] net_rx_action+0x26f/0x390 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e9b085>] __do_softirq+0xf5/0x280 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6523cec>] call_softirq+0x1c/0x30 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e2d625>] do_softirq+0x65/0xa0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e9b405>] irq_exit+0x105/0x110 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6524f86>] do_IRQ+0x56/0xf0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6517362>] common_interrupt+0x162/0x162 Aug 7 06:47:30 xxxxxxx kernel: <EOI> [<ffffffffa5fc12d5>] ? do_read_fault.isra.60+0x5/0x1a0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5fc5a9c>] ? 
handle_pte_fault+0x2dc/0xc30 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5fc7c3d>] handle_mm_fault+0x39d/0x9b0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa651b547>] __do_page_fault+0x197/0x4f0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa651b8d5>] do_page_fault+0x35/0x90 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6517758>] page_fault+0x28/0x30 Aug 7 06:47:30 xxxxxxx kernel: ---[ end trace 020d3cfb07217435 ]--- Then this very soon after Aug 7 06:47:48 xxxxxxx kernel: ------------[ cut here ]------------ Aug 7 06:47:48 xxxxxxx kernel: WARNING: CPU: 10 PID: 89058 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0 Aug 7 06:47:48 xxxxxxx kernel: list_del corruption, ffff9f6fba35bb70- >next is LIST_POISON1 (dead000000000100) Aug 7 06:47:48 xxxxxxx kernel: Modules linked in: bnx2i cnic uio ip_vs nf_conntrack ip6table_filter ip6_tables iptable_filter tcp_diag udp_diag inet_diag unix_diag af_packet_diag n etlink_diag oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi sc si_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core vfat fat xfs sb_edac intel_p owerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt aesni_intel ipmi_ssif lrw iTCO_vendor_support gf128mul glue_helper ablk_helper i oatdma cryptd ipmi_si pcspkr joydev ipmi_devintf hpwdt i2c_i801 hpilo sg lpc_ich wmi dca ipmi_msghandler Aug 7 06:47:48 xxxxxxx kernel: acpi_power_meter pcc_cpufreq shpchp dm_multipath binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10di f crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core ttm mlxfw drm bnx2x tg3 devlink mdio crct10dif_pclmul libcrc32c crct10dif_common hpsa ptp i2c_core crc32c_intel scsi_transport_sas pps_core dm_mirror 
dm_region_hash dm_log dm_mod Aug 7 06:47:48 xxxxxxx kernel: CPU: 10 PID: 89058 Comm: tnslsnr Tainted: P W OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 Aug 7 06:47:48 xxxxxxx kernel: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 05/21/2018 Aug 7 06:47:48 xxxxxxx kernel: Call Trace: Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa650e84e>] dump_stack+0x19/0x1b Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa5e91e18>] __warn+0xd8/0x100 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa5e91e9f>] warn_slowpath_fmt+0x5f/0x80 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6168e23>] __list_del_entry+0x63/0xd0 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6168e9d>] list_del+0xd/0x30 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa5ebc226>] remove_wait_queue+0x26/0x40 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6067a5a>] ep_unregister_pollwait.isra.6+0x3a/0x60 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6067aa2>] ep_remove+0x22/0xc0 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6068f1f>] SyS_epoll_ctl+0x4bf/0xc60 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa651b56c>] ? __do_page_fault+0x1bc/0x4f0 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6520795>] system_call_fastpath+0x1c/0x21 Aug 7 06:47:48 xxxxxxx kernel: ---[ end trace 020d3cfb07217438 ]--- These started after 7.5 messages-20180806:Aug 4 20:10:54 xxxxxxx kernel: WARNING: CPU: 1 PID: 48632 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0 messages-20180806:Aug 4 20:10:54 xxxxxxx kernel: list_del corruption. prev->next should be ffff9f6991eb6648, but was (null) messages-20180806:Aug 4 20:10:54 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0 messages-20180806:Aug 5 00:25:39 xxxxxxx kernel: WARNING: CPU: 3 PID: 84714 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0 messages-20180806:Aug 5 00:25:39 xxxxxxx kernel: list_del corruption. 
prev->next should be ffff9f0bb12206c8, but was (null)
messages-20180806:Aug  5 00:25:39 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:33:42 xxxxxxx kernel: WARNING: CPU: 4 PID: 80177 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:33:42 xxxxxxx kernel: list_del corruption. prev->next should be ffff9f69546a7ac8, but was dead000000000200
messages-20180806:Aug  5 00:33:42 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:40:14 xxxxxxx kernel: WARNING: CPU: 13 PID: 80177 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:40:14 xxxxxxx kernel: list_del corruption. prev->next should be ffff9f44a5b9f248, but was (null)
messages-20180806:Aug  5 00:40:14 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:43:16 xxxxxxx kernel: WARNING: CPU: 0 PID: 51133 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:43:16 xxxxxxx kernel: list_del corruption. prev->next should be ffff9f6792776d48, but was (null)
messages-20180806:Aug  5 00:43:16 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0

It will be tough to get upstream tested here, so I am continuing to try
to reproduce. Has anybody seen this list corruption before?