Re: Seeing this on a RHEL kernel with upstream backports wondering if this was ever fixed

On Tue, 2018-08-07 at 14:33 -0400, Laurence Oberman wrote:
> On Tue, 2018-08-07 at 14:26 -0400, Laurence Oberman wrote:
> > On Fri, 2018-07-27 at 09:21 -0400, Laurence Oberman wrote:
> > > On Fri, 2018-07-27 at 08:05 -0400, Laurence Oberman wrote:
> > > > On Thu, 2018-07-26 at 16:02 -0400, Laurence Oberman wrote:
> > > > > On Thu, 2018-07-26 at 10:28 -0400, Don Dutile wrote:
> > > > > > On 07/26/2018 08:48 AM, Laurence Oberman wrote:
> > > > > > > Hello
> > > > > > > 
> > > > > > > https://www.spinics.net/lists/linux-rdma/msg51334.html
> > > > > > > 
> > > > > > > A RHEL 7.5 kernel with backports from upstream is hitting this.
> > > > > > > Chuck reported it, and Sagi and Max responded, but it's not
> > > > > > > clear if we ever fixed this.
> > > > > > > 
> > > > > > 
> > > > > > RHEL-7.5 data point:
> > > > > > -- drivers/infiniband/* -r is backported to v4.14,
> > > > > >     i.e., it includes the patch(es) mentioned in the above thread.
> > > > > > 
> > > > > > Laurence:
> > > > > > Please test with the 7.6 kernel & report back.
> > > > > > If that passes, RH can bisect the bug fix between v4.14 & v4.16
> > > > > > (the 7.6 update point for its rdma kernel core) and backport it
> > > > > > to 7.5-zstream. Note: you'll have to update the rdma-core pkg to
> > > > > > the 7.6 version as well.
> > > > > > All functional & bug fix patches to mlx* (ib & enet) are in as
> > > > > > well (same kernel references).
> > > > > > 
> > > > > > -dd
> > > > > > 
> > > > > > > In this case we end up in a panic, not just messaging, although
> > > > > > > the messages were logged over and over for a long time until we
> > > > > > > finally panicked.
> > > > > > > 
> > > > > > > crash> log | grep "memreg failure: memor" | wc -l
> > > > > > > 2414
> > > > > > > 
> > > > > > > crash> log
> > > > > > > [1635578.012721]  connection16:0: detected conn error (1011)
> > > > > > > [1635587.050688] mlx5_0:dump_cqe:262:(pid 93128): dump error cqe
> > > > > > > [1635587.089686] 00000000 00000000 00000000 00000000
> > > > > > > [1635587.123989] 00000000 00000000 00000000 00000000
> > > > > > > [1635587.157494] 00000000 00000000 00000000 00000000
> > > > > > > [1635587.190968] 00000000 08007806 250002ad ba6115d3
> > > > > > > 
> > > > > > > [1635587.224331] iser: iser_err_comp: memreg failure: memory management operation error (6) vend_err 78
> > > > > > > [1635587.278876]  connection15:0: detected conn error (1011)
> > > > > > > [1635590.986286] mlx5_1:dump_cqe:262:(pid 0): dump error cqe
> > > > > > > [1635591.021891] 00000000 00000000 00000000 00000000
> > > > > > > [1635591.053944] 00000000 00000000 00000000 00000000
> > > > > > > 
> > > > > > > [1657077.997960] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> > > > > > > [1657077.997967] IP: [<ffffffffc08a541e>] iscsi_verify_itt+0x1e/0x110 [libiscsi]
> > > > > > > [1657077.997970] PGD 80000098de387067 PUD b8d9ffa067 PMD
> > > > > > > 0
> > > > > > > [1657077.997971] Oops: 0000 [#1] SMP
> > > > > > > [1657077.998009] Modules linked in: oracleasm(O) nfsv3
> > > > > > > rpcsec_gss_krb5
> > > > > > > nfsv4 dns_resolver nfs fscache dm_round_robin bonding
> > > > > > > rpcrdma
> > > > > > > ib_isert
> > > > > > > iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi
> > > > > > > ib_srpt
> > > > > > > target_core_mod ib_srp scsi_transport_srp scsi_tgt
> > > > > > > ib_ipoib
> > > > > > > rdma_ucm
> > > > > > > ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib
> > > > > > > ib_core
> > > > > > > vfat
> > > > > > > fat
> > > > > > > xfs sb_edac edac_core intel_powerclamp coretemp
> > > > > > > intel_rapl
> > > > > > > iosf_mbi
> > > > > > > kvm_intel kvm irqbypass iTCO_wdt crc32_pclmul ipmi_ssif
> > > > > > > iTCO_vendor_support ghash_clmulni_intel aesni_intel lrw
> > > > > > > gf128mul
> > > > > > > ipmi_si glue_helper ablk_helper cryptd sg hpwdt hpilo
> > > > > > > pcspkr
> > > > > > > ipmi_devintf ioatdma dm_multipath i2c_i801 lpc_ich shpchp
> > > > > > > dca
> > > > > > > wmi
> > > > > > > ipmi_msghandler pcc_cpufreq acpi_power_meter nfsd
> > > > > > > binfmt_misc
> > > > > > > auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4
> > > > > > > mbcache
> > > > > > > jbd2
> > > > > > > sd_mod crc_t10dif crct10dif_generic
> > > > > > > [1657077.998020]  i2c_algo_bit drm_kms_helper syscopyarea
> > > > > > > sysfillrect
> > > > > > > sysimgblt fb_sys_fops ttm bnx2x mlx5_core
> > > > > > > crct10dif_pclmul
> > > > > > > mdio
> > > > > > > tg3(OE)
> > > > > > > devlink libcrc32c crct10dif_common drm hpsa(OE) ptp
> > > > > > > i2c_core
> > > > > > > crc32c_intel scsi_transport_sas pps_core dm_mirror
> > > > > > > dm_region_hash
> > > > > > > dm_log dm_mod
> > > > > > > [1657077.998023] CPU: 20 PID: 41538 Comm: sh Tainted:
> > > > > > > G           OE  -
> > > > > > > -----------   3.10.0-693.34.1.el7_bz1582551.x86_64 #1
> > > > > > > [1657077.998024] Hardware name: HP ProLiant DL380
> > > > > > > Gen9/ProLiant
> > > > > > > DL380
> > > > > > > Gen9, BIOS P89 05/21/2018
> > > > > > > [1657077.998025] task: ffff88587ce38fd0 ti:
> > > > > > > ffff884dd0af0000
> > > > > > > task.ti:
> > > > > > > ffff884dd0af0000
> > > > > > > [1657077.998029] RIP:
> > > > > > > 0010:[<ffffffffc08a541e>]  [<ffffffffc08a541e>]
> > > > > > > iscsi_verify_itt+0x1e/0x110 [libiscsi]
> > > > > > > [1657077.998030] RSP: 0000:ffff88beff403d78  EFLAGS:
> > > > > > > 00010286
> > > > > > > [1657077.998031] RAX: 000000000000004c RBX:
> > > > > > > 00000000b0000036
> > > > > > > RCX:
> > > > > > > 0000000000000002
> > > > > > > [1657077.998032] RDX: 00000000000000cc RSI:
> > > > > > > 00000000b0000036
> > > > > > > RDI:
> > > > > > > 0000000000000000
> > > > > > > [1657077.998033] RBP: ffff88beff403da0 R08:
> > > > > > > 0000000040032a20
> > > > > > > R09:
> > > > > > > ffff8896e4eaf91c
> > > > > > > [1657077.998034] R10: 0000000000000000 R11:
> > > > > > > 00007ffff7763ca0
> > > > > > > R12:
> > > > > > > 0000000000000000
> > > > > > > [1657077.998035] R13: ffff8896e4eaf9e4 R14:
> > > > > > > ffff8896e4eaf900
> > > > > > > R15:
> > > > > > > 0000000000000000
> > > > > > > [1657077.998036] FS:  00007ffff7fe6740(0000)
> > > > > > > GS:ffff88beff400000(0000)
> > > > > > > knlGS:0000000000000000
> > > > > > > [1657077.998038] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > > > 0000000080050033
> > > > > > > [1657077.998039] CR2: 0000000000000010 CR3:
> > > > > > > 000000ad92eba000
> > > > > > > CR4:
> > > > > > > 00000000003607e0
> > > > > > > [1657077.998040] DR0: 0000000000000000 DR1:
> > > > > > > 0000000000000000
> > > > > > > DR2:
> > > > > > > 0000000000000000
> > > > > > > [1657077.998041] DR3: 0000000000000000 DR6:
> > > > > > > 00000000fffe0ff0
> > > > > > > DR7:
> > > > > > > 0000000000000400
> > > > > > > [1657077.998042] Call Trace:
> > > > > > > [1657077.998044]  <IRQ>
> > > > > > > [1657077.998046]  [<ffffffffc08a5527>]
> > > > > > > iscsi_itt_to_ctask+0x17/0x80
> > > > > > > [libiscsi]
> > > > > > > [1657077.998050]  [<ffffffffc05eefea>]
> > > > > > > iser_task_rsp+0xca/0x360
> > > > > > > [ib_iser]
> > > > > > > [1657077.998061]  [<ffffffffc0587fbb>]
> > > > > > > __ib_process_cq+0x6b/0xe0
> > > > > > > [ib_core]
> > > > > > > [1657077.998066]  [<ffffffffc0588122>]
> > > > > > > ib_poll_handler+0x22/0x80
> > > > > > > [ib_core]
> > > > > > > [1657077.998070]  [<ffffffff81358507>]
> > > > > > > irq_poll_softirq+0xc7/0x100
> > > > > > > [1657077.998076]  [<ffffffff81095195>]
> > > > > > > __do_softirq+0xf5/0x280
> > > > > > > [1657077.998081]  [<ffffffff816c4e8c>]
> > > > > > > call_softirq+0x1c/0x30
> > > > > > > [1657077.998086]  [<ffffffff8102d435>]
> > > > > > > do_softirq+0x65/0xa0
> > > > > > > [1657077.998088]  [<ffffffff81095515>]
> > > > > > > irq_exit+0x105/0x110
> > > > > > > [1657077.998091]  [<ffffffff816c61d6>] do_IRQ+0x56/0xf0
> > > > > > > [1657077.998098]  [<ffffffff816b837c>]
> > > > > > > common_interrupt+0x17c/0x17c
> > > > > > > [1657077.998099]  <EOI>
> > > > > > > [1657077.998113] Code: ff ff ff eb a9 41 be 95 ff ff ff
> > > > > > > eb
> > > > > > > a1
> > > > > > > 0f
> > > > > > > 1f
> > > > > > > 44
> > > > > > > 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 89 f3 48 83 ec
> > > > > > > 10
> > > > > > > c7
> > > > > > > 45
> > > > > > > d8 00
> > > > > > > 00 00 00 <4c> 8b 6f 10 65 48 8b 04 25 28 00 00 00 48 89
> > > > > > > 45
> > > > > > > e0
> > > > > > > 31
> > > > > > > c0
> > > > > > > 83
> > > > > > > fe
> > > > > > > [1657077.998116] RIP  [<ffffffffc08a541e>]
> > > > > > > iscsi_verify_itt+0x1e/0x110
> > > > > > > [libiscsi]
> > > > > > > [1657077.998116]  RSP <ffff88beff403d78>
> > > > > > > [1657077.998117] CR2: 0000000000000010
> > > > > > > crash>
> > > > > > > 
> > > > > > > crash> bt
> > > > > > > PID: 41538  TASK: ffff88587ce38fd0  CPU: 20  COMMAND:
> > > > > > > "sh"
> > > > > > >   #0 [ffff88beff403a18] machine_kexec at ffffffff8105ddeb
> > > > > > >   #1 [ffff88beff403a78] __crash_kexec at ffffffff81109902
> > > > > > >   #2 [ffff88beff403b48] crash_kexec at ffffffff811099f0
> > > > > > >   #3 [ffff88beff403b60] oops_end at ffffffff816b97a8
> > > > > > >   #4 [ffff88beff403b88] no_context at ffffffff816a8c96
> > > > > > >   #5 [ffff88beff403bd8] __bad_area_nosemaphore at
> > > > > > > ffffffff816a8d2c
> > > > > > >   #6 [ffff88beff403c20] bad_area_nosemaphore at
> > > > > > > ffffffff816a8e96
> > > > > > >   #7 [ffff88beff403c30] __do_page_fault at
> > > > > > > ffffffff816bc6be
> > > > > > >   #8 [ffff88beff403c90] do_page_fault at ffffffff816bc865
> > > > > > >   #9 [ffff88beff403cc0] page_fault at ffffffff816b8788
> > > > > > >      [exception RIP: iscsi_verify_itt+30]
> > > > > > >      RIP: ffffffffc08a541e  RSP:
> > > > > > > ffff88beff403d78  RFLAGS:
> > > > > > > 00010286
> > > > > > >      RAX: 000000000000004c  RBX: 00000000b0000036  RCX:
> > > > > > > 0000000000000002
> > > > > > >      RDX: 00000000000000cc  RSI: 00000000b0000036  RDI:
> > > > > > > 0000000000000000
> > > > > > >      RBP: ffff88beff403da0   R8: 0000000040032a20   R9:
> > > > > > > ffff8896e4eaf91c
> > > > > > >      R10: 0000000000000000  R11: 00007ffff7763ca0  R12:
> > > > > > > 0000000000000000
> > > > > > >      R13: ffff8896e4eaf9e4  R14: ffff8896e4eaf900  R15:
> > > > > > > 0000000000000000
> > > > > > >      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
> > > > > > > #10 [ffff88beff403da8] iscsi_itt_to_ctask at
> > > > > > > ffffffffc08a5527
> > > > > > > [libiscsi]
> > > > > > > #11 [ffff88beff403dc8] iser_task_rsp at ffffffffc05eefea
> > > > > > > [ib_iser]
> > > > > > > #12 [ffff88beff403e10] __ib_process_cq at
> > > > > > > ffffffffc0587fbb
> > > > > > > [ib_core]
> > > > > > > #13 [ffff88beff403e50] ib_poll_handler at
> > > > > > > ffffffffc0588122
> > > > > > > [ib_core]
> > > > > > > #14 [ffff88beff403e80] irq_poll_softirq at
> > > > > > > ffffffff81358507
> > > > > > > #15 [ffff88beff403eb8] __do_softirq at ffffffff81095195
> > > > > > > #16 [ffff88beff403f28] call_softirq at ffffffff816c4e8c
> > > > > > > #17 [ffff88beff403f40] do_softirq at ffffffff8102d435
> > > > > > > #18 [ffff88beff403f60] irq_exit at ffffffff81095515
> > > > > > > #19 [ffff88beff403f78] do_IRQ at ffffffff816c61d6
> > > > > > > --- <IRQ stack> ---
> > > > > > > #20 [ffff884dd0af3f58] ret_from_intr at ffffffff816b837c
> > > > > > >      RIP: 000000000041b866  RSP:
> > > > > > > 00007fffffffea28  RFLAGS:
> > > > > > > 00000206
> > > > > > >      RAX: 0000000000000000  RBX: 00007fffffffef53  RCX:
> > > > > > > 00000000006f1a70
> > > > > > >      RDX: 00000000006f1a70  RSI: 00000000006f1a90  RDI:
> > > > > > > 0000000000000000
> > > > > > >      RBP: 0000000000000002   R8: 0000000000000001   R9:
> > > > > > > 0000000000000020
> > > > > > >      R10: 0000000000000003  R11: 00007ffff7763ca0  R12:
> > > > > > > ffff88beff4061e8
> > > > > > >      R13: 00000000ffffffff  R14: 0000000000000000  R15:
> > > > > > > 0000000000000063
> > > > > > >      ORIG_RAX: ffffffffffffffbb  CS: 0033  SS: 002b
> > > > > > > 
> > > > > > > crash> ps -p 41538
> > > > > > > PID: 0      TASK: ffffffff81a0e480  CPU: 0   COMMAND:
> > > > > > > "swapper/0"
> > > > > > >   PID: 1      TASK: ffff88012e4c8000  CPU: 7   COMMAND:
> > > > > > > "systemd"
> > > > > > >    PID: 2345   TASK: ffff885ef5eb8fd0  CPU: 14  COMMAND:
> > > > > > > "zabbix_agentd"
> > > > > > >     PID: 2349   TASK: ffff885efcbcaf70  CPU: 1   COMMAND:
> > > > > > > "zabbix_agentd"
> > > > > > >      PID: 41538  TASK: ffff88587ce38fd0  CPU:
> > > > > > > 20  COMMAND:
> > > > > > > "sh"
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Don,
> > > > > I misspoke about the kernel version; it's 7.4:
> > > > > 3.10.0-693.34.1.el7_bz1582551.x86_64
> > > > > It's the one we added the missing iSCSI patches to, but the base is
> > > > > 7.4.
> > > > > So I will test with 7.5.
> > > > > 
> > > > 
> > > > Don, I had another look at this.
> > > > 
> > > > It's not the SG_GAPS issue causing a memory registration error that I
> > > > reported and that we fixed in 7.5 from upstream.
> > > > 
> > > > Which commit in 7.5 did we pull in from upstream to fix this?
> > > > 
> > > > I think this is different and not yet fixed?
> > > > 
> > > > [14556.614551] iser: iser_err_comp: memreg failure: memory
> > > > management
> > > > operation error (6) vend_err 78
> > > > [14556.666134]  connection1:0: detected conn error (1011)
> > > > [14562.678414] mlx5_1:dump_cqe:262:(pid 0): dump error cqe
> > > > [14562.678529] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
> > > > [14562.678530] 00000000 00000000 00000000 00000000
> > > > [14562.678531] 00000000 00000000 00000000 00000000
> > > > [14562.678531] 00000000 00000000 00000000 00000000
> > > > [14562.678532] 00000000 08007806 25000344 34681cd2
> > > > [14562.678535] iser: iser_err_comp: memreg failure: memory
> > > > management
> > > > operation error (6) vend_err 78
> > > > [14562.678544]  connection1:0: detected conn error (1011)
> > > > [14562.679098] BUG: unable to handle kernel NULL pointer
> > > > dereference
> > > > at
> > > > 0000000000000010
> > > > [14562.679105] IP: [<ffffffffc088141e>]
> > > > iscsi_verify_itt+0x1e/0x110
> > > > [libiscsi]
> > > > [14562.679106] PGD 0
> > > > [14562.679107] Oops: 0000 [#1] SMP
> > > > [14562.679134] Modules linked in: ip6table_filter ip6_tables
> > > > iptable_filter sctp_diag sctp tcp_diag udp_diag inet_diag
> > > > unix_diag
> > > > af_packet_diag netlink_diag bnx2i cnic uio ip_vs nf_conntrack
> > > > oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3
> > > > rpcsec_gss_krb5
> > > > nfsv4
> > > > dns_resolver nfs fscache dm_round_robin bonding rpcrdma
> > > > ib_isert
> > > > iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt
> > > > target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib
> > > > rdma_ucm
> > > > ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core
> > > > xfs
> > > > vfat
> > > > fat sb_edac edac_core intel_powerclamp coretemp intel_rapl
> > > > iosf_mbi
> > > > kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel
> > > > aesni_intel
> > > > lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt
> > > > iTCO_vendor_support ipmi_ssif pcspkr ipmi_si dm_multipath
> > > > ioatdma
> > > > lpc_ich i2c_i801 sg hpilo
> > > > [14562.679152]  hpwdt dca ipmi_devintf ipmi_msghandler
> > > > pcc_cpufreq
> > > > shpchp wmi acpi_power_meter binfmt_misc nfsd auth_rpcgss
> > > > nfs_acl
> > > > lockd
> > > > grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif
> > > > crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea
> > > > sysfillrect
> > > > sysimgblt fb_sys_fops ttm bnx2x mlx5_core devlink mdio tg3(OE)
> > > > libcrc32c drm crct10dif_pclmul hpsa(OE) crct10dif_common ptp
> > > > i2c_core
> > > > crc32c_intel scsi_transport_sas pps_core dm_mirror
> > > > dm_region_hash
> > > > dm_log dm_mod
> > > > [14562.679154] CPU: 9 PID: 0 Comm: swapper/9 Tainted:
> > > > P           OE  -
> > > > -----------   3.10.0-693.22.1.el7.x86_64 #1
> > > > [14562.679155] Hardware name: HP ProLiant DL380 Gen9/ProLiant
> > > > DL380
> > > > Gen9, BIOS P89 05/21/2018
> > > > [14562.679156] task: ffff8860aefaaf70 ti: ffff8860ae440000
> > > > task.ti:
> > > > ffff8860ae440000
> > > > [14562.679158] RIP:
> > > > 0010:[<ffffffffc088141e>]  [<ffffffffc088141e>]
> > > > iscsi_verify_itt+0x1e/0x110 [libiscsi]
> > > > [14562.679159] RSP: 0018:ffff88beff2c3d78  EFLAGS: 00010286
> > > > [14562.679160] RAX: 000000000000004c RBX: 00000000d0000041 RCX:
> > > > 0000000000000002
> > > > [14562.679161] RDX: 00000000000000cc RSI: 00000000d0000041 RDI:
> > > > 0000000000000000
> > > > [14562.679161] RBP: ffff88beff2c3da0 R08: 0000000040001038 R09:
> > > > ffff88ae496fe01c
> > > > [14562.679162] R10: 0000000000000000 R11: 7fffffffffffffff R12:
> > > > 0000000000000000
> > > > [14562.679162] R13: ffff88ae496fe0e4 R14: ffff88ae496fe000 R15:
> > > > 0000000000000000
> > > > [14562.679163] FS:  0000000000000000(0000)
> > > > GS:ffff88beff2c0000(0000)
> > > > knlGS:0000000000000000
> > > > [14562.679164] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > 0000000080050033
> > > > [14562.679164] CR2: 0000000000000010 CR3: 000000beede48000 CR4:
> > > > 00000000003607e0
> > > > [14562.679165] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [14562.679166] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [14562.679166] Call Trace:
> > > > [14562.679168]  <IRQ>
> > > > [14562.679170]  [<ffffffffc0881527>]
> > > > iscsi_itt_to_ctask+0x17/0x80
> > > > [libiscsi]
> > > > [14562.679173]  [<ffffffffc069ffea>] iser_task_rsp+0xca/0x360
> > > > [ib_iser]
> > > > [14562.679181]  [<ffffffffc0924fbb>] __ib_process_cq+0x6b/0xe0
> > > > [ib_core]
> > > 
> > > Starts with the memreg failures
> > > crash> log | grep "iser: iser_err_comp: memreg failure" | wc -l
> > > 1237
> > > 
> > > Then the panic
> > > 
> > > [14556.614551] iser: iser_err_comp: memreg failure: memory management operation error (6) vend_err 78
> > > [14556.666134]  connection1:0: detected conn error (1011)
> > > [14562.678414] mlx5_1:dump_cqe:262:(pid 0): dump error cqe
> > > [14562.678529] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
> > > [14562.678530] 00000000 00000000 00000000 00000000
> > > [14562.678531] 00000000 00000000 00000000 00000000
> > > [14562.678531] 00000000 00000000 00000000 00000000
> > > [14562.678532] 00000000 08007806 25000344 34681cd2
> > > [14562.678535] iser: iser_err_comp: memreg failure: memory management operation error (6) vend_err 78
> > > [14562.678544]  connection1:0: detected conn error (1011)
> > > 
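The "(6) vend_err 78" in the iser message appears to line up with the low bytes of the 08007806 word in the last row of the CQE dump. Below is a minimal C sketch of that decoding. It assumes, based only on how the two log lines pair up and not on the driver source, that the second-lowest byte of that word is the vendor error and the lowest byte is the syndrome. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the two fields of interest. */
struct cqe_error_fields {
    uint8_t vendor_err; /* assumed: bits 15..8 of the dumped word */
    uint8_t syndrome;   /* assumed: bits 7..0 of the dumped word  */
};

/* Split one 32-bit word, as printed by the dump, into the two fields.
 * The bit positions are an assumption inferred from the log output. */
static struct cqe_error_fields decode_error_word(uint32_t word)
{
    struct cqe_error_fields fields;

    fields.vendor_err = (uint8_t)((word >> 8) & 0xffu);
    fields.syndrome = (uint8_t)(word & 0xffu);
    return fields;
}
```

Passing 0x08007806 to decode_error_word() gives vendor_err 0x78 and syndrome 0x06, which is what the iser_err_comp line prints.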
> > > [14562.679098] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> > > [14562.679105] IP: [<ffffffffc088141e>] iscsi_verify_itt+0x1e/0x110 [libiscsi]
> > > 
> > > crash> bt
> > > PID: 0      TASK: ffff8860aefaaf70  CPU: 9   COMMAND: "swapper/9"
> > >  #0 [ffff88beff2c3a18] machine_kexec at ffffffff8105d77b
> > >  #1 [ffff88beff2c3a78] __crash_kexec at ffffffff81108732
> > >  #2 [ffff88beff2c3b48] crash_kexec at ffffffff81108820
> > >  #3 [ffff88beff2c3b60] oops_end at ffffffff816b8778
> > >  #4 [ffff88beff2c3b88] no_context at ffffffff816a7c7a
> > >  #5 [ffff88beff2c3bd8] __bad_area_nosemaphore at ffffffff816a7d10
> > >  #6 [ffff88beff2c3c20] bad_area_nosemaphore at ffffffff816a7e7a
> > >  #7 [ffff88beff2c3c30] __do_page_fault at ffffffff816bb68e
> > >  #8 [ffff88beff2c3c90] do_page_fault at ffffffff816bb835
> > >  #9 [ffff88beff2c3cc0] page_fault at ffffffff816b7768
> > >     [exception RIP: iscsi_verify_itt+30]
> > >     RIP: ffffffffc088141e  RSP: ffff88beff2c3d78  RFLAGS:
> > > 00010286
> > >     RAX: 000000000000004c  RBX: 00000000d0000041  RCX:
> > > 0000000000000002
> > >     RDX: 00000000000000cc  RSI: 00000000d0000041  RDI:
> > > 0000000000000000
> > >     RBP: ffff88beff2c3da0   R8: 0000000040001038   R9:
> > > ffff88ae496fe01c
> > >     R10: 0000000000000000  R11: 7fffffffffffffff  R12:
> > > 0000000000000000
> > >     R13: ffff88ae496fe0e4  R14: ffff88ae496fe000  R15:
> > > 0000000000000000
> > >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > > #10 [ffff88beff2c3da8] iscsi_itt_to_ctask at ffffffffc0881527
> > > [libiscsi]
> > > #11 [ffff88beff2c3dc8] iser_task_rsp at ffffffffc069ffea
> > > [ib_iser]
> > > #12 [ffff88beff2c3e10] __ib_process_cq at ffffffffc0924fbb
> > > [ib_core]
> > > #13 [ffff88beff2c3e50] ib_poll_handler at ffffffffc0925122
> > > [ib_core]
> > > #14 [ffff88beff2c3e80] irq_poll_softirq at ffffffff813572b7
> > > #15 [ffff88beff2c3eb8] __do_softirq at ffffffff81094035
> > > #16 [ffff88beff2c3f28] call_softirq at ffffffff816c3afc
> > > #17 [ffff88beff2c3f40] do_softirq at ffffffff8102d435
> > > #18 [ffff88beff2c3f60] irq_exit at ffffffff810943b5
> > > #19 [ffff88beff2c3f78] do_IRQ at ffffffff816c4d96
> > > --- <IRQ stack> ---
> > > #20 [ffff8860ae443db8] ret_from_intr at ffffffff816b7362
> > >     [exception RIP: cpuidle_enter_state+87]
> > >     RIP: ffffffff81530b07  RSP: ffff8860ae443e60  RFLAGS:
> > > 00000202
> > >     RAX: 00000d3e7d729de6  RBX: ffff8860ae443e40  RCX:
> > > 0000000000000018
> > >     RDX: 0000000225c17d03  RSI: ffff8860ae443fd8  RDI:
> > > 00000d3e7d729de6
> > >     RBP: ffff8860ae443e88   R8: 000000000000016c   R9:
> > > 000000000000001c
> > >     R10: 0000000000000043  R11: 7fffffffffffffff  R12:
> > > 0000000000000009
> > >     R13: ffff88beff2d39a0  R14: ffffffff810b77e5  R15:
> > > ffff8860ae443de0
> > >     ORIG_RAX: ffffffffffffff5d  CS: 0010  SS: 0018
> > > #21 [ffff8860ae443e90] cpuidle_idle_call at ffffffff81530c5e
> > > #22 [ffff8860ae443ed0] arch_cpu_idle at ffffffff81034f8e
> > > #23 [ffff8860ae443ee0] cpu_startup_entry at ffffffff810eb6da
> > > #24 [ffff8860ae443f28] start_secondary at ffffffff81052222
> > > 
> > > crash> dis -l iscsi_verify_itt+30
> > > /usr/src/debug/kernel-3.10.0-693.22.1.el7/linux-3.10.0-693.22.1.el7.x86_64/drivers/scsi/libiscsi.c: 1292
> > > 0xffffffffc088141e <iscsi_verify_itt+30>:       mov    0x10(%rdi),%r13
> > > crash> 
> > > 
> > > 
> > > So it fails here:
> > > 
> > > int iscsi_verify_itt(struct iscsi_conn *conn, itt_t itt)
> > > {
> > >         struct iscsi_session *session = conn->session;
> > >         **** conn->session is invalid
> > > 
> > > RDI held the struct iscsi_conn pointer.
> > > 
> > > 0xffffffffc0881400 <iscsi_verify_itt>:  nopl   0x0(%rax,%rax,1) [FTRACE NOP]
> > > 0xffffffffc0881405 <iscsi_verify_itt+5>:        push   %rbp
> > > 0xffffffffc0881406 <iscsi_verify_itt+6>:        mov    %rsp,%rbp
> > > 0xffffffffc0881409 <iscsi_verify_itt+9>:        push   %r13
> > > 0xffffffffc088140b <iscsi_verify_itt+11>:       push   %r12
> > > 0xffffffffc088140d <iscsi_verify_itt+13>:       mov    %rdi,%r12
> > > 0xffffffffc0881410 <iscsi_verify_itt+16>:       push   %rbx
> > > 0xffffffffc0881411 <iscsi_verify_itt+17>:       mov    %esi,%ebx
> > > 0xffffffffc0881413 <iscsi_verify_itt+19>:       sub    $0x10,%rsp
> > > 0xffffffffc0881417 <iscsi_verify_itt+23>:       movl   $0x0,-0x28(%rbp)
> > > 0xffffffffc088141e <iscsi_verify_itt+30>:       mov    0x10(%rdi),%r13
> > > 
> > >    RIP: ffffffffc088141e  RSP: ffff88beff2c3d78  RFLAGS: 00010286
> > >     RAX: 000000000000004c  RBX: 00000000d0000041  RCX:
> > > 0000000000000002
> > >     RDX: 00000000000000cc  RSI: 00000000d0000041  RDI:
> > > 0000000000000000
> > >     RBP: ffff88beff2c3da0   R8: 0000000040001038   R9:
> > > ffff88ae496fe01c
> > >     R10: 0000000000000000  R11: 7fffffffffffffff  R12:
> > > 0000000000000000
> > >     R13: ffff88ae496fe0e4  R14: ffff88ae496fe000  R15:
> > > 0000000000000000
> > > 
> > > Both RDI and R12 are NULL; adding the 0x10 offset gives the bad
> > > address (CR2 = 0x10).
> > > 
> > > So we have a race somehow that trashes the conn pointer under load.
> > > 
> > > The load is clearly seeing resource issues and repeatedly failing the
> > > memory registration.
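The arithmetic behind the faulting address can be sketched in C. The struct below is a hypothetical stand-in, not the real struct iscsi_conn: its two leading members are placeholders chosen only so that `session` sits at offset 0x10, matching the `mov 0x10(%rdi),%r13` instruction in the disassembly. With a NULL base pointer the load targets address 0x10, which is the value reported in CR2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; the real struct iscsi_conn
 * has different members. Only the 0x10 offset of `session` matters. */
struct demo_conn {
    uint64_t placeholder_a; /* offset 0x00 */
    uint64_t placeholder_b; /* offset 0x08 */
    uint64_t session;       /* offset 0x10 */
};

/* Address the CPU would read for conn->session, computed without
 * dereferencing anything. */
static uintptr_t session_load_address(uintptr_t conn_base)
{
    assert(offsetof(struct demo_conn, session) == 0x10);
    return conn_base + offsetof(struct demo_conn, session);
}

/* Returns 1 when a fault address is what a NULL conn pointer would give. */
static int looks_like_null_conn(uintptr_t fault_address)
{
    return fault_address == session_load_address(0);
}
```

Calling looks_like_null_conn(0x10) returns 1, consistent with RDI being zero at the time of the fault.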
> > 
> > So, as I expected, the memreg issues are gone on 7.5, which was
> > rebased against upstream.
> > 
> > We are now hitting this, and I am unable to reproduce it in-house after
> > multiple efforts.
> > 
> > Aug  7 06:47:30 xxxxxxx kernel: WARNING: CPU: 20 PID: 36881 at lib/list_debug.c:36 __list_add+0x8a/0xc0
> > Aug  7 06:47:30 xxxxxxx kernel: list_add double add: new=ffff9f01523b92c8, prev=ffff9f01523b92c8, next=ffff9f69e4216d88.
> > Aug  7 06:47:30 xxxxxxx kernel: Modules linked in: bnx2i cnic uio
> > ip_vs
> > nf_conntrack ip6table_filter ip6_tables iptable_filter tcp_diag
> > udp_diag inet_diag unix_diag af_packet_diag n
> > etlink_diag oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3
> > rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_round_robin
> > bonding
> > rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi sc
> > si_transport_iscsi ib_srpt target_core_mod ib_srp
> > scsi_transport_srp
> > scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm
> > iw_cm
> > mlx5_ib ib_core vfat fat xfs sb_edac intel_p
> > owerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass
> > crc32_pclmul ghash_clmulni_intel iTCO_wdt aesni_intel ipmi_ssif lrw
> > iTCO_vendor_support gf128mul glue_helper ablk_helper i
> > oatdma cryptd ipmi_si pcspkr joydev ipmi_devintf hpwdt i2c_i801
> > hpilo
> > sg lpc_ich wmi dca ipmi_msghandler
> > Aug  7 06:47:30 xxxxxxx kernel: acpi_power_meter pcc_cpufreq shpchp
> > dm_multipath binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace
> > sunrpc
> > ip_tables ext4 mbcache jbd2 sd_mod crc_t10di
> > f crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea
> > sysfillrect
> > sysimgblt fb_sys_fops mlx5_core ttm mlxfw drm bnx2x tg3 devlink
> > mdio
> > crct10dif_pclmul libcrc32c crct10dif_common 
> > hpsa ptp i2c_core crc32c_intel scsi_transport_sas pps_core
> > dm_mirror
> > dm_region_hash dm_log dm_mod
> > Aug  7 06:47:30 xxxxxxx kernel: CPU: 20 PID: 36881 Comm: sh
> > Tainted:
> > P        W  OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
> > Aug  7 06:47:30 xxxxxxx kernel: Hardware name: HP ProLiant DL380
> > Gen9/ProLiant DL380 Gen9, BIOS P89 05/21/2018
> > Aug  7 06:47:30 xxxxxxx kernel: Call Trace:
> > Aug  7 06:47:30 xxxxxxx kernel: <IRQ>  [<ffffffffa650e84e>]
> > dump_stack+0x19/0x1b
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa5e91e18>]
> > __warn+0xd8/0x100
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa5e91e9f>]
> > warn_slowpath_fmt+0x5f/0x80
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6168d8a>]
> > __list_add+0x8a/0xc0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffc0ac3c75>]
> > ipoib_start_xmit+0x485/0x6d0 [ib_ipoib]
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63ec226>]
> > dev_hard_start_xmit+0x246/0x3b0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6417aba>]
> > sch_direct_xmit+0x11a/0x250
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63ef111>]
> > __dev_queue_xmit+0x4a1/0x660
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63ef2e0>]
> > dev_queue_xmit+0x10/0x20
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63fad1d>]
> > neigh_resolve_output+0x11d/0x220
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa60db10a>] ?
> > selinux_ipv4_postroute+0x1a/0x20
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa643820c>]
> > ip_finish_output+0x2ac/0x7a0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6438a03>]
> > ip_output+0x73/0xe0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6437f60>] ?
> > __ip_append_data.isra.50+0xa50/0xa50
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa64365f7>]
> > ip_local_out_sk+0x37/0x40
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6436963>]
> > ip_queue_xmit+0x143/0x3a0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6450844>]
> > tcp_transmit_skb+0x4e4/0x9e0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa64528bf>]
> > tcp_send_ack+0x11f/0x170
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6445735>]
> > tcp_send_dupack+0x25/0xd0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa644ce86>]
> > tcp_validate_incoming+0x186/0x2d0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa644d18d>]
> > tcp_rcv_established+0x1bd/0x770
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6457e6a>]
> > tcp_v4_do_rcv+0x10a/0x350
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa64595fc>]
> > tcp_v4_rcv+0x78c/0x990
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffc0feafc6>] ?
> > ip_vs_remote_request4+0x16/0x20 [ip_vs]
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa643272d>]
> > ip_local_deliver_finish+0xbd/0x200
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6432a19>]
> > ip_local_deliver+0x59/0xd0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6432670>] ?
> > ip_rcv_finish+0x370/0x370
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6432390>]
> > ip_rcv_finish+0x90/0x370
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6432d49>]
> > ip_rcv+0x2b9/0x410
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6123411>] ?
> > blk_complete_request+0x21/0x30
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63ecab9>]
> > __netif_receive_skb_core+0x729/0xa20
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63ecdc8>]
> > __netif_receive_skb+0x18/0x60
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63ece50>]
> > netif_receive_skb_internal+0x40/0xc0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63eda78>]
> > napi_gro_receive+0xd8/0x100
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffc0983183>]
> > mlx5i_handle_rx_cqe+0x2a3/0x460 [mlx5_core]
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffc09826f8>]
> > mlx5e_poll_rx_cq+0xc8/0x8b0 [mlx5_core]
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffc0983909>]
> > mlx5e_napi_poll+0x99/0x280 [mlx5_core]
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa63ed46f>]
> > net_rx_action+0x26f/0x390
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa5e9b085>]
> > __do_softirq+0xf5/0x280
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6523cec>]
> > call_softirq+0x1c/0x30
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa5e2d625>]
> > do_softirq+0x65/0xa0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa5e9b405>]
> > irq_exit+0x105/0x110
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6524f86>]
> > do_IRQ+0x56/0xf0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6517362>]
> > common_interrupt+0x162/0x162
> > Aug  7 06:47:30 xxxxxxx kernel: <EOI>  [<ffffffffa5fc12d5>] ?
> > do_read_fault.isra.60+0x5/0x1a0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa5fc5a9c>] ?
> > handle_pte_fault+0x2dc/0xc30
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa5fc7c3d>]
> > handle_mm_fault+0x39d/0x9b0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa651b547>]
> > __do_page_fault+0x197/0x4f0
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa651b8d5>]
> > do_page_fault+0x35/0x90
> > Aug  7 06:47:30 xxxxxxx kernel: [<ffffffffa6517758>]
> > page_fault+0x28/0x30
> > Aug  7 06:47:30 xxxxxxx kernel: ---[ end trace 020d3cfb07217435 ]---
> > 
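In the "list_add double add" warning above, new and prev are the same pointer, which is what happens when an entry that is already the tail of a list is appended again. Below is a minimal user-space C sketch of that kind of sanity check. It is written in the spirit of the kernel's list debugging but is not the kernel code, and all names in it are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list node. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Reject an insert when the neighbours are inconsistent or when the new
 * node is already one of them (the "double add" case). */
static bool list_add_valid(const struct list_node *new_node,
                           const struct list_node *prev,
                           const struct list_node *next)
{
    if (next->prev != prev || prev->next != next)
        return false;
    if (new_node == prev || new_node == next)
        return false;
    return true;
}

/* Append new_node before head; returns false instead of corrupting the
 * list when the check fails. */
static bool list_add_tail_checked(struct list_node *new_node,
                                  struct list_node *head)
{
    struct list_node *prev = head->prev;
    struct list_node *next = head;

    if (!list_add_valid(new_node, prev, next))
        return false;
    next->prev = new_node;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;
    return true;
}
```

Appending the same node twice to an otherwise empty list succeeds the first time and is rejected the second time, because on the second call new_node equals prev.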
> > Then this, very soon after:
> > 
> > Aug  7 06:47:48 xxxxxxx kernel: ------------[ cut here ]------------
> > Aug  7 06:47:48 xxxxxxx kernel: WARNING: CPU: 10 PID: 89058 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0
> > Aug  7 06:47:48 xxxxxxx kernel: list_del corruption, ffff9f6fba35bb70->next is LIST_POISON1 (dead000000000100)
> > 
> > Aug  7 06:47:48 xxxxxxx kernel: Modules linked in: bnx2i cnic uio
> > ip_vs nf_conntrack ip6table_filter ip6_tables iptable_filter tcp_diag
> > udp_diag inet_diag unix_diag af_packet_diag netlink_diag
> > oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3 rpcsec_gss_krb5
> > nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert
> > iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt
> > target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm
> > ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core vfat fat
> > xfs sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel
> > kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt aesni_intel
> > ipmi_ssif lrw iTCO_vendor_support gf128mul glue_helper ablk_helper
> > ioatdma cryptd ipmi_si pcspkr joydev ipmi_devintf hpwdt i2c_i801
> > hpilo sg lpc_ich wmi dca ipmi_msghandler
> > Aug  7 06:47:48 xxxxxxx kernel: acpi_power_meter pcc_cpufreq shpchp
> > dm_multipath binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc
> > ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic
> > i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
> > fb_sys_fops mlx5_core ttm mlxfw drm bnx2x tg3 devlink mdio
> > crct10dif_pclmul libcrc32c crct10dif_common hpsa ptp i2c_core
> > crc32c_intel scsi_transport_sas pps_core dm_mirror dm_region_hash
> > dm_log dm_mod
> > Aug  7 06:47:48 xxxxxxx kernel: CPU: 10 PID: 89058 Comm: tnslsnr
> > Tainted: P        W  OE  ------------   3.10.0-862.9.1.el7.x86_64
> > #1
> > Aug  7 06:47:48 xxxxxxx kernel: Hardware name: HP ProLiant DL380
> > Gen9/ProLiant DL380 Gen9, BIOS P89 05/21/2018
> > Aug  7 06:47:48 xxxxxxx kernel: Call Trace:
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa650e84e>]
> > dump_stack+0x19/0x1b
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa5e91e18>]
> > __warn+0xd8/0x100
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa5e91e9f>]
> > warn_slowpath_fmt+0x5f/0x80
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa6168e23>]
> > __list_del_entry+0x63/0xd0
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa6168e9d>]
> > list_del+0xd/0x30
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa5ebc226>]
> > remove_wait_queue+0x26/0x40
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa6067a5a>]
> > ep_unregister_pollwait.isra.6+0x3a/0x60
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa6067aa2>]
> > ep_remove+0x22/0xc0
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa6068f1f>]
> > SyS_epoll_ctl+0x4bf/0xc60
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa651b56c>] ?
> > __do_page_fault+0x1bc/0x4f0
> > Aug  7 06:47:48 xxxxxxx kernel: [<ffffffffa6520795>]
> > system_call_fastpath+0x1c/0x21
> > Aug  7 06:47:48 xxxxxxx kernel: ---[ end trace 020d3cfb07217438 ]
> > ---
> > 
> > 
> > These started after 7.5
> > 
> > messages-20180806:Aug  4 20:10:54 xxxxxxx kernel: WARNING: CPU: 1
> > PID:
> > 48632 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  4 20:10:54 xxxxxxx kernel: list_del
> > corruption.
> > prev->next should be ffff9f6991eb6648, but was           (null)
> > messages-20180806:Aug  4 20:10:54 xxxxxxx kernel:
> > [<ffffffffa6168e61>]
> > __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  5 00:25:39 xxxxxxx kernel: WARNING: CPU: 3
> > PID:
> > 84714 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  5 00:25:39 xxxxxxx kernel: list_del
> > corruption.
> > prev->next should be ffff9f0bb12206c8, but was           (null)
> > messages-20180806:Aug  5 00:25:39 xxxxxxx kernel:
> > [<ffffffffa6168e61>]
> > __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  5 00:33:42 xxxxxxx kernel: WARNING: CPU: 4
> > PID:
> > 80177 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  5 00:33:42 xxxxxxx kernel: list_del
> > corruption.
> > prev->next should be ffff9f69546a7ac8, but was dead000000000200
> > messages-20180806:Aug  5 00:33:42 xxxxxxx kernel:
> > [<ffffffffa6168e61>]
> > __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  5 00:40:14 xxxxxxx kernel: WARNING: CPU: 13
> > PID:
> > 80177 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  5 00:40:14 xxxxxxx kernel: list_del
> > corruption.
> > prev->next should be ffff9f44a5b9f248, but was           (null)
> > messages-20180806:Aug  5 00:40:14 xxxxxxx kernel:
> > [<ffffffffa6168e61>]
> > __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  5 00:43:16 xxxxxxx kernel: WARNING: CPU: 0
> > PID:
> > 51133 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
> > messages-20180806:Aug  5 00:43:16 xxxxxxx kernel: list_del
> > corruption.
> > prev->next should be ffff9f6792776d48, but was           (null)
> > messages-20180806:Aug  5 00:43:16 xxxxxxx kernel:
> > [<ffffffffa6168e61>]
> > __list_del_entry+0xa1/0xd0
> > It will be tough to get upstream tested here, so I am continuing to
> > try to reproduce.
> > 
> > Has anybody seen this list corruption before?
> 
> I forgot to include this which is important
> 
> It's a list corruption, now in the ipoib code.
> 
> Aug  6 23:16:30 xxxxxxxxxx kernel: WARNING: CPU: 9 PID: 10865 at
> lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
> Aug  6 23:16:30 xxxxxxxxxx kernel: list_del corruption. prev->next
> should be ffff9f63a3219fc8, but was dead000000000200
> Aug  6 23:16:30 xxxxxxxxxx kernel: Modules linked in: bnx2i cnic uio
> ip_vs nf_conntrack ip6table_filter ip6_tables iptable_filter tcp_diag
> udp_diag inet_diag unix_diag af_packet_diag netlink_diag
> oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3 rpcsec_gss_krb5
> nfsv4
> dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert
> iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt
> target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm
> ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core vfat fat
> xfs sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel
> kvm
> irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt aesni_intel
> ipmi_ssif lrw iTCO_vendor_support gf128mul glue_helper ablk_helper
> ioatdma cryptd ipmi_si pcspkr joydev ipmi_devintf hpwdt i2c_i801
> hpilo
> sg lpc_ich wmi dca ipmi_msghandler
> Aug  6 23:16:30 xxxxxxxxxx kernel: acpi_power_meter pcc_cpufreq
> shpchp
> dm_multipath binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc
> ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic
> i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
> fb_sys_fops mlx5_core ttm mlxfw drm bnx2x tg3 devlink mdio
> crct10dif_pclmul libcrc32c crct10dif_common hpsa ptp i2c_core
> crc32c_intel scsi_transport_sas pps_core dm_mirror dm_region_hash
> dm_log dm_mod
> Aug  6 23:16:30 xxxxxxxxxx kernel: CPU: 9 PID: 10865 Comm:
> kworker/u48:3 Tainted: P        W  OE  ------------   3.10.0-
> 862.9.1.el7.x86_64 #1
> Aug  6 23:16:30 xxxxxxxxxx kernel: Hardware name: HP ProLiant DL380
> Gen9/ProLiant DL380 Gen9, BIOS P89 05/21/2018
> Aug  6 23:16:30 xxxxxxxxxx kernel: Workqueue: ipoib_wq
> ipoib_reap_neigh
> [ib_ipoib]
> Aug  6 23:16:30 xxxxxxxxxx kernel: Call Trace:
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa650e84e>]
> dump_stack+0x19/0x1b
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5e91e18>]
> __warn+0xd8/0x100
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5e91e9f>]
> warn_slowpath_fmt+0x5f/0x80
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5e2959e>] ?
> __switch_to+0xce/0x580
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa6168e61>]
> __list_del_entry+0xa1/0xd0
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffc0ac2224>]
> ipoib_reap_neigh+0x174/0x1a0 [ib_ipoib]
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5eb35ef>]
> process_one_work+0x17f/0x440
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5eb4686>]
> worker_thread+0x126/0x3c0
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5eb4560>] ?
> manage_workers.isra.24+0x2a0/0x2a0
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5ebb621>]
> kthread+0xd1/0xe0
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5ebb550>] ?
> insert_kthread_work+0x40/0x40
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa65205f7>]
> ret_from_fork_nospec_begin+0x21/0x21
> Aug  6 23:16:30 xxxxxxxxxx kernel: [<ffffffffa5ebb550>] ?
> insert_kthread_work+0x40/0x40
> Aug  6 23:16:30 xxxxxxxxxx kernel: ---[ end trace 020d3cfb07217423 ]-
> --
> 
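For anyone triaging similar logs, the value reported in a list_del warning is a useful first clue: dead000000000100 and dead000000000200 are the x86_64 list poison values (LIST_POISON1 and LIST_POISON2) that list_del() writes into an entry it has removed, while (null) means the pointer had been zeroed. A minimal Python sketch for sorting such messages follows. The function name and the labels it returns are my own, and it assumes the mail-wrapped log lines have first been rejoined into a single message.

```python
import re

# x86_64 poison values written by list_del() (include/linux/poison.h).
LIST_POISON1 = 0xdead000000000100  # stored in entry->next
LIST_POISON2 = 0xdead000000000200  # stored in entry->prev

# "list_del corruption, <addr>->next is LIST_POISON1 (<value>)"
_IS_POISON = re.compile(r"[0-9a-f]+->(?:next|prev) is LIST_POISON([12])")
# "list_del corruption. prev->next should be <addr>, but was <value>"
_BUT_WAS = re.compile(
    r"(?:prev->next|next->prev) should be [0-9a-f]+, "
    r"but was\s+(\(null\)|[0-9a-f]+)"
)


def classify_list_del(message):
    """Return 'poison1', 'poison2', 'null', 'other' or 'unknown'.

    Hypothetical helper: the labels only describe the value the kernel
    reported; they do not identify the root cause.
    """
    m = _IS_POISON.search(message)
    if m:
        return "poison" + m.group(1)
    m = _BUT_WAS.search(message)
    if not m:
        return "unknown"
    found = m.group(1)
    if found == "(null)":
        return "null"
    value = int(found, 16)
    if value == LIST_POISON1:
        return "poison1"
    if value == LIST_POISON2:
        return "poison2"
    return "other"
```

Feeding it the messages quoted above gives 'poison1' for the epoll trace, 'poison2' for the ipoib_reap_neigh trace and 'null' for most of the Aug 4 and Aug 5 entries.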

Following up here (I know I had no responses, but this is in case
others see this issue).

Alaa Hleihel pointed out this commit as the fix for the list issues:

commit 16ba3defb8bd01a9464ba4820a487f5b196b455b
Author: Erez Shitrit <erezsh@xxxxxxxxxxxx>
Date:   Sun Dec 31 15:33:15 2017 +0200

    IB/ipoib: Fix race condition in neigh creation
    
    When using enhanced mode for IPoIB, two threads may execute xmit in
    parallel to two different TX queues while the target is the same.
    In this case, both of them will add the same neighbor to the path's
    neigh link list and we might see the following message:
    
      list_add double add: new=ffff88024767a348, prev=ffff88024767a348...
      WARNING: lib/list_debug.c:31__list_add_valid+0x4e/0x70
      ipoib_start_xmit+0x477/0x680 [ib_ipoib]
      dev_hard_start_xmit+0xb9/0x3e0
      sch_direct_xmit+0xf9/0x250
      __qdisc_run+0x176/0x5d0
      __dev_queue_xmit+0x1f5/0xb10
      __dev_queue_xmit+0x55/0xb10
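
To make the failure mode in that commit message concrete, here is a small Python model of a circular doubly linked list with sanity checks loosely modelled on those in lib/list_debug.c. It is only an illustration: the class and function names are my own, it is single-threaded, and it does not reproduce the actual ipoib code or the fix. Adding the same entry to the tail twice trips the check with new equal to prev, which is the shape of the "list_add double add" message above.

```python
class Node:
    """A list_head-like node; a fresh node points at itself."""

    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def list_add_valid(new, prev, nxt):
    """Return an error string if the insertion looks corrupt, else None."""
    if nxt.prev is not prev:
        return "list_add corruption: next->prev should be prev"
    if prev.next is not nxt:
        return "list_add corruption: prev->next should be next"
    if new is prev or new is nxt:
        return "list_add double add"
    return None


def list_add_tail(new, head):
    """Insert new before head, refusing insertions that fail the checks."""
    prev, nxt = head.prev, head
    err = list_add_valid(new, prev, nxt)
    if err is not None:
        return err
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new
    return None


# Two transmit paths that both decide the neighbour is missing and both
# add it (modelled here as two sequential calls).
neigh_list = Node("path->neigh_list")
neigh = Node("neigh")
first = list_add_tail(neigh, neigh_list)
second = list_add_tail(neigh, neigh_list)
print(first, second)
```

The first call links the entry; the second finds that the entry is already the last element of the list and is rejected.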