Re: Seeing this on a RHEL kernel with upstream backports wondering if this was ever fixed

On 07/26/2018 08:48 AM, Laurence Oberman wrote:
Hello

https://www.spinics.net/lists/linux-rdma/msg51334.html

A RHEL 7.5 kernel with backports from upstream is hitting this.
Chuck reported it, and Sagi and Max responded, but it's not clear whether
we ever fixed this.

RHEL-7.5 data point:
-- drivers/infiniband/* (recursively) is backported to the v4.14 level,
   i.e., it includes the patch(es) mentioned in the above thread.

Laurence:
Please test with the 7.6 kernel & report back.
If that passes, RH can bisect the bug fix between v4.14 & v4.16 (the 7.6 update point for its rdma kernel core)
and backport it to 7.5-zstream.  Note: you'll have to update the rdma-core pkg to the 7.6 version as well.
All functional & bug-fix patches to mlx* (ib & enet) are in as well (same kernel references).

-dd

In this case we end up in a panic, not just messages, although the
messages were logged over and over for a long time before we finally
panicked.

crash> log | grep "memreg failure: memor" | wc -l
2414
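
The count above only says how many times the message fired. A minimal
sketch, assuming the crash log has been saved to text with each message
on one line, that also reports the first and last timestamp so the
duration of the error storm is visible. The function name and the sample
list are illustrative only; the regex matches the iser message exactly as
it appears in the log below.

```python
import re

# Matches "[<seconds>.<usec>] iser: iser_err_comp: memreg failure ..." lines.
MEMREG_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+iser: iser_err_comp: memreg failure"
)


def summarize_memreg_failures(lines):
    """Return (count, first_timestamp, last_timestamp) for memreg failures.

    Timestamps are the kernel log seconds as floats; both are None when
    no matching line is found.
    """
    stamps = []
    for line in lines:
        m = MEMREG_RE.match(line)
        if m:
            stamps.append(float(m.group("ts")))
    if not stamps:
        return 0, None, None
    return len(stamps), min(stamps), max(stamps)


# Two lines taken from the log in this message (unwrapped to one line each).
SAMPLE_LOG = [
    "[1635578.012721]  connection16:0: detected conn error (1011)",
    "[1635587.224331] iser: iser_err_comp: memreg failure: "
    "memory management operation error (6) vend_err 78",
]

count, first, last = summarize_memreg_failures(SAMPLE_LOG)
print(count, first, last)
```

Against the full log, pass an open file object instead of SAMPLE_LOG.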

crash> log
[1635578.012721]  connection16:0: detected conn error (1011)
[1635587.050688] mlx5_0:dump_cqe:262:(pid 93128): dump error cqe
[1635587.089686] 00000000 00000000 00000000 00000000
[1635587.123989] 00000000 00000000 00000000 00000000
[1635587.157494] 00000000 00000000 00000000 00000000
[1635587.190968] 00000000 08007806 250002ad ba6115d3

[1635587.224331] iser: iser_err_comp: memreg failure: memory management
operation error (6) vend_err 78
[1635587.278876]  connection15:0: detected conn error (1011)
[1635590.986286] mlx5_1:dump_cqe:262:(pid 0): dump error cqe
[1635591.021891] 00000000 00000000 00000000 00000000
[1635591.053944] 00000000 00000000 00000000 00000000
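
For reference, the first error CQE dump above can be decoded by hand. This
is a rough sketch, assuming the 64-byte layout of struct mlx5_err_cqe as I
remember it from upstream include/linux/mlx5/device.h (vendor_err_synd at
byte 54, syndrome at byte 55, then s_wqe_opcode_qpn, wqe_counter,
signature, op_own) and assuming dump_cqe prints the words in big-endian
order. The RHEL backport may differ, and the function name is made up for
illustration.

```python
import struct


def decode_mlx5_err_cqe(words):
    """Decode selected fields from the 16 32-bit words of an error CQE dump.

    Offsets are an assumption based on upstream struct mlx5_err_cqe.
    """
    if len(words) != 16:
        raise ValueError("an mlx5 CQE dump is expected to hold 16 words")
    raw = struct.pack(">16I", *words)  # rebuild the 64 raw bytes
    s_wqe_opcode_qpn = struct.unpack_from(">I", raw, 56)[0]
    return {
        "vendor_err_synd": raw[54],
        "syndrome": raw[55],
        "wqe_opcode": s_wqe_opcode_qpn >> 24,   # top 8 bits
        "qpn": s_wqe_opcode_qpn & 0xFFFFFF,     # low 24 bits
        "wqe_counter": struct.unpack_from(">H", raw, 60)[0],
        "cqe_opcode": raw[63] >> 4,             # high nibble of op_own
    }


# The four dump lines logged at 1635587.08 to 1635587.19 above.
DUMP_WORDS = [0] * 12 + [0x00000000, 0x08007806, 0x250002AD, 0xBA6115D3]

for name, value in decode_mlx5_err_cqe(DUMP_WORDS).items():
    print(f"{name}: {value:#x}")
```

Under those assumptions the decoded syndrome (6) and vendor syndrome (0x78)
agree with the "(6) vend_err 78" that iser printed for the same
completion, which is some evidence that the offsets are right. The WQE
opcode comes out as 0x25, which I believe is the UMR opcode in mlx5; that
would fit a memory registration failure, but treat it as unverified.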

[1657077.997960] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000010
[1657077.997967] IP: [<ffffffffc08a541e>] iscsi_verify_itt+0x1e/0x110
[libiscsi]
[1657077.997970] PGD 80000098de387067 PUD b8d9ffa067 PMD 0
[1657077.997971] Oops: 0000 [#1] SMP
[1657077.998009] Modules linked in: oracleasm(O) nfsv3 rpcsec_gss_krb5
nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert
iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt
target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm
ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core vfat fat
xfs sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi
kvm_intel kvm irqbypass iTCO_wdt crc32_pclmul ipmi_ssif
iTCO_vendor_support ghash_clmulni_intel aesni_intel lrw gf128mul
ipmi_si glue_helper ablk_helper cryptd sg hpwdt hpilo pcspkr
ipmi_devintf ioatdma dm_multipath i2c_i801 lpc_ich shpchp dca wmi
ipmi_msghandler pcc_cpufreq acpi_power_meter nfsd binfmt_misc
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2
sd_mod crc_t10dif crct10dif_generic
[1657077.998020]  i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
sysimgblt fb_sys_fops ttm bnx2x mlx5_core crct10dif_pclmul mdio tg3(OE)
devlink libcrc32c crct10dif_common drm hpsa(OE) ptp i2c_core
crc32c_intel scsi_transport_sas pps_core dm_mirror dm_region_hash
dm_log dm_mod
[1657077.998023] CPU: 20 PID: 41538 Comm: sh Tainted: G           OE  -
-----------   3.10.0-693.34.1.el7_bz1582551.x86_64 #1
[1657077.998024] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380
Gen9, BIOS P89 05/21/2018
[1657077.998025] task: ffff88587ce38fd0 ti: ffff884dd0af0000 task.ti:
ffff884dd0af0000
[1657077.998029] RIP: 0010:[<ffffffffc08a541e>]  [<ffffffffc08a541e>]
iscsi_verify_itt+0x1e/0x110 [libiscsi]
[1657077.998030] RSP: 0000:ffff88beff403d78  EFLAGS: 00010286
[1657077.998031] RAX: 000000000000004c RBX: 00000000b0000036 RCX:
0000000000000002
[1657077.998032] RDX: 00000000000000cc RSI: 00000000b0000036 RDI:
0000000000000000
[1657077.998033] RBP: ffff88beff403da0 R08: 0000000040032a20 R09:
ffff8896e4eaf91c
[1657077.998034] R10: 0000000000000000 R11: 00007ffff7763ca0 R12:
0000000000000000
[1657077.998035] R13: ffff8896e4eaf9e4 R14: ffff8896e4eaf900 R15:
0000000000000000
[1657077.998036] FS:  00007ffff7fe6740(0000) GS:ffff88beff400000(0000)
knlGS:0000000000000000
[1657077.998038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1657077.998039] CR2: 0000000000000010 CR3: 000000ad92eba000 CR4:
00000000003607e0
[1657077.998040] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1657077.998041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1657077.998042] Call Trace:
[1657077.998044]  <IRQ>
[1657077.998046]  [<ffffffffc08a5527>] iscsi_itt_to_ctask+0x17/0x80
[libiscsi]
[1657077.998050]  [<ffffffffc05eefea>] iser_task_rsp+0xca/0x360
[ib_iser]
[1657077.998061]  [<ffffffffc0587fbb>] __ib_process_cq+0x6b/0xe0
[ib_core]
[1657077.998066]  [<ffffffffc0588122>] ib_poll_handler+0x22/0x80
[ib_core]
[1657077.998070]  [<ffffffff81358507>] irq_poll_softirq+0xc7/0x100
[1657077.998076]  [<ffffffff81095195>] __do_softirq+0xf5/0x280
[1657077.998081]  [<ffffffff816c4e8c>] call_softirq+0x1c/0x30
[1657077.998086]  [<ffffffff8102d435>] do_softirq+0x65/0xa0
[1657077.998088]  [<ffffffff81095515>] irq_exit+0x105/0x110
[1657077.998091]  [<ffffffff816c61d6>] do_IRQ+0x56/0xf0
[1657077.998098]  [<ffffffff816b837c>] common_interrupt+0x17c/0x17c
[1657077.998099]  <EOI>
[1657077.998113] Code: ff ff ff eb a9 41 be 95 ff ff ff eb a1 0f 1f 44
00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 89 f3 48 83 ec 10 c7 45 d8 00
00 00 00 <4c> 8b 6f 10 65 48 8b 04 25 28 00 00 00 48 89 45 e0 31 c0 83
fe
[1657077.998116] RIP  [<ffffffffc08a541e>] iscsi_verify_itt+0x1e/0x110
[libiscsi]
[1657077.998116]  RSP <ffff88beff403d78>
[1657077.998117] CR2: 0000000000000010
crash>
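
The faulting instruction is the one marked <4c> in the Code: line, i.e.
the bytes 4c 8b 6f 10. A small sketch that decodes just this one
instruction form (REX.W + 8B /r with an 8-bit displacement); it is not a
general disassembler, and the function name is made up for illustration.

```python
REG64 = [
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
]


def decode_rex_mov_load(code):
    """Decode 'REX.W 8B /r disp8' and return (dest_reg, base_reg, disp)."""
    rex, opcode, modrm, disp8 = code[0], code[1], code[2], code[3]
    if (rex & 0xF0) != 0x40 or not (rex & 0x08):
        raise ValueError("expected a REX prefix with W set")
    if opcode != 0x8B:
        raise ValueError("expected opcode 8B (mov r64, r/m64)")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only the [base + disp8] form without SIB is handled")
    reg |= (rex & 0x04) << 1  # REX.R extends the destination register
    rm |= (rex & 0x01) << 3   # REX.B extends the base register
    disp = disp8 - 256 if disp8 >= 128 else disp8  # sign-extend disp8
    return REG64[reg], REG64[rm], disp


dest, base, disp = decode_rex_mov_load(bytes.fromhex("4c8b6f10"))
print(f"mov {disp:#x}(%{base}),%{dest}")
```

This decodes to a load of 0x10(%rdi) into %r13. With RDI = 0 in the
register dump, the access address is 0x10, which matches CR2. RDI carries
the first argument on x86_64, so this is consistent with
iscsi_verify_itt() being called with a NULL first argument (the
iscsi_conn pointer, if the prototype matches upstream libiscsi).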

crash> bt
PID: 41538  TASK: ffff88587ce38fd0  CPU: 20  COMMAND: "sh"
  #0 [ffff88beff403a18] machine_kexec at ffffffff8105ddeb
  #1 [ffff88beff403a78] __crash_kexec at ffffffff81109902
  #2 [ffff88beff403b48] crash_kexec at ffffffff811099f0
  #3 [ffff88beff403b60] oops_end at ffffffff816b97a8
  #4 [ffff88beff403b88] no_context at ffffffff816a8c96
  #5 [ffff88beff403bd8] __bad_area_nosemaphore at ffffffff816a8d2c
  #6 [ffff88beff403c20] bad_area_nosemaphore at ffffffff816a8e96
  #7 [ffff88beff403c30] __do_page_fault at ffffffff816bc6be
  #8 [ffff88beff403c90] do_page_fault at ffffffff816bc865
  #9 [ffff88beff403cc0] page_fault at ffffffff816b8788
     [exception RIP: iscsi_verify_itt+30]
     RIP: ffffffffc08a541e  RSP: ffff88beff403d78  RFLAGS: 00010286
     RAX: 000000000000004c  RBX: 00000000b0000036  RCX: 0000000000000002
     RDX: 00000000000000cc  RSI: 00000000b0000036  RDI: 0000000000000000
     RBP: ffff88beff403da0   R8: 0000000040032a20   R9: ffff8896e4eaf91c
     R10: 0000000000000000  R11: 00007ffff7763ca0  R12: 0000000000000000
     R13: ffff8896e4eaf9e4  R14: ffff8896e4eaf900  R15: 0000000000000000
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#10 [ffff88beff403da8] iscsi_itt_to_ctask at ffffffffc08a5527
[libiscsi]
#11 [ffff88beff403dc8] iser_task_rsp at ffffffffc05eefea [ib_iser]
#12 [ffff88beff403e10] __ib_process_cq at ffffffffc0587fbb [ib_core]
#13 [ffff88beff403e50] ib_poll_handler at ffffffffc0588122 [ib_core]
#14 [ffff88beff403e80] irq_poll_softirq at ffffffff81358507
#15 [ffff88beff403eb8] __do_softirq at ffffffff81095195
#16 [ffff88beff403f28] call_softirq at ffffffff816c4e8c
#17 [ffff88beff403f40] do_softirq at ffffffff8102d435
#18 [ffff88beff403f60] irq_exit at ffffffff81095515
#19 [ffff88beff403f78] do_IRQ at ffffffff816c61d6
--- <IRQ stack> ---
#20 [ffff884dd0af3f58] ret_from_intr at ffffffff816b837c
     RIP: 000000000041b866  RSP: 00007fffffffea28  RFLAGS: 00000206
     RAX: 0000000000000000  RBX: 00007fffffffef53  RCX: 00000000006f1a70
     RDX: 00000000006f1a70  RSI: 00000000006f1a90  RDI: 0000000000000000
     RBP: 0000000000000002   R8: 0000000000000001   R9: 0000000000000020
     R10: 0000000000000003  R11: 00007ffff7763ca0  R12: ffff88beff4061e8
     R13: 00000000ffffffff  R14: 0000000000000000  R15: 0000000000000063
     ORIG_RAX: ffffffffffffffbb  CS: 0033  SS: 002b

crash> ps -p 41538
PID: 0      TASK: ffffffff81a0e480  CPU: 0   COMMAND: "swapper/0"
  PID: 1      TASK: ffff88012e4c8000  CPU: 7   COMMAND: "systemd"
   PID: 2345   TASK: ffff885ef5eb8fd0  CPU: 14  COMMAND: "zabbix_agentd"
    PID: 2349   TASK: ffff885efcbcaf70  CPU: 1   COMMAND:
"zabbix_agentd"
     PID: 41538  TASK: ffff88587ce38fd0  CPU: 20  COMMAND: "sh"

