On Thu, 2018-07-26 at 10:28 -0400, Don Dutile wrote:
> On 07/26/2018 08:48 AM, Laurence Oberman wrote:
> > Hello
> >
> > https://www.spinics.net/lists/linux-rdma/msg51334.html
> >
> > A rhel 7.5 with backports from upstream is hitting this.
> > Chuck Reported it and Sagi and Max responded but its not clear if
> > we
> > ever fixed this.
> >
>
> RHEL-7.5 data point:
> -- drivers/infiniband/* -r is backported to v4.14.
> i.e., includes the patch(es) mentioned in the above thread.
>
> Laurence:
> Please test with 7.6 kernel & report back.
> if that passes, RH can bisect the bug fix btwn v4.14 & v4.16(the 7.6
> update point for its rdma kernel core),
> and backport to 7.5-zstream. note: you'll have to update rdma-core
> pkg to the 7.6 version as well.
> All functional & bug fix patches to mlx* (ib & enet) are in as well
> (same kernel references).
>
> -dd
>
> > In this case we land up in a panic, noty just messaging, although
> > the
> > messages logged for a long time over and over until we finally
> > panicked.
> >
> > crash> log | grep "memreg failure: memor" | wc -l
> > 2414
> >
> > crash> log
> > [1635578.012721] connection16:0: detected conn error (1011)
> > [1635587.050688] mlx5_0:dump_cqe:262:(pid 93128): dump error cqe
> > [1635587.089686] 00000000 00000000 00000000 00000000
> > [1635587.123989] 00000000 00000000 00000000 00000000
> > [1635587.157494] 00000000 00000000 00000000 00000000
> > [1635587.190968] 00000000 08007806 250002ad ba6115d3
> >
> > [1635587.224331] iser: iser_err_comp: memreg failure: memory
> > management
> > operation error (6) vend_err 78
> > [1635587.278876] connection15:0: detected conn error (1011)
> > [1635590.986286] mlx5_1:dump_cqe:262:(pid 0): dump error cqe
> > [1635591.021891] 00000000 00000000 00000000 00000000
> > [1635591.053944] 00000000 00000000 00000000 00000000
> >
> > [1657077.997960] BUG: unable to handle kernel NULL pointer
> > dereference
> > at 0000000000000010
> > [1657077.997967] IP: [<ffffffffc08a541e>]
> > iscsi_verify_itt+0x1e/0x110
> > [libiscsi]
> > [1657077.997970] PGD 80000098de387067 PUD b8d9ffa067 PMD 0
> > [1657077.997971] Oops: 0000 [#1] SMP
> > [1657077.998009] Modules linked in: oracleasm(O) nfsv3
> > rpcsec_gss_krb5
> > nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma
> > ib_isert
> > iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt
> > target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib
> > rdma_ucm
> > ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core vfat
> > fat
> > xfs sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi
> > kvm_intel kvm irqbypass iTCO_wdt crc32_pclmul ipmi_ssif
> > iTCO_vendor_support ghash_clmulni_intel aesni_intel lrw gf128mul
> > ipmi_si glue_helper ablk_helper cryptd sg hpwdt hpilo pcspkr
> > ipmi_devintf ioatdma dm_multipath i2c_i801 lpc_ich shpchp dca wmi
> > ipmi_msghandler pcc_cpufreq acpi_power_meter nfsd binfmt_misc
> > auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2
> > sd_mod crc_t10dif crct10dif_generic
> > [1657077.998020] i2c_algo_bit drm_kms_helper syscopyarea
> > sysfillrect
> > sysimgblt fb_sys_fops ttm bnx2x mlx5_core crct10dif_pclmul mdio
> > tg3(OE)
> > devlink libcrc32c crct10dif_common drm hpsa(OE) ptp i2c_core
> > crc32c_intel scsi_transport_sas pps_core dm_mirror dm_region_hash
> > dm_log dm_mod
> > [1657077.998023] CPU: 20 PID: 41538 Comm: sh Tainted:
> > G OE -
> > ----------- 3.10.0-693.34.1.el7_bz1582551.x86_64 #1
> > [1657077.998024] Hardware name: HP ProLiant DL380 Gen9/ProLiant
> > DL380
> > Gen9, BIOS P89 05/21/2018
> > [1657077.998025] task: ffff88587ce38fd0 ti: ffff884dd0af0000
> > task.ti:
> > ffff884dd0af0000
> > [1657077.998029] RIP:
> > 0010:[<ffffffffc08a541e>] [<ffffffffc08a541e>]
> > iscsi_verify_itt+0x1e/0x110 [libiscsi]
> > [1657077.998030] RSP: 0000:ffff88beff403d78 EFLAGS: 00010286
> > [1657077.998031] RAX: 000000000000004c RBX: 00000000b0000036 RCX:
> > 0000000000000002
> > [1657077.998032] RDX: 00000000000000cc RSI: 00000000b0000036 RDI:
> > 0000000000000000
> > [1657077.998033] RBP: ffff88beff403da0 R08: 0000000040032a20 R09:
> > ffff8896e4eaf91c
> > [1657077.998034] R10: 0000000000000000 R11: 00007ffff7763ca0 R12:
> > 0000000000000000
> > [1657077.998035] R13: ffff8896e4eaf9e4 R14: ffff8896e4eaf900 R15:
> > 0000000000000000
> > [1657077.998036] FS: 00007ffff7fe6740(0000)
> > GS:ffff88beff400000(0000)
> > knlGS:0000000000000000
> > [1657077.998038] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [1657077.998039] CR2: 0000000000000010 CR3: 000000ad92eba000 CR4:
> > 00000000003607e0
> > [1657077.998040] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [1657077.998041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [1657077.998042] Call Trace:
> > [1657077.998044] <IRQ>
> > [1657077.998046] [<ffffffffc08a5527>] iscsi_itt_to_ctask+0x17/0x80
> > [libiscsi]
> > [1657077.998050] [<ffffffffc05eefea>] iser_task_rsp+0xca/0x360
> > [ib_iser]
> > [1657077.998061] [<ffffffffc0587fbb>] __ib_process_cq+0x6b/0xe0
> > [ib_core]
> > [1657077.998066] [<ffffffffc0588122>] ib_poll_handler+0x22/0x80
> > [ib_core]
> > [1657077.998070] [<ffffffff81358507>] irq_poll_softirq+0xc7/0x100
> > [1657077.998076] [<ffffffff81095195>] __do_softirq+0xf5/0x280
> > [1657077.998081] [<ffffffff816c4e8c>] call_softirq+0x1c/0x30
> > [1657077.998086] [<ffffffff8102d435>] do_softirq+0x65/0xa0
> > [1657077.998088] [<ffffffff81095515>] irq_exit+0x105/0x110
> > [1657077.998091] [<ffffffff816c61d6>] do_IRQ+0x56/0xf0
> > [1657077.998098] [<ffffffff816b837c>] common_interrupt+0x17c/0x17c
> > [1657077.998099] <EOI>
> > [1657077.998113] Code: ff ff ff eb a9 41 be 95 ff ff ff eb a1 0f 1f
> > 44
> > 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 89 f3 48 83 ec 10 c7 45
> > d8 00
> > 00 00 00 <4c> 8b 6f 10 65 48 8b 04 25 28 00 00 00 48 89 45 e0 31 c0
> > 83
> > fe
> > [1657077.998116] RIP [<ffffffffc08a541e>]
> > iscsi_verify_itt+0x1e/0x110
> > [libiscsi]
> > [1657077.998116] RSP <ffff88beff403d78>
> > [1657077.998117] CR2: 0000000000000010
> > crash>
> >
> > crash> bt
> > PID: 41538 TASK: ffff88587ce38fd0 CPU: 20 COMMAND: "sh"
> > #0 [ffff88beff403a18] machine_kexec at ffffffff8105ddeb
> > #1 [ffff88beff403a78] __crash_kexec at ffffffff81109902
> > #2 [ffff88beff403b48] crash_kexec at ffffffff811099f0
> > #3 [ffff88beff403b60] oops_end at ffffffff816b97a8
> > #4 [ffff88beff403b88] no_context at ffffffff816a8c96
> > #5 [ffff88beff403bd8] __bad_area_nosemaphore at ffffffff816a8d2c
> > #6 [ffff88beff403c20] bad_area_nosemaphore at ffffffff816a8e96
> > #7 [ffff88beff403c30] __do_page_fault at ffffffff816bc6be
> > #8 [ffff88beff403c90] do_page_fault at ffffffff816bc865
> > #9 [ffff88beff403cc0] page_fault at ffffffff816b8788
> > [exception RIP: iscsi_verify_itt+30]
> > RIP: ffffffffc08a541e RSP: ffff88beff403d78 RFLAGS: 00010286
> > RAX: 000000000000004c RBX: 00000000b0000036 RCX:
> > 0000000000000002
> > RDX: 00000000000000cc RSI: 00000000b0000036 RDI:
> > 0000000000000000
> > RBP: ffff88beff403da0 R8: 0000000040032a20 R9:
> > ffff8896e4eaf91c
> > R10: 0000000000000000 R11: 00007ffff7763ca0 R12:
> > 0000000000000000
> > R13: ffff8896e4eaf9e4 R14: ffff8896e4eaf900 R15:
> > 0000000000000000
> > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
> > #10 [ffff88beff403da8] iscsi_itt_to_ctask at ffffffffc08a5527
> > [libiscsi]
> > #11 [ffff88beff403dc8] iser_task_rsp at ffffffffc05eefea [ib_iser]
> > #12 [ffff88beff403e10] __ib_process_cq at ffffffffc0587fbb
> > [ib_core]
> > #13 [ffff88beff403e50] ib_poll_handler at ffffffffc0588122
> > [ib_core]
> > #14 [ffff88beff403e80] irq_poll_softirq at ffffffff81358507
> > #15 [ffff88beff403eb8] __do_softirq at ffffffff81095195
> > #16 [ffff88beff403f28] call_softirq at ffffffff816c4e8c
> > #17 [ffff88beff403f40] do_softirq at ffffffff8102d435
> > #18 [ffff88beff403f60] irq_exit at ffffffff81095515
> > #19 [ffff88beff403f78] do_IRQ at ffffffff816c61d6
> > --- <IRQ stack> ---
> > #20 [ffff884dd0af3f58] ret_from_intr at ffffffff816b837c
> > RIP: 000000000041b866 RSP: 00007fffffffea28 RFLAGS: 00000206
> > RAX: 0000000000000000 RBX: 00007fffffffef53 RCX:
> > 00000000006f1a70
> > RDX: 00000000006f1a70 RSI: 00000000006f1a90 RDI:
> > 0000000000000000
> > RBP: 0000000000000002 R8: 0000000000000001 R9:
> > 0000000000000020
> > R10: 0000000000000003 R11: 00007ffff7763ca0 R12:
> > ffff88beff4061e8
> > R13: 00000000ffffffff R14: 0000000000000000 R15:
> > 0000000000000063
> > ORIG_RAX: ffffffffffffffbb CS: 0033 SS: 002b
> >
> > crash> ps -p 41538
> > PID: 0 TASK: ffffffff81a0e480 CPU: 0 COMMAND: "swapper/0"
> > PID: 1 TASK: ffff88012e4c8000 CPU: 7 COMMAND: "systemd"
> > PID: 2345 TASK: ffff885ef5eb8fd0 CPU: 14 COMMAND:
> > "zabbix_agentd"
> > PID: 2349 TASK: ffff885efcbcaf70 CPU: 1 COMMAND:
> > "zabbix_agentd"
> > PID: 41538 TASK: ffff88587ce38fd0 CPU: 20 COMMAND: "sh"
> >
> >

Don, I misspoke about the kernel version. It's 7.4:
3.10.0-693.34.1.el7_bz1582551.x86_64

It's the one we added the missing iscsi patches to, but the base is 7.4.

So I will test with 7.5.