On Fri, 2018-07-27 at 09:21 -0400, Laurence Oberman wrote:
> On Fri, 2018-07-27 at 08:05 -0400, Laurence Oberman wrote:
> > On Thu, 2018-07-26 at 16:02 -0400, Laurence Oberman wrote:
> > > On Thu, 2018-07-26 at 10:28 -0400, Don Dutile wrote:
> > > > On 07/26/2018 08:48 AM, Laurence Oberman wrote:
> > > > > Hello
> > > > >
> > > > > https://www.spinics.net/lists/linux-rdma/msg51334.html
> > > > >
> > > > > A RHEL 7.5 kernel with backports from upstream is hitting this.
> > > > > Chuck reported it, and Sagi and Max responded, but it's not clear
> > > > > if we ever fixed this.
> > > >
> > > > RHEL-7.5 data point:
> > > > -- drivers/infiniband/* -r is backported to v4.14,
> > > > i.e., includes the patch(es) mentioned in the above thread.
> > > >
> > > > Laurence:
> > > > Please test with the 7.6 kernel & report back.
> > > > If that passes, RH can bisect the bug fix between v4.14 & v4.16
> > > > (the 7.6 update point for its rdma kernel core) and backport it to
> > > > 7.5-zstream. Note: you'll have to update the rdma-core pkg to the
> > > > 7.6 version as well.
> > > > All functional & bug fix patches to mlx* (ib & enet) are in as well
> > > > (same kernel references).
> > > >
> > > > -dd
> > > >
> > > > > In this case we land up in a panic, not just messages, although
> > > > > the messages were logged over and over for a long time until we
> > > > > finally panicked.
> > > > > > > > > > crash> log | grep "memreg failure: memor" | wc -l > > > > > 2414 > > > > > > > > > > crash> log > > > > > [1635578.012721] connection16:0: detected conn error (1011) > > > > > [1635587.050688] mlx5_0:dump_cqe:262:(pid 93128): dump error > > > > > cqe > > > > > [1635587.089686] 00000000 00000000 00000000 00000000 > > > > > [1635587.123989] 00000000 00000000 00000000 00000000 > > > > > [1635587.157494] 00000000 00000000 00000000 00000000 > > > > > [1635587.190968] 00000000 08007806 250002ad ba6115d3 > > > > > > > > > > [1635587.224331] iser: iser_err_comp: memreg failure: memory > > > > > management > > > > > operation error (6) vend_err 78 > > > > > [1635587.278876] connection15:0: detected conn error (1011) > > > > > [1635590.986286] mlx5_1:dump_cqe:262:(pid 0): dump error cqe > > > > > [1635591.021891] 00000000 00000000 00000000 00000000 > > > > > [1635591.053944] 00000000 00000000 00000000 00000000 > > > > > > > > > > [1657077.997960] BUG: unable to handle kernel NULL pointer > > > > > dereference > > > > > at 0000000000000010 > > > > > [1657077.997967] IP: [<ffffffffc08a541e>] > > > > > iscsi_verify_itt+0x1e/0x110 > > > > > [libiscsi] > > > > > [1657077.997970] PGD 80000098de387067 PUD b8d9ffa067 PMD 0 > > > > > [1657077.997971] Oops: 0000 [#1] SMP > > > > > [1657077.998009] Modules linked in: oracleasm(O) nfsv3 > > > > > rpcsec_gss_krb5 > > > > > nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma > > > > > ib_isert > > > > > iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi > > > > > ib_srpt > > > > > target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib > > > > > rdma_ucm > > > > > ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core > > > > > vfat > > > > > fat > > > > > xfs sb_edac edac_core intel_powerclamp coretemp intel_rapl > > > > > iosf_mbi > > > > > kvm_intel kvm irqbypass iTCO_wdt crc32_pclmul ipmi_ssif > > > > > iTCO_vendor_support ghash_clmulni_intel aesni_intel lrw > > > > > gf128mul > > > > > 
ipmi_si glue_helper ablk_helper cryptd sg hpwdt hpilo pcspkr > > > > > ipmi_devintf ioatdma dm_multipath i2c_i801 lpc_ich shpchp dca > > > > > wmi > > > > > ipmi_msghandler pcc_cpufreq acpi_power_meter nfsd binfmt_misc > > > > > auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache > > > > > jbd2 > > > > > sd_mod crc_t10dif crct10dif_generic > > > > > [1657077.998020] i2c_algo_bit drm_kms_helper syscopyarea > > > > > sysfillrect > > > > > sysimgblt fb_sys_fops ttm bnx2x mlx5_core crct10dif_pclmul > > > > > mdio > > > > > tg3(OE) > > > > > devlink libcrc32c crct10dif_common drm hpsa(OE) ptp i2c_core > > > > > crc32c_intel scsi_transport_sas pps_core dm_mirror > > > > > dm_region_hash > > > > > dm_log dm_mod > > > > > [1657077.998023] CPU: 20 PID: 41538 Comm: sh Tainted: > > > > > G OE - > > > > > ----------- 3.10.0-693.34.1.el7_bz1582551.x86_64 #1 > > > > > [1657077.998024] Hardware name: HP ProLiant DL380 > > > > > Gen9/ProLiant > > > > > DL380 > > > > > Gen9, BIOS P89 05/21/2018 > > > > > [1657077.998025] task: ffff88587ce38fd0 ti: ffff884dd0af0000 > > > > > task.ti: > > > > > ffff884dd0af0000 > > > > > [1657077.998029] RIP: > > > > > 0010:[<ffffffffc08a541e>] [<ffffffffc08a541e>] > > > > > iscsi_verify_itt+0x1e/0x110 [libiscsi] > > > > > [1657077.998030] RSP: 0000:ffff88beff403d78 EFLAGS: 00010286 > > > > > [1657077.998031] RAX: 000000000000004c RBX: 00000000b0000036 > > > > > RCX: > > > > > 0000000000000002 > > > > > [1657077.998032] RDX: 00000000000000cc RSI: 00000000b0000036 > > > > > RDI: > > > > > 0000000000000000 > > > > > [1657077.998033] RBP: ffff88beff403da0 R08: 0000000040032a20 > > > > > R09: > > > > > ffff8896e4eaf91c > > > > > [1657077.998034] R10: 0000000000000000 R11: 00007ffff7763ca0 > > > > > R12: > > > > > 0000000000000000 > > > > > [1657077.998035] R13: ffff8896e4eaf9e4 R14: ffff8896e4eaf900 > > > > > R15: > > > > > 0000000000000000 > > > > > [1657077.998036] FS: 00007ffff7fe6740(0000) > > > > > GS:ffff88beff400000(0000) > > > > > 
knlGS:0000000000000000 > > > > > [1657077.998038] CS: 0010 DS: 0000 ES: 0000 CR0: > > > > > 0000000080050033 > > > > > [1657077.998039] CR2: 0000000000000010 CR3: 000000ad92eba000 > > > > > CR4: > > > > > 00000000003607e0 > > > > > [1657077.998040] DR0: 0000000000000000 DR1: 0000000000000000 > > > > > DR2: > > > > > 0000000000000000 > > > > > [1657077.998041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 > > > > > DR7: > > > > > 0000000000000400 > > > > > [1657077.998042] Call Trace: > > > > > [1657077.998044] <IRQ> > > > > > [1657077.998046] [<ffffffffc08a5527>] > > > > > iscsi_itt_to_ctask+0x17/0x80 > > > > > [libiscsi] > > > > > [1657077.998050] [<ffffffffc05eefea>] > > > > > iser_task_rsp+0xca/0x360 > > > > > [ib_iser] > > > > > [1657077.998061] [<ffffffffc0587fbb>] > > > > > __ib_process_cq+0x6b/0xe0 > > > > > [ib_core] > > > > > [1657077.998066] [<ffffffffc0588122>] > > > > > ib_poll_handler+0x22/0x80 > > > > > [ib_core] > > > > > [1657077.998070] [<ffffffff81358507>] > > > > > irq_poll_softirq+0xc7/0x100 > > > > > [1657077.998076] [<ffffffff81095195>] > > > > > __do_softirq+0xf5/0x280 > > > > > [1657077.998081] [<ffffffff816c4e8c>] call_softirq+0x1c/0x30 > > > > > [1657077.998086] [<ffffffff8102d435>] do_softirq+0x65/0xa0 > > > > > [1657077.998088] [<ffffffff81095515>] irq_exit+0x105/0x110 > > > > > [1657077.998091] [<ffffffff816c61d6>] do_IRQ+0x56/0xf0 > > > > > [1657077.998098] [<ffffffff816b837c>] > > > > > common_interrupt+0x17c/0x17c > > > > > [1657077.998099] <EOI> > > > > > [1657077.998113] Code: ff ff ff eb a9 41 be 95 ff ff ff eb a1 > > > > > 0f > > > > > 1f > > > > > 44 > > > > > 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 89 f3 48 83 ec 10 > > > > > c7 > > > > > 45 > > > > > d8 00 > > > > > 00 00 00 <4c> 8b 6f 10 65 48 8b 04 25 28 00 00 00 48 89 45 e0 > > > > > 31 > > > > > c0 > > > > > 83 > > > > > fe > > > > > [1657077.998116] RIP [<ffffffffc08a541e>] > > > > > iscsi_verify_itt+0x1e/0x110 > > > > > [libiscsi] > > > > > [1657077.998116] RSP 
<ffff88beff403d78> > > > > > [1657077.998117] CR2: 0000000000000010 > > > > > crash> > > > > > > > > > > crash> bt > > > > > PID: 41538 TASK: ffff88587ce38fd0 CPU: 20 COMMAND: "sh" > > > > > #0 [ffff88beff403a18] machine_kexec at ffffffff8105ddeb > > > > > #1 [ffff88beff403a78] __crash_kexec at ffffffff81109902 > > > > > #2 [ffff88beff403b48] crash_kexec at ffffffff811099f0 > > > > > #3 [ffff88beff403b60] oops_end at ffffffff816b97a8 > > > > > #4 [ffff88beff403b88] no_context at ffffffff816a8c96 > > > > > #5 [ffff88beff403bd8] __bad_area_nosemaphore at > > > > > ffffffff816a8d2c > > > > > #6 [ffff88beff403c20] bad_area_nosemaphore at > > > > > ffffffff816a8e96 > > > > > #7 [ffff88beff403c30] __do_page_fault at ffffffff816bc6be > > > > > #8 [ffff88beff403c90] do_page_fault at ffffffff816bc865 > > > > > #9 [ffff88beff403cc0] page_fault at ffffffff816b8788 > > > > > [exception RIP: iscsi_verify_itt+30] > > > > > RIP: ffffffffc08a541e RSP: ffff88beff403d78 RFLAGS: > > > > > 00010286 > > > > > RAX: 000000000000004c RBX: 00000000b0000036 RCX: > > > > > 0000000000000002 > > > > > RDX: 00000000000000cc RSI: 00000000b0000036 RDI: > > > > > 0000000000000000 > > > > > RBP: ffff88beff403da0 R8: 0000000040032a20 R9: > > > > > ffff8896e4eaf91c > > > > > R10: 0000000000000000 R11: 00007ffff7763ca0 R12: > > > > > 0000000000000000 > > > > > R13: ffff8896e4eaf9e4 R14: ffff8896e4eaf900 R15: > > > > > 0000000000000000 > > > > > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 > > > > > #10 [ffff88beff403da8] iscsi_itt_to_ctask at ffffffffc08a5527 > > > > > [libiscsi] > > > > > #11 [ffff88beff403dc8] iser_task_rsp at ffffffffc05eefea > > > > > [ib_iser] > > > > > #12 [ffff88beff403e10] __ib_process_cq at ffffffffc0587fbb > > > > > [ib_core] > > > > > #13 [ffff88beff403e50] ib_poll_handler at ffffffffc0588122 > > > > > [ib_core] > > > > > #14 [ffff88beff403e80] irq_poll_softirq at ffffffff81358507 > > > > > #15 [ffff88beff403eb8] __do_softirq at ffffffff81095195 > > > > > #16 
> > > > > #16 [ffff88beff403f28] call_softirq at ffffffff816c4e8c
> > > > > #17 [ffff88beff403f40] do_softirq at ffffffff8102d435
> > > > > #18 [ffff88beff403f60] irq_exit at ffffffff81095515
> > > > > #19 [ffff88beff403f78] do_IRQ at ffffffff816c61d6
> > > > > --- <IRQ stack> ---
> > > > > #20 [ffff884dd0af3f58] ret_from_intr at ffffffff816b837c
> > > > >     RIP: 000000000041b866  RSP: 00007fffffffea28  RFLAGS: 00000206
> > > > >     RAX: 0000000000000000  RBX: 00007fffffffef53  RCX: 00000000006f1a70
> > > > >     RDX: 00000000006f1a70  RSI: 00000000006f1a90  RDI: 0000000000000000
> > > > >     RBP: 0000000000000002  R8:  0000000000000001  R9:  0000000000000020
> > > > >     R10: 0000000000000003  R11: 00007ffff7763ca0  R12: ffff88beff4061e8
> > > > >     R13: 00000000ffffffff  R14: 0000000000000000  R15: 0000000000000063
> > > > >     ORIG_RAX: ffffffffffffffbb  CS: 0033  SS: 002b
> > > > >
> > > > > crash> ps -p 41538
> > > > > PID: 0      TASK: ffffffff81a0e480  CPU: 0   COMMAND: "swapper/0"
> > > > > PID: 1      TASK: ffff88012e4c8000  CPU: 7   COMMAND: "systemd"
> > > > > PID: 2345   TASK: ffff885ef5eb8fd0  CPU: 14  COMMAND: "zabbix_agentd"
> > > > > PID: 2349   TASK: ffff885efcbcaf70  CPU: 1   COMMAND: "zabbix_agentd"
> > > > > PID: 41538  TASK: ffff88587ce38fd0  CPU: 20  COMMAND: "sh"
> > > >
> > > > Don
> > >
> > > I misspoke about the kernel version, it's 7.4:
> > > 3.10.0-693.34.1.el7_bz1582551.x86_64
> > > It's the one we added the missing iscsi patches to, but the base is
> > > 7.4, so I will test with 7.5.
> >
> > Don, I had another look at this.
> >
> > It's not the SG_GAPS issue causing a memory registration error that I
> > reported and we fixed in 7.5 from upstream.
> >
> > Which commit in 7.5 did we pull in to fix this from upstream?
> >
> > I think this is different and not yet fixed??
> > > > [14556.614551] iser: iser_err_comp: memreg failure: memory > > management > > operation error (6) vend_err 78 > > [14556.666134] connection1:0: detected conn error (1011) > > [14562.678414] mlx5_1:dump_cqe:262:(pid 0): dump error cqe > > [14562.678529] mlx5_0:dump_cqe:262:(pid 0): dump error cqe > > [14562.678530] 00000000 00000000 00000000 00000000 > > [14562.678531] 00000000 00000000 00000000 00000000 > > [14562.678531] 00000000 00000000 00000000 00000000 > > [14562.678532] 00000000 08007806 25000344 34681cd2 > > [14562.678535] iser: iser_err_comp: memreg failure: memory > > management > > operation error (6) vend_err 78 > > [14562.678544] connection1:0: detected conn error (1011) > > [14562.679098] BUG: unable to handle kernel NULL pointer > > dereference > > at > > 0000000000000010 > > [14562.679105] IP: [<ffffffffc088141e>] iscsi_verify_itt+0x1e/0x110 > > [libiscsi] > > [14562.679106] PGD 0 > > [14562.679107] Oops: 0000 [#1] SMP > > [14562.679134] Modules linked in: ip6table_filter ip6_tables > > iptable_filter sctp_diag sctp tcp_diag udp_diag inet_diag unix_diag > > af_packet_diag netlink_diag bnx2i cnic uio ip_vs nf_conntrack > > oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3 rpcsec_gss_krb5 > > nfsv4 > > dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert > > iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt > > target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib > > rdma_ucm > > ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core xfs > > vfat > > fat sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi > > kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel > > aesni_intel > > lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt > > iTCO_vendor_support ipmi_ssif pcspkr ipmi_si dm_multipath ioatdma > > lpc_ich i2c_i801 sg hpilo > > [14562.679152] hpwdt dca ipmi_devintf ipmi_msghandler pcc_cpufreq > > shpchp wmi acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl > > lockd > > grace 
sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif > > crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea > > sysfillrect > > sysimgblt fb_sys_fops ttm bnx2x mlx5_core devlink mdio tg3(OE) > > libcrc32c drm crct10dif_pclmul hpsa(OE) crct10dif_common ptp > > i2c_core > > crc32c_intel scsi_transport_sas pps_core dm_mirror dm_region_hash > > dm_log dm_mod > > [14562.679154] CPU: 9 PID: 0 Comm: swapper/9 Tainted: > > P OE - > > ----------- 3.10.0-693.22.1.el7.x86_64 #1 > > [14562.679155] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 > > Gen9, BIOS P89 05/21/2018 > > [14562.679156] task: ffff8860aefaaf70 ti: ffff8860ae440000 task.ti: > > ffff8860ae440000 > > [14562.679158] RIP: 0010:[<ffffffffc088141e>] [<ffffffffc088141e>] > > iscsi_verify_itt+0x1e/0x110 [libiscsi] > > [14562.679159] RSP: 0018:ffff88beff2c3d78 EFLAGS: 00010286 > > [14562.679160] RAX: 000000000000004c RBX: 00000000d0000041 RCX: > > 0000000000000002 > > [14562.679161] RDX: 00000000000000cc RSI: 00000000d0000041 RDI: > > 0000000000000000 > > [14562.679161] RBP: ffff88beff2c3da0 R08: 0000000040001038 R09: > > ffff88ae496fe01c > > [14562.679162] R10: 0000000000000000 R11: 7fffffffffffffff R12: > > 0000000000000000 > > [14562.679162] R13: ffff88ae496fe0e4 R14: ffff88ae496fe000 R15: > > 0000000000000000 > > [14562.679163] FS: 0000000000000000(0000) > > GS:ffff88beff2c0000(0000) > > knlGS:0000000000000000 > > [14562.679164] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [14562.679164] CR2: 0000000000000010 CR3: 000000beede48000 CR4: > > 00000000003607e0 > > [14562.679165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > 0000000000000000 > > [14562.679166] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > 0000000000000400 > > [14562.679166] Call Trace: > > [14562.679168] <IRQ> > > [14562.679170] [<ffffffffc0881527>] iscsi_itt_to_ctask+0x17/0x80 > > [libiscsi] > > [14562.679173] [<ffffffffc069ffea>] iser_task_rsp+0xca/0x360 > > [ib_iser] > > [14562.679181] [<ffffffffc0924fbb>] 
__ib_process_cq+0x6b/0xe0 > > [ib_core] > > Starts with the memreg failures > crash> log | grep "iser: iser_err_comp: memreg failure" | wc -l > 1237 > > Then the panic > > [14556.614551] iser: iser_err_comp: memreg failure: memory management > operation error (6) vend_err 78 > [14556.666134] connection1:0: detected conn error (1011) > [14562.678414] mlx5_1:dump_cqe:262:(pid 0): dump error cqe > [14562.678529] mlx5_0:dump_cqe:262:(pid 0): dump error cqe > [14562.678530] 00000000 00000000 00000000 00000000 > [14562.678531] 00000000 00000000 00000000 00000000 > [14562.678531] 00000000 00000000 00000000 00000000 > [14562.678532] 00000000 08007806 25000344 34681cd2 > [14562.678535] iser: iser_err_comp: memreg failure: memory management > operation error (6) vend_err 78 > [14562.678544] connection1:0: detected conn error (1011) > > [14562.679098] BUG: unable to handle kernel NULL pointer dereference > at > 0000000000000010 > [14562.679105] IP: [<ffffffffc088141e>] iscsi_verify_itt+0x1e/0x110 > [libiscsi] > > crash> bt > PID: 0 TASK: ffff8860aefaaf70 CPU: 9 COMMAND: "swapper/9" > #0 [ffff88beff2c3a18] machine_kexec at ffffffff8105d77b > #1 [ffff88beff2c3a78] __crash_kexec at ffffffff81108732 > #2 [ffff88beff2c3b48] crash_kexec at ffffffff81108820 > #3 [ffff88beff2c3b60] oops_end at ffffffff816b8778 > #4 [ffff88beff2c3b88] no_context at ffffffff816a7c7a > #5 [ffff88beff2c3bd8] __bad_area_nosemaphore at ffffffff816a7d10 > #6 [ffff88beff2c3c20] bad_area_nosemaphore at ffffffff816a7e7a > #7 [ffff88beff2c3c30] __do_page_fault at ffffffff816bb68e > #8 [ffff88beff2c3c90] do_page_fault at ffffffff816bb835 > #9 [ffff88beff2c3cc0] page_fault at ffffffff816b7768 > [exception RIP: iscsi_verify_itt+30] > RIP: ffffffffc088141e RSP: ffff88beff2c3d78 RFLAGS: 00010286 > RAX: 000000000000004c RBX: 00000000d0000041 RCX: > 0000000000000002 > RDX: 00000000000000cc RSI: 00000000d0000041 RDI: > 0000000000000000 > RBP: ffff88beff2c3da0 R8: 0000000040001038 R9: > ffff88ae496fe01c > R10: 
0000000000000000 R11: 7fffffffffffffff R12: > 0000000000000000 > R13: ffff88ae496fe0e4 R14: ffff88ae496fe000 R15: > 0000000000000000 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #10 [ffff88beff2c3da8] iscsi_itt_to_ctask at ffffffffc0881527 > [libiscsi] > #11 [ffff88beff2c3dc8] iser_task_rsp at ffffffffc069ffea [ib_iser] > #12 [ffff88beff2c3e10] __ib_process_cq at ffffffffc0924fbb [ib_core] > #13 [ffff88beff2c3e50] ib_poll_handler at ffffffffc0925122 [ib_core] > #14 [ffff88beff2c3e80] irq_poll_softirq at ffffffff813572b7 > #15 [ffff88beff2c3eb8] __do_softirq at ffffffff81094035 > #16 [ffff88beff2c3f28] call_softirq at ffffffff816c3afc > #17 [ffff88beff2c3f40] do_softirq at ffffffff8102d435 > #18 [ffff88beff2c3f60] irq_exit at ffffffff810943b5 > #19 [ffff88beff2c3f78] do_IRQ at ffffffff816c4d96 > --- <IRQ stack> --- > #20 [ffff8860ae443db8] ret_from_intr at ffffffff816b7362 > [exception RIP: cpuidle_enter_state+87] > RIP: ffffffff81530b07 RSP: ffff8860ae443e60 RFLAGS: 00000202 > RAX: 00000d3e7d729de6 RBX: ffff8860ae443e40 RCX: > 0000000000000018 > RDX: 0000000225c17d03 RSI: ffff8860ae443fd8 RDI: > 00000d3e7d729de6 > RBP: ffff8860ae443e88 R8: 000000000000016c R9: > 000000000000001c > R10: 0000000000000043 R11: 7fffffffffffffff R12: > 0000000000000009 > R13: ffff88beff2d39a0 R14: ffffffff810b77e5 R15: > ffff8860ae443de0 > ORIG_RAX: ffffffffffffff5d CS: 0010 SS: 0018 > #21 [ffff8860ae443e90] cpuidle_idle_call at ffffffff81530c5e > #22 [ffff8860ae443ed0] arch_cpu_idle at ffffffff81034f8e > #23 [ffff8860ae443ee0] cpu_startup_entry at ffffffff810eb6da > #24 [ffff8860ae443f28] start_secondary at ffffffff81052222 > > crash> dis -l iscsi_verify_itt+30 > /usr/src/debug/kernel-3.10.0-693.22.1.el7/linux-3.10.0- > 693.22.1.el7.x86_64/drivers/scsi/libiscsi.c: 1292 > 0xffffffffc088141e > <iscsi_verify_itt+30>: mov 0x10(%rdi),%r13 > crash> > > > So fails here > > int iscsi_verify_itt(struct iscsi_conn *conn, itt_t itt) > { > struct iscsi_session *session = conn->session; **** 
conn->session is invalid
>
> rdi had the struct iscsi_conn
>
> 0xffffffffc0881400 <iscsi_verify_itt>:       nopl   0x0(%rax,%rax,1) [FTRACE NOP]
> 0xffffffffc0881405 <iscsi_verify_itt+5>:     push   %rbp
> 0xffffffffc0881406 <iscsi_verify_itt+6>:     mov    %rsp,%rbp
> 0xffffffffc0881409 <iscsi_verify_itt+9>:     push   %r13
> 0xffffffffc088140b <iscsi_verify_itt+11>:    push   %r12
> 0xffffffffc088140d <iscsi_verify_itt+13>:    mov    %rdi,%r12
> 0xffffffffc0881410 <iscsi_verify_itt+16>:    push   %rbx
> 0xffffffffc0881411 <iscsi_verify_itt+17>:    mov    %esi,%ebx
> 0xffffffffc0881413 <iscsi_verify_itt+19>:    sub    $0x10,%rsp
> 0xffffffffc0881417 <iscsi_verify_itt+23>:    movl   $0x0,-0x28(%rbp)
> 0xffffffffc088141e <iscsi_verify_itt+30>:    mov    0x10(%rdi),%r13
>
>     RIP: ffffffffc088141e  RSP: ffff88beff2c3d78  RFLAGS: 00010286
>     RAX: 000000000000004c  RBX: 00000000d0000041  RCX: 0000000000000002
>     RDX: 00000000000000cc  RSI: 00000000d0000041  RDI: 0000000000000000
>     RBP: ffff88beff2c3da0  R8:  0000000040001038  R9:  ffff88ae496fe01c
>     R10: 0000000000000000  R11: 7fffffffffffffff  R12: 0000000000000000
>     R13: ffff88ae496fe0e4  R14: ffff88ae496fe000  R15: 0000000000000000
>
> Both RDI and R12 are NULL; offsetting by 0x10 gives the bad address.
>
> So we have a race somehow that trashes the conn pointer under load.
>
> The load clearly is seeing resource issues and repeatedly failing the
> memory registration.

So, as I expected, the memreg issues are gone on 7.5, which was rebased
against upstream. We are now hitting this, and I am unable to reproduce it
in-house after multiple attempts.

Aug  7 06:47:30 xxxxxxx kernel: WARNING: CPU: 20 PID: 36881 at lib/list_debug.c:36 __list_add+0x8a/0xc0
Aug  7 06:47:30 xxxxxxx kernel: list_add double add: new=ffff9f01523b92c8, prev=ffff9f01523b92c8, next=ffff9f69e4216d88.
Aug 7 06:47:30 xxxxxxx kernel: Modules linked in: bnx2i cnic uio ip_vs nf_conntrack ip6table_filter ip6_tables iptable_filter tcp_diag udp_diag inet_diag unix_diag af_packet_diag n etlink_diag oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi sc si_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core vfat fat xfs sb_edac intel_p owerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt aesni_intel ipmi_ssif lrw iTCO_vendor_support gf128mul glue_helper ablk_helper i oatdma cryptd ipmi_si pcspkr joydev ipmi_devintf hpwdt i2c_i801 hpilo sg lpc_ich wmi dca ipmi_msghandler Aug 7 06:47:30 xxxxxxx kernel: acpi_power_meter pcc_cpufreq shpchp dm_multipath binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10di f crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core ttm mlxfw drm bnx2x tg3 devlink mdio crct10dif_pclmul libcrc32c crct10dif_common hpsa ptp i2c_core crc32c_intel scsi_transport_sas pps_core dm_mirror dm_region_hash dm_log dm_mod Aug 7 06:47:30 xxxxxxx kernel: CPU: 20 PID: 36881 Comm: sh Tainted: P W OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 Aug 7 06:47:30 xxxxxxx kernel: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 05/21/2018 Aug 7 06:47:30 xxxxxxx kernel: Call Trace: Aug 7 06:47:30 xxxxxxx kernel: <IRQ> [<ffffffffa650e84e>] dump_stack+0x19/0x1b Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e91e18>] __warn+0xd8/0x100 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e91e9f>] warn_slowpath_fmt+0x5f/0x80 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6168d8a>] __list_add+0x8a/0xc0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc0ac3c75>] ipoib_start_xmit+0x485/0x6d0 [ib_ipoib] Aug 7 06:47:30 xxxxxxx 
kernel: [<ffffffffa63ec226>] dev_hard_start_xmit+0x246/0x3b0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6417aba>] sch_direct_xmit+0x11a/0x250 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ef111>] __dev_queue_xmit+0x4a1/0x660 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ef2e0>] dev_queue_xmit+0x10/0x20 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63fad1d>] neigh_resolve_output+0x11d/0x220 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa60db10a>] ? selinux_ipv4_postroute+0x1a/0x20 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa643820c>] ip_finish_output+0x2ac/0x7a0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6438a03>] ip_output+0x73/0xe0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6437f60>] ? __ip_append_data.isra.50+0xa50/0xa50 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa64365f7>] ip_local_out_sk+0x37/0x40 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6436963>] ip_queue_xmit+0x143/0x3a0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6450844>] tcp_transmit_skb+0x4e4/0x9e0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa64528bf>] tcp_send_ack+0x11f/0x170 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6445735>] tcp_send_dupack+0x25/0xd0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa644ce86>] tcp_validate_incoming+0x186/0x2d0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa644d18d>] tcp_rcv_established+0x1bd/0x770 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6457e6a>] tcp_v4_do_rcv+0x10a/0x350 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa64595fc>] tcp_v4_rcv+0x78c/0x990 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc0feafc6>] ? ip_vs_remote_request4+0x16/0x20 [ip_vs] Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa643272d>] ip_local_deliver_finish+0xbd/0x200 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6432a19>] ip_local_deliver+0x59/0xd0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6432670>] ? ip_rcv_finish+0x370/0x370 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6432390>] ip_rcv_finish+0x90/0x370 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6432d49>] ip_rcv+0x2b9/0x410 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6123411>] ? 
blk_complete_request+0x21/0x30 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ecab9>] __netif_receive_skb_core+0x729/0xa20 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ecdc8>] __netif_receive_skb+0x18/0x60 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ece50>] netif_receive_skb_internal+0x40/0xc0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63eda78>] napi_gro_receive+0xd8/0x100 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc0983183>] mlx5i_handle_rx_cqe+0x2a3/0x460 [mlx5_core] Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc09826f8>] mlx5e_poll_rx_cq+0xc8/0x8b0 [mlx5_core] Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffc0983909>] mlx5e_napi_poll+0x99/0x280 [mlx5_core] Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa63ed46f>] net_rx_action+0x26f/0x390 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e9b085>] __do_softirq+0xf5/0x280 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6523cec>] call_softirq+0x1c/0x30 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e2d625>] do_softirq+0x65/0xa0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5e9b405>] irq_exit+0x105/0x110 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6524f86>] do_IRQ+0x56/0xf0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6517362>] common_interrupt+0x162/0x162 Aug 7 06:47:30 xxxxxxx kernel: <EOI> [<ffffffffa5fc12d5>] ? do_read_fault.isra.60+0x5/0x1a0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5fc5a9c>] ? 
handle_pte_fault+0x2dc/0xc30 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa5fc7c3d>] handle_mm_fault+0x39d/0x9b0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa651b547>] __do_page_fault+0x197/0x4f0 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa651b8d5>] do_page_fault+0x35/0x90 Aug 7 06:47:30 xxxxxxx kernel: [<ffffffffa6517758>] page_fault+0x28/0x30 Aug 7 06:47:30 xxxxxxx kernel: ---[ end trace 020d3cfb07217435 ]--- Then this very soon after Aug 7 06:47:48 xxxxxxx kernel: ------------[ cut here ]------------ Aug 7 06:47:48 xxxxxxx kernel: WARNING: CPU: 10 PID: 89058 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0 Aug 7 06:47:48 xxxxxxx kernel: list_del corruption, ffff9f6fba35bb70- >next is LIST_POISON1 (dead000000000100) Aug 7 06:47:48 xxxxxxx kernel: Modules linked in: bnx2i cnic uio ip_vs nf_conntrack ip6table_filter ip6_tables iptable_filter tcp_diag udp_diag inet_diag unix_diag af_packet_diag n etlink_diag oracleadvm(POE) oracleoks(POE) oracleasm(O) nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_round_robin bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi sc si_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core vfat fat xfs sb_edac intel_p owerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt aesni_intel ipmi_ssif lrw iTCO_vendor_support gf128mul glue_helper ablk_helper i oatdma cryptd ipmi_si pcspkr joydev ipmi_devintf hpwdt i2c_i801 hpilo sg lpc_ich wmi dca ipmi_msghandler Aug 7 06:47:48 xxxxxxx kernel: acpi_power_meter pcc_cpufreq shpchp dm_multipath binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10di f crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core ttm mlxfw drm bnx2x tg3 devlink mdio crct10dif_pclmul libcrc32c crct10dif_common hpsa ptp i2c_core crc32c_intel scsi_transport_sas pps_core dm_mirror 
dm_region_hash dm_log dm_mod Aug 7 06:47:48 xxxxxxx kernel: CPU: 10 PID: 89058 Comm: tnslsnr Tainted: P W OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 Aug 7 06:47:48 xxxxxxx kernel: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 05/21/2018 Aug 7 06:47:48 xxxxxxx kernel: Call Trace: Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa650e84e>] dump_stack+0x19/0x1b Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa5e91e18>] __warn+0xd8/0x100 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa5e91e9f>] warn_slowpath_fmt+0x5f/0x80 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6168e23>] __list_del_entry+0x63/0xd0 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6168e9d>] list_del+0xd/0x30 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa5ebc226>] remove_wait_queue+0x26/0x40 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6067a5a>] ep_unregister_pollwait.isra.6+0x3a/0x60 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6067aa2>] ep_remove+0x22/0xc0 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6068f1f>] SyS_epoll_ctl+0x4bf/0xc60 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa651b56c>] ? __do_page_fault+0x1bc/0x4f0 Aug 7 06:47:48 xxxxxxx kernel: [<ffffffffa6520795>] system_call_fastpath+0x1c/0x21 Aug 7 06:47:48 xxxxxxx kernel: ---[ end trace 020d3cfb07217438 ]--- These started after 7.5 messages-20180806:Aug 4 20:10:54 xxxxxxx kernel: WARNING: CPU: 1 PID: 48632 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0 messages-20180806:Aug 4 20:10:54 xxxxxxx kernel: list_del corruption. prev->next should be ffff9f6991eb6648, but was (null) messages-20180806:Aug 4 20:10:54 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0 messages-20180806:Aug 5 00:25:39 xxxxxxx kernel: WARNING: CPU: 3 PID: 84714 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0 messages-20180806:Aug 5 00:25:39 xxxxxxx kernel: list_del corruption. 
prev->next should be ffff9f0bb12206c8, but was (null)
messages-20180806:Aug  5 00:25:39 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:33:42 xxxxxxx kernel: WARNING: CPU: 4 PID: 80177 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:33:42 xxxxxxx kernel: list_del corruption. prev->next should be ffff9f69546a7ac8, but was dead000000000200
messages-20180806:Aug  5 00:33:42 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:40:14 xxxxxxx kernel: WARNING: CPU: 13 PID: 80177 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:40:14 xxxxxxx kernel: list_del corruption. prev->next should be ffff9f44a5b9f248, but was (null)
messages-20180806:Aug  5 00:40:14 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:43:16 xxxxxxx kernel: WARNING: CPU: 0 PID: 51133 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
messages-20180806:Aug  5 00:43:16 xxxxxxx kernel: list_del corruption. prev->next should be ffff9f6792776d48, but was (null)
messages-20180806:Aug  5 00:43:16 xxxxxxx kernel: [<ffffffffa6168e61>] __list_del_entry+0xa1/0xd0

It will be tough to get upstream tested here, so I am continuing to try
to reproduce. Has anybody seen this list corruption before?