On Wed, 2019-03-20 at 09:45 -0400, Laurence Oberman wrote: > Hello Bart, I hope all is well with you. > > Quick question > preparing to test v5.1-rc2 SRP, my usual method is first validate the > prior kernel I had in place. > This had passed tests previously (5.0.0-rc2) but I had not run the > target server reboot test, just the disconnect tests. > > Today with mapped SRP devices I rebooted the target server and the > client panicked. > > Its been a while and I have been so busy that have not kept up with > all > the fixes. Is this a known issue. > > Thanks > Laurence > > 5414228.917507] scsi host2: ib_srp: Path record query failed: sgid > fe80:0000:0000:0000:7cfe:9003:0072:6ed3, dgid > fe80:0000:0000:0000:7cfe:9003:0072:6e4f, pkey 0xffff, service_id > 0x7cfe900300726e4e > [5414229.014355] scsi host2: reconnect attempt 7 failed (-110) > [5414239.318161] scsi host2: ib_srp: Sending CM DREQ failed > [5414239.318165] scsi host2: ib_srp: Sending CM DREQ failed > [5414239.318167] scsi host2: ib_srp: Sending CM DREQ failed > [5414239.318168] scsi host2: ib_srp: Sending CM DREQ failed > [5414239.318170] scsi host2: ib_srp: Sending CM DREQ failed > [5414239.318172] scsi host2: ib_srp: Sending CM DREQ failed > [5414239.318173] scsi host2: ib_srp: Sending CM DREQ failed > [5414239.318175] scsi host2: ib_srp: Sending CM DREQ failed > [5414243.670072] scsi host2: ib_srp: Got failed path rec status -110 > [5414243.702179] scsi host2: ib_srp: Path record query failed: sgid > fe80:0000:0000:0000:7cfe:9003:0072:6ed3, dgid > fe80:0000:0000:0000:7cfe:9003:0072:6e4f, pkey 0xffff, service_id > 0x7cfe900300726e4e > [5414243.799313] scsi host2: reconnect attempt 8 failed (-110) > [5414247.510115] scsi host1: ib_srp: Sending CM REQ failed > [5414247.510140] scsi host1: reconnect attempt 1 failed (-104) > [5414247.849078] BUG: unable to handle kernel NULL pointer > dereference > at 00000000000000b8 > [5414247.893793] #PF error: [normal kernel read fault] > [5414247.921839] PGD 0 P4D 0 > [5414247.937280] Oops: 0000 [#1] SMP PTI > [5414247.958332] CPU: 4 PID: 7773 Comm: kworker/4:1H Kdump: loaded > Tainted: G I 5.0.0-rc2+ #2 > [5414248.012856] Hardware name: HP ProLiant DL380 G7, BIOS P67 > 08/16/2015 > [5414248.026174] device-mapper: multipath: Failing path 8:48. > > > [5414248.050003] Workqueue: kblockd blk_mq_run_work_fn > [5414248.108378] RIP: 0010:blk_mq_dispatch_rq_list+0xc9/0x590 > [5414248.139724] Code: 0f 85 c2 04 00 00 83 44 24 28 01 48 8b 45 00 > 48 > 39 c5 0f 84 ea 00 00 00 48 8b 5d 00 80 3c 24 00 4c 8d 6b b8 4c 8b 63 > c8 > 75 25 <49> 8b 84 24 b8 00 00 00 48 8b 40 40 48 8b 40 10 48 85 c0 74 > 10 > 4c > [5414248.246176] RSP: 0018:ffffb1cd8760fd90 EFLAGS: 00010246 > [5414248.275599] RAX: ffffa049d67a1308 RBX: ffffa049d67a1308 RCX: > 0000000000000004 > [5414248.316090] RDX: 0000000000000000 RSI: ffffb1cd8760fe20 RDI: > ffffa0552ca08000 > [5414248.356884] RBP: ffffb1cd8760fe20 R08: 0000000000000000 R09: > 8080808080808080 > [5414248.397632] R10: 0000000000000001 R11: 0000000000000001 R12: > 0000000000000000 > [5414248.439323] R13: ffffa049d67a12c0 R14: 0000000000000000 R15: > ffffa0552ca08000 > [5414248.481743] FS: 0000000000000000(0000) > GS:ffffa04a37880000(0000) > knlGS:0000000000000000 > [5414248.528310] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [5414248.561779] CR2: 00000000000000b8 CR3: 0000000e9d40e004 CR4: > 00000000000206e0 > [5414248.602420] Call Trace: > [5414248.616660] blk_mq_sched_dispatch_requests+0x15c/0x180 > [5414248.647066] __blk_mq_run_hw_queue+0x5f/0xf0 > [5414248.672633] process_one_work+0x171/0x370 > [5414248.695443] worker_thread+0x49/0x3f0 > [5414248.716730] kthread+0xf8/0x130 > [5414248.735085] ? max_active_store+0x80/0x80 > [5414248.758569] ? kthread_bind+0x10/0x10 > [5414248.779953] ret_from_fork+0x35/0x40 > > [5414248.801005] Modules linked in: ib_isert iscsi_target_mod > target_core_mod ib_srp rpcrdma scsi_transport_srp rdma_ucm ib_iser > ib_ipoib ib_umad rdma_cm libiscsi iw_cm scsi_transport_iscsi ib_cm > sunrpc mlx5_ib ib_uverbs ib_core intel_powerclamp coretemp kvm_intel > kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul > ghash_clmulni_intel aesni_intel iTCO_wdt crypto_simd cryptd gpio_ich > iTCO_vendor_support glue_helper joydev ipmi_si dm_service_time pcspkr > ipmi_devintf hpilo hpwdt sg ipmi_msghandler acpi_power_meter lpc_ich > i7core_edac pcc_cpufreq dm_multipath ip_tables xfs libcrc32c radeon > sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt > fb_sys_fops ttm drm mlx5_core crc32c_intel serio_raw i2c_core hpsa > bnx2 > scsi_transport_sas mlxfw devlink dm_mirror dm_region_hash dm_log > dm_mod > [5414249.199354] CR2: 00000000000000b8 > > Looking at the vmcore PID: 7773 TASK: ffffa04a2c1e2b80 CPU: 4 COMMAND: "kworker/4:1H" #0 [ffffb1cd8760fab0] machine_kexec at ffffffffaaa6003f #1 [ffffb1cd8760fb08] __crash_kexec at ffffffffaab373ed #2 [ffffb1cd8760fbd0] crash_kexec at ffffffffaab385b9 #3 [ffffb1cd8760fbe8] oops_end at ffffffffaaa31931 #4 [ffffb1cd8760fc08] no_context at ffffffffaaa6eb59 #5 [ffffb1cd8760fcb0] do_page_fault at ffffffffaaa6feb2 #6 [ffffb1cd8760fce0] page_fault at ffffffffab2010ee [exception RIP: blk_mq_dispatch_rq_list+201] RIP: ffffffffaad90589 RSP: ffffb1cd8760fd90 RFLAGS: 00010246 RAX: ffffa049d67a1308 RBX: ffffa049d67a1308 RCX: 0000000000000004 RDX: 0000000000000000 RSI: ffffb1cd8760fe20 RDI: ffffa0552ca08000 RBP: ffffb1cd8760fe20 R8: 0000000000000000 R9: 8080808080808080 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: ffffa049d67a12c0 R14: 0000000000000000 R15: ffffa0552ca08000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffb1cd8760fe18] blk_mq_sched_dispatch_requests at ffffffffaad9570c #8 [ffffb1cd8760fe60] __blk_mq_run_hw_queue at ffffffffaad8de3f #9 [ffffb1cd8760fe78] process_one_work at ffffffffaaab0ab1 #10 [ffffb1cd8760feb8] worker_thread at ffffffffaaab11d9 #11 [ffffb1cd8760ff10] kthread at ffffffffaaab6758 #12 [ffffb1cd8760ff50] ret_from_fork at ffffffffab200215 We were working on this request_queue for blk_mq_sched_dispatch_requests crash> dev -d | grep ffffa0552ca08000 8 ffffa055c81b5800 sdd ffffa0552ca08000 0 0 0 ] That device was no longer accessible sdev_state = SDEV_TRANSPORT_OFFLINE, So it looks like we tried to process a no longer valid list entry in blk_mq_dispatch_rq_list /home/loberman/rpmbuild/BUILD/kernel-5.0.0_rc2+/block/blk-mq.h: 211 0xffffffffaad90589 <blk_mq_dispatch_rq_list+201>: mov 0xb8(%r12),%rax R12 is NULL From static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; if (q->mq_ops->get_budget) return q->mq_ops->get_budget(hctx); return true; } Willw ait for a reply befaore i try the newer kernel, but looks like a use after free to me