Re: Panic when rebooting target server testing srp on 5.0.0-rc2

On Wed, 2019-03-20 at 13:28 -0700, Bart Van Assche wrote:
> On Wed, 2019-03-20 at 11:11 -0400, Laurence Oberman wrote:
> > On Wed, 2019-03-20 at 09:45 -0400, Laurence Oberman wrote:
> > > Hello Bart, I hope all is well with you.
> > > 
> > > Quick question.
> > > Preparing to test v5.1-rc2 SRP, my usual method is to first
> > > validate the prior kernel I had in place.
> > > That kernel (5.0.0-rc2) had passed tests previously, but I had not
> > > run the target server reboot test, just the disconnect tests.
> > > 
> > > Today with mapped SRP devices I rebooted the target server and
> > > the
> > > client panicked.
> > > 
> > > It's been a while and I have been so busy that I have not kept up
> > > with all the fixes. Is this a known issue?
> > > 
> > > Thanks
> > > Laurence
> > > 
> > > [5414228.917507] scsi host2: ib_srp: Path record query failed: sgid fe80:0000:0000:0000:7cfe:9003:0072:6ed3, dgid fe80:0000:0000:0000:7cfe:9003:0072:6e4f, pkey 0xffff, service_id 0x7cfe900300726e4e
> > > [5414229.014355] scsi host2: reconnect attempt 7 failed (-110)
> > > [5414239.318161] scsi host2: ib_srp: Sending CM DREQ failed
> > > [5414239.318165] scsi host2: ib_srp: Sending CM DREQ failed
> > > [5414239.318167] scsi host2: ib_srp: Sending CM DREQ failed
> > > [5414239.318168] scsi host2: ib_srp: Sending CM DREQ failed
> > > [5414239.318170] scsi host2: ib_srp: Sending CM DREQ failed
> > > [5414239.318172] scsi host2: ib_srp: Sending CM DREQ failed
> > > [5414239.318173] scsi host2: ib_srp: Sending CM DREQ failed
> > > [5414239.318175] scsi host2: ib_srp: Sending CM DREQ failed
> > > [5414243.670072] scsi host2: ib_srp: Got failed path rec status -110
> > > [5414243.702179] scsi host2: ib_srp: Path record query failed: sgid fe80:0000:0000:0000:7cfe:9003:0072:6ed3, dgid fe80:0000:0000:0000:7cfe:9003:0072:6e4f, pkey 0xffff, service_id 0x7cfe900300726e4e
> > > [5414243.799313] scsi host2: reconnect attempt 8 failed (-110)
> > > [5414247.510115] scsi host1: ib_srp: Sending CM REQ failed
> > > [5414247.510140] scsi host1: reconnect attempt 1 failed (-104)
> > > [5414247.849078] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
> > > [5414247.893793] #PF error: [normal kernel read fault]
> > > [5414247.921839] PGD 0 P4D 0 
> > > [5414247.937280] Oops: 0000 [#1] SMP PTI
> > > [5414247.958332] CPU: 4 PID: 7773 Comm: kworker/4:1H Kdump: loaded Tainted: G          I       5.0.0-rc2+ #2
> > > [5414248.012856] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> > > [5414248.026174] device-mapper: multipath: Failing path 8:48.
> > > 
> > > 
> > > [5414248.050003] Workqueue: kblockd blk_mq_run_work_fn
> > > [5414248.108378] RIP: 0010:blk_mq_dispatch_rq_list+0xc9/0x590
> > > [5414248.139724] Code: 0f 85 c2 04 00 00 83 44 24 28 01 48 8b 45 00 48 39 c5 0f 84 ea 00 00 00 48 8b 5d 00 80 3c 24 00 4c 8d 6b b8 4c 8b 63 c8 75 25 <49> 8b 84 24 b8 00 00 00 48 8b 40 40 48 8b 40 10 48 85 c0 74 10 4c
> > > [5414248.246176] RSP: 0018:ffffb1cd8760fd90 EFLAGS: 00010246
> > > [5414248.275599] RAX: ffffa049d67a1308 RBX: ffffa049d67a1308 RCX: 0000000000000004
> > > [5414248.316090] RDX: 0000000000000000 RSI: ffffb1cd8760fe20 RDI: ffffa0552ca08000
> > > [5414248.356884] RBP: ffffb1cd8760fe20 R08: 0000000000000000 R09: 8080808080808080
> > > [5414248.397632] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> > > [5414248.439323] R13: ffffa049d67a12c0 R14: 0000000000000000 R15: ffffa0552ca08000
> > > [5414248.481743] FS:  0000000000000000(0000) GS:ffffa04a37880000(0000) knlGS:0000000000000000
> > > [5414248.528310] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [5414248.561779] CR2: 00000000000000b8 CR3: 0000000e9d40e004 CR4: 00000000000206e0
> > > [5414248.602420] Call Trace:
> > > [5414248.616660]  blk_mq_sched_dispatch_requests+0x15c/0x180
> > > [5414248.647066]  __blk_mq_run_hw_queue+0x5f/0xf0
> > > [5414248.672633]  process_one_work+0x171/0x370
> > > [5414248.695443]  worker_thread+0x49/0x3f0
> > > [5414248.716730]  kthread+0xf8/0x130
> > > [5414248.735085]  ? max_active_store+0x80/0x80
> > > [5414248.758569]  ? kthread_bind+0x10/0x10
> > > [5414248.779953]  ret_from_fork+0x35/0x40
> > > 
> > > [5414248.801005] Modules linked in: ib_isert iscsi_target_mod
> > > target_core_mod ib_srp rpcrdma scsi_transport_srp rdma_ucm
> > > ib_iser
> > > ib_ipoib ib_umad rdma_cm libiscsi iw_cm scsi_transport_iscsi
> > > ib_cm
> > > sunrpc mlx5_ib ib_uverbs ib_core intel_powerclamp coretemp
> > > kvm_intel
> > > kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul
> > > ghash_clmulni_intel aesni_intel iTCO_wdt crypto_simd cryptd
> > > gpio_ich
> > > iTCO_vendor_support glue_helper joydev ipmi_si dm_service_time
> > > pcspkr
> > > ipmi_devintf hpilo hpwdt sg ipmi_msghandler acpi_power_meter
> > > lpc_ich
> > > i7core_edac pcc_cpufreq dm_multipath ip_tables xfs libcrc32c
> > > radeon
> > > sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
> > > sysimgblt
> > > fb_sys_fops ttm drm mlx5_core crc32c_intel serio_raw i2c_core
> > > hpsa
> > > bnx2
> > > scsi_transport_sas mlxfw devlink dm_mirror dm_region_hash dm_log
> > > dm_mod
> > > [5414249.199354] CR2: 00000000000000b8
> > > 
> > > 
> > 
> > Looking at the vmcore
> > 
> > PID: 7773   TASK: ffffa04a2c1e2b80  CPU: 4   COMMAND: "kworker/4:1H"
> >  #0 [ffffb1cd8760fab0] machine_kexec at ffffffffaaa6003f
> >  #1 [ffffb1cd8760fb08] __crash_kexec at ffffffffaab373ed
> >  #2 [ffffb1cd8760fbd0] crash_kexec at ffffffffaab385b9
> >  #3 [ffffb1cd8760fbe8] oops_end at ffffffffaaa31931
> >  #4 [ffffb1cd8760fc08] no_context at ffffffffaaa6eb59
> >  #5 [ffffb1cd8760fcb0] do_page_fault at ffffffffaaa6feb2
> >  #6 [ffffb1cd8760fce0] page_fault at ffffffffab2010ee
> >     [exception RIP: blk_mq_dispatch_rq_list+201]
> >     RIP: ffffffffaad90589  RSP: ffffb1cd8760fd90  RFLAGS: 00010246
> >     RAX: ffffa049d67a1308  RBX: ffffa049d67a1308  RCX: 0000000000000004
> >     RDX: 0000000000000000  RSI: ffffb1cd8760fe20  RDI: ffffa0552ca08000
> >     RBP: ffffb1cd8760fe20   R8: 0000000000000000   R9: 8080808080808080
> >     R10: 0000000000000001  R11: 0000000000000001  R12: 0000000000000000
> >     R13: ffffa049d67a12c0  R14: 0000000000000000  R15: ffffa0552ca08000
> >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> >  #7 [ffffb1cd8760fe18] blk_mq_sched_dispatch_requests at ffffffffaad9570c
> >  #8 [ffffb1cd8760fe60] __blk_mq_run_hw_queue at ffffffffaad8de3f
> >  #9 [ffffb1cd8760fe78] process_one_work at ffffffffaaab0ab1
> > #10 [ffffb1cd8760feb8] worker_thread at ffffffffaaab11d9
> > #11 [ffffb1cd8760ff10] kthread at ffffffffaaab6758
> > #12 [ffffb1cd8760ff50] ret_from_fork at ffffffffab200215
> > 
> > We were working on this request_queue in
> > blk_mq_sched_dispatch_requests:
> > 
> > crash> dev -d | grep ffffa0552ca08000
> >     8  ffffa055c81b5800   sdd   ffffa0552ca08000   0   0   0
> > 
> > That device was no longer accessible:

> > sdev_state = SDEV_TRANSPORT_OFFLINE

> > So it looks like we tried to process a no-longer-valid list entry in
> > blk_mq_dispatch_rq_list.
> > 
> > /home/loberman/rpmbuild/BUILD/kernel-5.0.0_rc2+/block/blk-mq.h: 211
> > 0xffffffffaad90589 <blk_mq_dispatch_rq_list+201>:  mov    0xb8(%r12),%rax
> > 
> > R12 is NULL
> > 
> > 
> > From block/blk-mq.h:
> > static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx
> > *hctx)
> > {
> >         struct request_queue *q = hctx->queue;
> > 
> >         if (q->mq_ops->get_budget)
> >                 return q->mq_ops->get_budget(hctx);
> >         return true;
> > }
> > 
> > Will wait for a reply before I try the newer kernel, but it looks
> > like a use-after-free to me.
> 
> Hi Laurence,
> 
> I don't think that any of the recent SRP initiator changes can be the
> root cause of this crash. However, significant changes went upstream
> in the block layer core during the v5.1-rc1 merge window, e.g.
> multi-page bvec support. Is it possible for you to bisect this kernel
> oops?
> 
> Thanks,
> 
> Bart.
> 

OK, I will see if I can reproduce it on the fly and then bisect.
I agree it's not SRP; more likely some block layer race.
I was just able to reproduce it using SRP.

Note that this was on 5.0.0-rc2+, prior to me trying 5.1.

I usually reboot the target server as part of my test series, but when
I last tested 5.0.0-rc2+ I only reset the SRP interfaces and had
devices get rediscovered.

I did not see it during those tests.

I will report back when I have more to share.

Many thanks for your time, as always.
Laurence




