Re: Oops in blk_mq_get_request() (was Re: ppc64le kernel panic on 5.2.9-rc1)

Ming Lei <tom.leiming@xxxxxxxxx> · Fri, 16 Aug 2019 20:34:45 +0800

On Fri, Aug 16, 2019 at 7:15 PM Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote:
>
> Major Hayden <major@xxxxxxxxxx> writes:
> > Hello there,
> >
> > The CKI Project just found a kernel panic while running the blktests
> > test suite on stable 5.2.9-rc1[0]. Michael Ellerman requested for this
> > list to be copied on these ppc64le failures.
> >
> > We have some logs[1] for these failures and they start with
> > "ppc64le_host_2_Storage_blktests*". We hope this helps!
> >
> > [0] https://lore.kernel.org/stable/255f9af4-6087-7f56-5860-5aa0397a7631@xxxxxxxxxx/T/#t
> > [1] https://artifacts.cki-project.org/pipelines/100875/logs/
>
> Thanks for the report.
>
> It looks like you tested the stable queue yesterday, which AFAICS
> results in the exact same source tree as you tested above, and yet
> yesterday you didn't see the failure. So it's intermittent, which is
> annoying.
>
> Looking at the oops:
>
> [ 7101.930385] NIP:  c00000000067b230 LR: c00000000067b1d4 CTR: c000000000029140
> [ 7101.930400] REGS: c00020391ccc35c0 TRAP: 0300   Not tainted  (5.2.9-rc1-2440e48.cki)
> [ 7101.930413] MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44002228  XER: 20040000
> [ 7101.930433] CFAR: c0000000001d9e28 DAR: 800a00066b9e7e28 DSISR: 40000000 IRQMASK: 0
>                GPR00: c00000000067b1d4 c00020391ccc3850 c0000000016cdb00 0000067578a4ab31
>                GPR04: 0000000000000000 ffffffffffffffff 0000000001f3fb0e 0000000000000001
>                GPR08: 800a00066b9e7d80 800a00066b9e7d80 0000000000000001 c00800000cd5db88
>                                        r9
>                GPR12: c000000000029140 c000203fff6b9800 0000000000000008 c00800000d5751e0
>                GPR16: c000203985a8a6c0 c00800000e013278 c000203985a8a700 0000000000000038
>                GPR20: 0000000000000030 0000000000000028 0000000000000020 fffffffffffff000
>                GPR24: 0000000000000001 0000000000000400 0000000000000000 0000000000000023
>                GPR28: 0000000000000000 0000000000000000 c00020391ccc38c8 c000003ef1b00000
> [ 7101.930544] NIP [c00000000067b230] blk_mq_get_request+0x260/0x4b0
> [ 7101.930557] LR [c00000000067b1d4] blk_mq_get_request+0x204/0x4b0
> [ 7101.930569] Call Trace:
> [ 7101.930577] [c00020391ccc3850] [c00000000067b1d4] blk_mq_get_request+0x204/0x4b0 (unreliable)
> [ 7101.930594] [c00020391ccc38a0] [c00000000067b688] blk_mq_alloc_request_hctx+0x108/0x1b0
> [ 7101.930617] [c00020391ccc3910] [c00800000cd51aac] nvme_alloc_request+0x54/0xe0 [nvme_core]
> [ 7101.930633] [c00020391ccc3940] [c00800000cd5641c] __nvme_submit_sync_cmd+0x64/0x290 [nvme_core]
> [ 7101.930651] [c00020391ccc39c0] [c00800000d571650] nvmf_connect_io_queue+0x148/0x1e0 [nvme_fabrics]
> [ 7101.930668] [c00020391ccc3ab0] [c00800000e0106b0] nvme_loop_connect_io_queues+0x98/0xf8 [nvme_loop]
> [ 7101.930684] [c00020391ccc3af0] [c00800000e01116c] nvme_loop_create_ctrl+0x434/0x6a0 [nvme_loop]
> [ 7101.930700] [c00020391ccc3bd0] [c00800000d5724f0] nvmf_dev_write+0xd38/0x124c [nvme_fabrics]
> [ 7101.930719] [c00020391ccc3d60] [c000000000421e58] __vfs_write+0x38/0x70
> [ 7101.930731] [c00020391ccc3d80] [c000000000426188] vfs_write+0xd8/0x250
> [ 7101.930744] [c00020391ccc3dd0] [c000000000426558] ksys_write+0x78/0x130
> [ 7101.930758] [c00020391ccc3e20] [c00000000000bde4] system_call+0x5c/0x70
>
> And then the disassembly:
>
> x = op_is_sync(op)
> r27 = op = 0x0000000000000023
> c00000000067b1e0:       3e 06 67 57     clrlwi  r7,r27,24       # r7 = r27 & REQ_OP_MASK
>
> c00000000067b1e4:       00 00 07 2c     cmpwi   r7,0            # if r7 == REQ_OP_READ, x = 1 then goto label2
> x = 1
> c00000000067b1e8:       01 00 20 39     li      r9,1
>
> ...
>
> r30 = data = c00020391ccc38c8
> r8 = data->ctx = 800a00066b9e7d80
> c00000000067b208:       18 00 1e e9     ld      r8,24(r30)
> c00000000067b20c:       18 00 82 41     beq     c00000000067b224 <blk_mq_get_request+0x254>     ->      label2
>
>                 (op & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH))
> c00000000067b210:       1c 05 6a 57     rlwinm  r10,r27,0,20,14
> c00000000067b214:       68 03 4a 55     rlwinm  r10,r10,0,13,20
> c00000000067b218:       34 00 4a 7d     cntlzw  r10,r10
> c00000000067b21c:       7e d9 4a 55     rlwinm  r10,r10,27,5,31
> c00000000067b220:       01 00 49 69     xori    r9,r10,1
>
> c00000000067b224:       24 1f 29 79     rldicr  r9,r9,3,60                                      <-      label2
> r9 = x * 8
>
> c00000000067b228:       01 00 e0 38     li      r7,1    # for refcount_set
>
> r9 = data->ctx + x
> c00000000067b22c:       14 4a 28 7d     add     r9,r8,r9
>
>         data->ctx->rq_dispatched[op_is_sync(op)]++;
>
> r10 = data->ctx->rq_dispatched[x]
> c00000000067b230:       a8 00 49 e9     ld      r10,168(r9)                                     <-      NIP
>
> x++
> c00000000067b234:       01 00 4a 39     addi    r10,r10,1
>
> data->ctx->rq_dispatched[x] = r10
> c00000000067b238:       a8 00 49 f9     std     r10,168(r9)
>
>         refcount_set(&rq->ref, 1);
> c00000000067b23c:       d4 00 ff 90     stw     r7,212(r31)
>
>
> So we're oopsing at data->ctx->rq_dispatched[op_is_sync(op)]++.
>
> data->ctx looks completely bogus, ie. 800a00066b9e7d80, that's not
> anything like a valid kernel address.
>
> And also op doesn't look like a valid op value, it's 0x23, which has no
> flag bits set, but also doesn't match any of the values in req_opf.
>
> So I suspect data is pointing somewhere bogus. Or possibly it used to
> point at a blk_mq_alloc_data but doesn't anymore.
>
> Why that's happened I have no idea. I can't see any obvious commits in
> mainline or stable that mention anything similar, maybe someone on
> linux-block recognises it?
>
> cheers

Please try:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.4/block&id=556f36e90dbe7dded81f4fac084d2bc8a2458330

Strictly speaking, it is still a workaround, but it works in case that
CPU hotplug isn't
involved.

Thanks,
Ming Lei