Major Hayden <major@xxxxxxxxxx> writes: > Hello there, > > The CKI Project just found a kernel panic while running the blktests > test suite on stable 5.2.9-rc1[0]. Michael Ellerman requested for this > list to be copied on these ppc64le failures. > > We have some logs[1] for these failures and they start with > "ppc64le_host_2_Storage_blktests*". We hope this helps! > > [0] https://lore.kernel.org/stable/255f9af4-6087-7f56-5860-5aa0397a7631@xxxxxxxxxx/T/#t > [1] https://artifacts.cki-project.org/pipelines/100875/logs/ Thanks for the report. It looks like you tested the stable queue yesterday, which AFAICS results in the exact same source tree as you tested above, and yet yesterday you didn't see the failure. So it's intermittent, which is annoying. Looking at the oops: [ 7101.930385] NIP: c00000000067b230 LR: c00000000067b1d4 CTR: c000000000029140 [ 7101.930400] REGS: c00020391ccc35c0 TRAP: 0300 Not tainted (5.2.9-rc1-2440e48.cki) [ 7101.930413] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 44002228 XER: 20040000 [ 7101.930433] CFAR: c0000000001d9e28 DAR: 800a00066b9e7e28 DSISR: 40000000 IRQMASK: 0 GPR00: c00000000067b1d4 c00020391ccc3850 c0000000016cdb00 0000067578a4ab31 GPR04: 0000000000000000 ffffffffffffffff 0000000001f3fb0e 0000000000000001 GPR08: 800a00066b9e7d80 800a00066b9e7d80 0000000000000001 c00800000cd5db88 r9 GPR12: c000000000029140 c000203fff6b9800 0000000000000008 c00800000d5751e0 GPR16: c000203985a8a6c0 c00800000e013278 c000203985a8a700 0000000000000038 GPR20: 0000000000000030 0000000000000028 0000000000000020 fffffffffffff000 GPR24: 0000000000000001 0000000000000400 0000000000000000 0000000000000023 GPR28: 0000000000000000 0000000000000000 c00020391ccc38c8 c000003ef1b00000 [ 7101.930544] NIP [c00000000067b230] blk_mq_get_request+0x260/0x4b0 [ 7101.930557] LR [c00000000067b1d4] blk_mq_get_request+0x204/0x4b0 [ 7101.930569] Call Trace: [ 7101.930577] [c00020391ccc3850] [c00000000067b1d4] blk_mq_get_request+0x204/0x4b0 (unreliable) [ 7101.930594] [c00020391ccc38a0] [c00000000067b688] blk_mq_alloc_request_hctx+0x108/0x1b0 [ 7101.930617] [c00020391ccc3910] [c00800000cd51aac] nvme_alloc_request+0x54/0xe0 [nvme_core] [ 7101.930633] [c00020391ccc3940] [c00800000cd5641c] __nvme_submit_sync_cmd+0x64/0x290 [nvme_core] [ 7101.930651] [c00020391ccc39c0] [c00800000d571650] nvmf_connect_io_queue+0x148/0x1e0 [nvme_fabrics] [ 7101.930668] [c00020391ccc3ab0] [c00800000e0106b0] nvme_loop_connect_io_queues+0x98/0xf8 [nvme_loop] [ 7101.930684] [c00020391ccc3af0] [c00800000e01116c] nvme_loop_create_ctrl+0x434/0x6a0 [nvme_loop] [ 7101.930700] [c00020391ccc3bd0] [c00800000d5724f0] nvmf_dev_write+0xd38/0x124c [nvme_fabrics] [ 7101.930719] [c00020391ccc3d60] [c000000000421e58] __vfs_write+0x38/0x70 [ 7101.930731] [c00020391ccc3d80] [c000000000426188] vfs_write+0xd8/0x250 [ 7101.930744] [c00020391ccc3dd0] [c000000000426558] ksys_write+0x78/0x130 [ 7101.930758] [c00020391ccc3e20] [c00000000000bde4] system_call+0x5c/0x70 And then the disassembly: x = op_is_sync(op) r27 = op = 0x0000000000000023 c00000000067b1e0: 3e 06 67 57 clrlwi r7,r27,24 # r7 = r27 & REQ_OP_MASK c00000000067b1e4: 00 00 07 2c cmpwi r7,0 # if r7 == REQ_OP_READ, x = 1 then goto label2 x = 1 c00000000067b1e8: 01 00 20 39 li r9,1 ... r30 = data = c00020391ccc38c8 r8 = data->ctx = 800a00066b9e7d80 c00000000067b208: 18 00 1e e9 ld r8,24(r30) c00000000067b20c: 18 00 82 41 beq c00000000067b224 <blk_mq_get_request+0x254> -> label2 (op & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH)) c00000000067b210: 1c 05 6a 57 rlwinm r10,r27,0,20,14 c00000000067b214: 68 03 4a 55 rlwinm r10,r10,0,13,20 c00000000067b218: 34 00 4a 7d cntlzw r10,r10 c00000000067b21c: 7e d9 4a 55 rlwinm r10,r10,27,5,31 c00000000067b220: 01 00 49 69 xori r9,r10,1 c00000000067b224: 24 1f 29 79 rldicr r9,r9,3,60 <- label2 r9 = x * 8 c00000000067b228: 01 00 e0 38 li r7,1 # for refcount_set r9 = data->ctx + x c00000000067b22c: 14 4a 28 7d add r9,r8,r9 data->ctx->rq_dispatched[op_is_sync(op)]++; r10 = data->ctx->rq_dispatched[x] c00000000067b230: a8 00 49 e9 ld r10,168(r9) <- NIP x++ c00000000067b234: 01 00 4a 39 addi r10,r10,1 data->ctx->rq_dispatched[x] = r10 c00000000067b238: a8 00 49 f9 std r10,168(r9) refcount_set(&rq->ref, 1); c00000000067b23c: d4 00 ff 90 stw r7,212(r31) So we're oopsing at data->ctx->rq_dispatched[op_is_sync(op)]++. data->ctx looks completely bogus, ie. 800a00066b9e7d80, that's not anything like a valid kernel address. And also op doesn't look like a valid op value, it's 0x23, which has no flag bits set, but also doesn't match any of the values in req_opf. So I suspect data is pointing somewhere bogus. Or possibly it used to point at a blk_mq_alloc_data but doesn't anymore. Why that's happened I have no idea. I can't see any obvious commits in mainline or stable that mention anything similar, maybe someone on linux-block recognises it? cheers