Hi Martin, > On Feb 14, 2018, at 11:58 AM, Madhani, Himanshu <Himanshu.Madhani@xxxxxxxxxx> wrote: > > Hi John, > >> On Feb 13, 2018, at 2:54 AM, John Garry <john.garry@xxxxxxxxxx> wrote: >> >> On 12/02/2018 18:28, Himanshu Madhani wrote: >>> This patch fixes NULL pointer crash due to active timer running >>> for abort IOCB. >>> >>>> From crash dump analysis it was discoverd that get_next_timer_interrupt() >>> encountered a corrupted entry on the timer list. >>> >>> #9 [ffff95e1f6f0fd40] page_fault at ffffffff914fe8f8 >>> [exception RIP: get_next_timer_interrupt+440] >>> RIP: ffffffff90ea3088 RSP: ffff95e1f6f0fdf0 RFLAGS: 00010013 >>> RAX: ffff95e1f6451028 RBX: 000218e2389e5f40 RCX: 00000001232ad600 >>> RDX: 0000000000000001 RSI: ffff95e1f6f0fdf0 RDI: 0000000001232ad6 >>> RBP: ffff95e1f6f0fe40 R8: ffff95e1f6451188 R9: 0000000000000001 >>> R10: 0000000000000016 R11: 0000000000000016 R12: 00000001232ad5f6 >>> R13: ffff95e1f6450000 R14: ffff95e1f6f0fdf8 R15: ffff95e1f6f0fe10 >>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >>> >>> Looking at the assembly of get_next_timer_interrupt(), address came >>> from %r8 (ffff95e1f6451188) which is pointing to list_head with single >>> entry at ffff95e5ff621178. >>> >>> 0xffffffff90ea307a <get_next_timer_interrupt+426>: mov (%r8),%rdx >>> 0xffffffff90ea307d <get_next_timer_interrupt+429>: cmp %r8,%rdx >>> 0xffffffff90ea3080 <get_next_timer_interrupt+432>: je 0xffffffff90ea30a7 <get_next_timer_interrupt+471> >>> 0xffffffff90ea3082 <get_next_timer_interrupt+434>: nopw 0x0(%rax,%rax,1) >>> 0xffffffff90ea3088 <get_next_timer_interrupt+440>: testb $0x1,0x18(%rdx) >>> >>> crash> rd ffff95e1f6451188 10 >>> ffff95e1f6451188: ffff95e5ff621178 ffff95e5ff621178 x.b.....x.b..... >>> ffff95e1f6451198: ffff95e1f6451198 ffff95e1f6451198 ..E.......E..... >>> ffff95e1f64511a8: ffff95e1f64511a8 ffff95e1f64511a8 ..E.......E..... >>> ffff95e1f64511b8: ffff95e77cf509a0 ffff95e77cf509a0 ...|.......|.... >>> ffff95e1f64511c8: ffff95e1f64511c8 ffff95e1f64511c8 ..E.......E..... >>> >>> crash> rd ffff95e5ff621178 10 >>> ffff95e5ff621178: 0000000000000001 ffff95e15936aa00 ..........6Y.... >>> ffff95e5ff621188: 0000000000000000 00000000ffffffff ................ >>> ffff95e5ff621198: 00000000000000a0 0000000000000010 ................ >>> ffff95e5ff6211a8: ffff95e5ff621198 000000000000000c ..b............. >>> ffff95e5ff6211b8: 00000f5800000000 ffff95e751f8d720 ....X... ..Q.... >>> >>> ffff95e5ff621178 belongs to freed mempool object at ffff95e5ff621080. >>> >>> CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE >>> ffff95dc7fd74d00 mnt_cache 384 19785 24948 594 16k >>> SLAB MEMORY NODE TOTAL ALLOCATED FREE >>> ffffdc5dabfd8800 ffff95e5ff620000 1 42 29 13 >>> FREE / [ALLOCATED] >>> ffff95e5ff621080 (cpu 6 cache) >>> >>> Examining the contents of that memory reveals a pointer to a constant >>> string in the driver, "abort\0", which is set by qla24xx_async_abort_cmd(). >>> >>> crash> rd ffffffffc059277c 20 >>> ffffffffc059277c: 6e490074726f6261 0074707572726574 abort.Interrupt. >>> ffffffffc059278c: 00676e696c6c6f50 6920726576697244 Polling.Driver i >>> ffffffffc059279c: 646f6d207325206e 6974736554000a65 n %s mode..Testi >>> ffffffffc05927ac: 636976656420676e 786c252074612065 ng device at %lx >>> ffffffffc05927bc: 6b63656843000a2e 646f727020676e69 ...Checking prod >>> ffffffffc05927cc: 6f20444920746375 0a2e706968632066 uct ID of chip.. >>> ffffffffc05927dc: 5120646e756f4600 204130303232414c .Found QLA2200A >>> ffffffffc05927ec: 43000a2e70696843 20676e696b636568 Chip...Checking >>> ffffffffc05927fc: 65786f626c69616d 6c636e69000a2e73 mailboxes...incl >>> ffffffffc059280c: 756e696c2f656475 616d2d616d642f78 ude/linux/dma-ma >>> >>> crash> struct -ox srb_iocb >>> struct srb_iocb { >>> union { >>> struct {...} logio; >>> struct {...} els_logo; >>> struct {...} tmf; >>> struct {...} fxiocb; >>> struct {...} abt; >>> struct ct_arg ctarg; >>> struct {...} mbx; >>> struct {...} nack; >>> [0x0 ] } u; >>> [0xb8] struct timer_list timer; >>> [0x108] void (*timeout)(void *); >>> } >>> SIZE: 0x110 >>> >>> crash> ! bc >>> ibase=16 >>> obase=10 >>> B8+40 >>> F8 >>> >>> The object is a srb_t, and at offset 0xf8 within that structure >>> (i.e. ffff95e5ff621080 + f8 -> ffff95e5ff621178) is a struct timer_list. >>> >>> Cc: <stable@xxxxxxxxxxxxxxx> #4.4+ >>> Signed-off-by: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx> >>> --- >>> Hi Martin, >>> >>> This patch addresses crash due to NULL pointer access because driver >>> left active timer running for abort IOCB. >>> >>> Please apply this patch to 4.16/scsi-fixes at your earliest convenience. >>> >>> Thanks, >>> Himanshu >>> --- >>> drivers/scsi/qla2xxx/qla_init.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c >>> index 2dea1129d396..04870621e712 100644 >>> --- a/drivers/scsi/qla2xxx/qla_init.c >>> +++ b/drivers/scsi/qla2xxx/qla_init.c >>> @@ -1527,6 +1527,7 @@ `(void *ptr, int res) >>> srb_t *sp = ptr; >>> struct srb_iocb *abt = &sp->u.iocb_cmd; >>> >>> + del_timer(&sp->u.iocb_cmd.timer); >> >> Hi, >> >> I am not familiar with this code. But I note that this abort seems to be "asynchronous" (I see it's setup in qla24xx_async_abort_cmd()), so do you need the "sync" variant of del_timer() for safety? This would be to ensure the timeout has not actually occurred and is racing with the "done" callback? >> > > after reviewing code for the usage of timer i see that qla24xx_async_abort_cmd() would not be adding timer back even though the call is > asynchronous. I think its safe to use only del_timer(). > >> Thanks, >> John >> >>> complete(&abt->u.abt.comp); >>> } > > Thanks, > - Himanshu Can you please queue this patch for 4.16-rc3 Thanks, - Himanshu