On 14-09-10 11:41 AM, Christoph Hellwig wrote:
While it might not help with a blown stack, can you give the patch below a try? It tries to solve a problem where the timeout handler hits before we've fully set up a command. While I'd like to understand the root cause of why we're hitting it as well, I'd also really like to fix that race. It would also be good to get a gdb listing of the exact area in scsi_times_out listed in the oops.
RIP: 0010:[<ffffffff8127cd2e>] [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0

(gdb) disassemble scsi_times_out
Dump of assembler code for function scsi_times_out:
   0xffffffff8127d030 <+0>:     push   %rbp
   0xffffffff8127d031 <+1>:     mov    $0x2007,%esi
   0xffffffff8127d036 <+6>:     push   %rbx
   0xffffffff8127d037 <+7>:     mov    0xf8(%rdi),%rbx
   0xffffffff8127d03e <+14>:    mov    (%rbx),%rax
   0xffffffff8127d041 <+17>:    mov    %rbx,%rdi
   0xffffffff8127d044 <+20>:    mov    (%rax),%rbp
   0xffffffff8127d047 <+23>:    callq  0xffffffff81277c70 <scsi_log_completion>
   0xffffffff8127d04c <+28>:    cmpl   $0xffffffff,0x154(%rbp)
   0xffffffff8127d053 <+35>:    je     0xffffffff8127d05f <scsi_times_out+47>
   ...

which seems to agree 'objdump -drS scsi_error.o':

00000000000028b0 <scsi_times_out>:
    28b0:       55                      push   %rbp
    28b1:       be 07 20 00 00          mov    $0x2007,%esi
    28b6:       53                      push   %rbx
    28b7:       48 8b 9f f8 00 00 00    mov    0xf8(%rdi),%rbx
    28be:       48 8b 03                mov    (%rbx),%rax
    28c1:       48 89 df                mov    %rbx,%rdi
    28c4:       48 8b 28                mov    (%rax),%rbp
    28c7:       e8 00 00 00 00          callq  28cc <scsi_times_out+0x1c>
                        28c8: R_X86_64_PC32     scsi_log_completion-0x4
    28cc:       83 bd 54 01 00 00 ff    cmpl   $0xffffffff,0x154(%rbp)
From: Christoph Hellwig <hch@xxxxxx>
Subject: blk-mq: call blk_mq_start_request from ->queue_rq

When we call blk_mq_start_request from the core blk-mq code before calling into ->queue_rq there is a racy window where the timeout handler can hit before we've fully set up the driver specific part of the command. Move the call to blk_mq_start_request into the driver so the driver can start the request only once it is fully set up.
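For readers following the thread, the ordering problem in the patch description above can be modelled in a few lines of userspace C. This is only a single-threaded sketch under stated assumptions, not the blk-mq code: every name in it (toy_request, toy_cmd, toy_start_request, toy_timeout_fires, toy_simulate) is hypothetical, and the "timeout" is simply injected at the worst possible moment instead of racing on a real timer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real kernel structures. */
struct toy_cmd {
    int tag;                    /* placeholder for driver-specific state */
};

struct toy_request {
    bool started;               /* models "the timeout timer is armed" */
    struct toy_cmd *special;    /* driver-private part, filled in by queue_rq */
};

/* Models arming the timeout for a request. */
static void toy_start_request(struct toy_request *rq)
{
    rq->started = true;
}

/*
 * Models the timeout handler. Returns 1 if it would have touched a
 * request whose driver-specific part is not set up yet, 0 otherwise.
 */
static int toy_timeout_fires(const struct toy_request *rq)
{
    if (!rq->started)
        return 0;               /* timer not armed, handler never runs */
    return rq->special == NULL; /* armed but not set up: would oops */
}

enum toy_order {
    CORE_STARTS_BEFORE_QUEUE_RQ,    /* ordering before the patch */
    DRIVER_STARTS_IN_QUEUE_RQ       /* ordering after the patch */
};

/*
 * Walk one request through submission and inject a timeout inside the
 * window between hand-off and the end of driver setup.
 * Returns 1 if any injected timeout saw a half-built command.
 */
int toy_simulate(enum toy_order order)
{
    static struct toy_cmd cmd;
    struct toy_request rq = { .started = false, .special = NULL };
    int bad = 0;

    if (order == CORE_STARTS_BEFORE_QUEUE_RQ)
        toy_start_request(&rq);     /* armed before the driver is called */

    /* queue_rq begins here */
    bad |= toy_timeout_fires(&rq);  /* timeout lands inside the window */
    rq.special = &cmd;              /* driver finishes its setup */
    if (order == DRIVER_STARTS_IN_QUEUE_RQ)
        toy_start_request(&rq);     /* armed only once fully set up */

    bad |= toy_timeout_fires(&rq);  /* a timeout after setup is harmless */
    return bad;
}
```

In this model, toy_simulate(CORE_STARTS_BEFORE_QUEUE_RQ) reports the half-built command and toy_simulate(DRIVER_STARTS_IN_QUEUE_RQ) does not, which is the whole point of moving the start call into the driver. It says nothing about the blown-stack oops.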
Using my original (newer) machine with a SAS SSD, today I'm seeing only the "blown stack" oops on umount. And on the next reboot, if use_blk_mq=Y then doing the mount (on the SAS SSD) causes an instant reboot. Same with and without this patch. I'll try again with the SATA SSD (but I need to archive its contents first) and maybe I can get back to the scsi_times_out oops.