Re: lk 3.17-rc4 blk_mq large write problems

Douglas Gilbert <dgilbert@xxxxxxxxxxxx> · Thu, 11 Sep 2014 01:37:03 -0400

On 14-09-10 11:48 PM, Elliott, Robert (Server Storage) wrote:

-----Original Message-----
From: linux-scsi-owner@xxxxxxxxxxxxxxx [mailto:linux-scsi-
owner@xxxxxxxxxxxxxxx] On Behalf Of Jens Axboe
Sent: Wednesday, September 10, 2014 9:00 PM
To: dgilbert@xxxxxxxxxxxx; Christoph Hellwig
Cc: SCSI development list
Subject: Re: lk 3.17-rc4 blk_mq large write problems

On 2014-09-10 18:58, Douglas Gilbert wrote:
On 14-09-10 11:41 AM, Christoph Hellwig wrote:
While it might not help with a blown stack, can you give the patch
below
a try?  I tries to solve a problem where the timeout handler hits
before we've fully set up a command.  While I'd like to understand
the
root cause of why we're hitting it as well, I'd also really to fix
that
race. It would also be good to get a gdb listing of the exact area in
scsi_times_out listed in the oops.

RIP: 0010:[<ffffffff8127cd2e>]  [<ffffffff8127cd2e>]
scsi_times_out+0xe/0x2e0

(gdb) disassemble scsi_times_out
Dump of assembler code for function scsi_times_out:
     0xffffffff8127d030 <+0>:    push   %rbp
     0xffffffff8127d031 <+1>:    mov    $0x2007,%esi
     0xffffffff8127d036 <+6>:    push   %rbx
     0xffffffff8127d037 <+7>:    mov    0xf8(%rdi),%rbx
     0xffffffff8127d03e <+14>:    mov    (%rbx),%rax
     0xffffffff8127d041 <+17>:    mov    %rbx,%rdi
     0xffffffff8127d044 <+20>:    mov    (%rax),%rbp
     0xffffffff8127d047 <+23>:    callq  0xffffffff81277c70
<scsi_log_completion>
     0xffffffff8127d04c <+28>:    cmpl   $0xffffffff,0x154(%rbp)
     0xffffffff8127d053 <+35>:    je     0xffffffff8127d05f
<scsi_times_out+47>
...

which seems to agree 'objdump -drS scsi_error.o':

00000000000028b0 <scsi_times_out>:
      28b0:    55                       push   %rbp
      28b1:    be 07 20 00 00           mov    $0x2007,%esi
      28b6:    53                       push   %rbx
      28b7:    48 8b 9f f8 00 00 00     mov    0xf8(%rdi),%rbx
      28be:    48 8b 03                 mov    (%rbx),%rax
      28c1:    48 89 df                 mov    %rbx,%rdi
      28c4:    48 8b 28                 mov    (%rax),%rbp
      28c7:    e8 00 00 00 00           callq  28cc
<scsi_times_out+0x1c>
              28c8: R_X86_64_PC32    scsi_log_completion-0x4
      28cc:    83 bd 54 01 00 00 ff     cmpl   $0xffffffff,0x154(%rbp)

This would be more useful if you had DEBUGINFO enabled. At least it
would save use some time :-)

On my system, that function compiles to:

enum blk_eh_timer_return scsi_times_out(struct request *req)
{
     2580:       55                      push   %rbp
     2581:       48 89 e5                mov    %rsp,%rbp
     2584:       41 55                   push   %r13
     2586:       41 54                   push   %r12
     2588:       53                      push   %rbx
     2589:       48 83 ec 08             sub    $0x8,%rsp
     258d:       e8 00 00 00 00          callq  2592 <scsi_times_out+0x12>
                         258e: R_X86_64_PC32     mcount-0x4
         struct scsi_cmnd *scmd = req->special;
     2592:       48 8b 9f f8 00 00 00    mov    0xf8(%rdi),%rbx
         enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
         struct Scsi_Host *host = scmd->device->host;
     2599:       48 8b 03                mov    (%rbx),%rax
     259c:       4c 8b 20                mov    (%rax),%r12
     259f:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

So, Doug's +0xe corresponds to +0x19 here.

258d-2591 and the extra code preceding them are for stack
overflow checking, which I have enabled, and the location
for the optional ftrace jump if tracing this function is
turned on.

2592 is scmd = req->special.  %rdi is req, the first and
only function argument (per the x86_64 abi).  0xf8(%rdi)
is referencing req->special.  req must be OK, since that's
not the instruction pointer address that failed.

2599 is dereferencing req->special (i.e., scmd) to set host.
req->special must contain NULL, causing the exception
at this IP.

As a short-term workaround, it might be better to just let
this function exit if scmd is NULL, assuming that it was
already dealt with properly.  That would still be misled
if the race condition, whatever it may be, allows scmd to
be reused before this function runs; this function would
then operate on a future command.

Switching back to a SATA SSD gives me the same
"scsi_times_out+0xe" NULL de-reference oops shown at the
start of this thread.

(gdb) list *(scsi_times_out+0xe)
0xffffffff812804fe is in scsi_times_out (drivers/scsi/scsi_error.c:277).
272	 */
273	enum blk_eh_timer_return scsi_times_out(struct request *req)
274	{
275		struct scsi_cmnd *scmd = req->special;
276		enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
277		struct Scsi_Host *host = scmd->device->host;
278	
279		trace_scsi_dispatch_cmd_timeout(scmd);
280		scsi_log_completion(scmd, TIMEOUT_ERROR);

As Rob suggested.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html