Re: [RFC PATCH] blk-mq: Fix lost request during timeout

Bart Van Assche <Bart.VanAssche@xxxxxxx> · Mon, 18 Sep 2017 23:14:38 +0000

On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> > Are you sure that scenario can happen? The blk-mq core calls test_and_set_bit()
> > for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
> > called. See also blk_mark_rq_complete(). This prevents the .complete() and
> > .timeout() functions from running concurrently.
> 
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions
> from running at the same time as the timeout handler.
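
To make sure we are talking about the same sequence, here is a minimal
userspace sketch of the guard as I understand it. This is not kernel code:
struct model_request and the model_*() functions are hypothetical names, and a
C11 atomic_bool stands in for the REQ_ATOM_COMPLETE bit. It only shows that
whichever path sets the flag first runs, and the other call becomes a no-op.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct request; not the kernel type. */
struct model_request {
	atomic_bool complete;	/* models the REQ_ATOM_COMPLETE bit */
	int completions_run;	/* times the completion path actually ran */
	int timeouts_run;	/* times the timeout path actually ran */
};

/* Models blk_mark_rq_complete(): returns true if the flag was already set. */
static bool model_mark_complete(struct model_request *rq)
{
	return atomic_exchange(&rq->complete, true);
}

/* Models an LLD calling blk_mq_complete_request() for this request. */
static void model_complete_request(struct model_request *rq)
{
	if (!model_mark_complete(rq))
		rq->completions_run++;
	/* Otherwise another path already claimed the request: do nothing. */
}

/* Models the blk-mq timeout path claiming the request. */
static void model_timeout(struct model_request *rq)
{
	if (!model_mark_complete(rq))
		rq->timeouts_run++;
}

/* The ordering under discussion: timeout claims first, LLD completes later. */
void model_timeout_then_completion(struct model_request *rq)
{
	atomic_init(&rq->complete, false);
	rq->completions_run = 0;
	rq->timeouts_run = 0;

	model_timeout(rq);
	model_complete_request(rq);

	/* Exactly one of the two paths may own the request. */
	assert(rq->completions_run + rq->timeouts_run == 1);
}
```

With the timeout path first, model_timeout_then_completion() leaves
timeouts_run at 1 and completions_run at 0, so in this model the timeout
handler alone is responsible for what happens to the request next.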

Can you explain how you define "request is lost"? If a timeout occurs for a
SCSI request then scsi_times_out() calls scsi_abort_command() (if no
.eh_timed_out() callback has been defined by the LLD). It is the responsibility
of the SCSI LLD to call .scsi_done() before its .eh_abort_handler() returns
SUCCESS. If .eh_abort_handler() returns a value other than SUCCESS then the
SCSI core will escalate the error further until .scsi_done() has been called for
the command that timed out. See also scsi_abort_eh_cmnd(). So I think what you
wrote is not correct for the SCSI core and a properly implemented SCSI LLD.
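
As a rough model of that contract (again a userspace sketch, not SCSI core
code): each error-handling step either reports success, in which case it must
already have completed the command, or it fails and the next step is tried.
All model_* names and the MODEL_SUCCESS / MODEL_FAILED values are made up for
illustration; the real handlers and the real escalation order live in the SCSI
error handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_eh_result { MODEL_SUCCESS, MODEL_FAILED };

/* Hypothetical stand-in for a timed-out SCSI command. */
struct model_cmd {
	bool done_called;	/* set once the done callback has run */
	int steps_tried;	/* number of error-handling steps attempted */
};

typedef enum model_eh_result (*model_eh_step)(struct model_cmd *);

/* Models the LLD invoking .scsi_done() for the command. */
static void model_scsi_done(struct model_cmd *cmd)
{
	cmd->done_called = true;
}

/* An abort handler that cannot abort the command. */
static enum model_eh_result model_abort_fails(struct model_cmd *cmd)
{
	(void)cmd;
	return MODEL_FAILED;
}

/* A stronger step that completes the command before reporting success. */
static enum model_eh_result model_reset_succeeds(struct model_cmd *cmd)
{
	model_scsi_done(cmd);
	return MODEL_SUCCESS;
}

/* Try each step in order; stop at the first one that reports success. */
static bool model_escalate(struct model_cmd *cmd,
			   const model_eh_step *steps, size_t nr_steps)
{
	for (size_t i = 0; i < nr_steps; i++) {
		cmd->steps_tried++;
		if (steps[i](cmd) == MODEL_SUCCESS) {
			/* Contract: success implies done was already called. */
			assert(cmd->done_called);
			return true;
		}
	}
	return false;	/* every step failed; the command is still pending */
}

/* Abort fails, so the model escalates to the next step. */
bool model_demo_escalation(struct model_cmd *cmd)
{
	const model_eh_step steps[] = { model_abort_fails, model_reset_succeeds };

	cmd->done_called = false;
	cmd->steps_tried = 0;
	return model_escalate(cmd, steps, sizeof(steps) / sizeof(steps[0]));
}
```

In this model the command that timed out is completed by the second step, so
it is never left without an owner as long as each step honours the contract.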

Bart.