On Tue, Nov 30, 2021 at 03:33:09PM -0800, Bart Van Assche wrote:
> This patch restores the behavior of the following algorithm from the legacy
> block layer:
> - Before completing a request, test-and-set REQ_ATOM_COMPLETE atomically.
>   Only call the block driver completion function if that flag was not yet
>   set.
> - Before calling the block driver timeout function, test-and-set
>   REQ_ATOM_COMPLETE atomically. Only call the timeout handler if that flag
>   was not yet set. If that flag was already set, do not restart the timer.
>
> Cc: Keith Busch <kbusch@xxxxxxxxxx>
> Reported-by: Adrian Hunter <adrian.hunter@xxxxxxxxx>
> Fixes: 065990bd198e ("scsi: set timed out out mq requests to complete")
> Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
> ---
>  drivers/scsi/scsi_error.c | 22 ++++++++--------------
>  1 file changed, 8 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 9cb0f9df621a..cd05f2db3339 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -331,6 +331,14 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>  	enum blk_eh_timer_return rtn = BLK_EH_DONE;
>  	struct Scsi_Host *host = scmd->device->host;
>
> +	/*
> +	 * scsi_done() may be called concurrently with scsi_times_out(). Only
> +	 * one of these two functions should proceed. Hence return early if
> +	 * scsi_done() won the race.
> +	 */
> +	if (test_and_set_bit(SCMD_STATE_COMPLETE, &scmd->state))
> +		return BLK_EH_DONE;
> +

If the timeout handler successfully sets the state to complete, and the
lld returns BLK_EH_RESET_TIMER, who gets to complete this command?

>  	trace_scsi_dispatch_cmd_timeout(scmd);
>  	scsi_log_completion(scmd, TIMEOUT_ERROR);
>
> @@ -341,20 +349,6 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>  		rtn = host->hostt->eh_timed_out(scmd);
>
>  	if (rtn == BLK_EH_DONE) {
> -		/*
> -		 * Set the command to complete first in order to prevent a real
> -		 * completion from releasing the command while error handling
> -		 * is using it. If the command was already completed, then the
> -		 * lower level driver beat the timeout handler, and it is safe
> -		 * to return without escalating error recovery.
> -		 *
> -		 * If timeout handling lost the race to a real completion, the
> -		 * block layer may ignore that due to a fake timeout injection,
> -		 * so return RESET_TIMER to allow error handling another shot
> -		 * at this command.
> -		 */
> -		if (test_and_set_bit(SCMD_STATE_COMPLETE, &scmd->state))
> -			return BLK_EH_RESET_TIMER;
>  		if (scsi_abort_command(scmd) != SUCCESS) {
>  			set_host_byte(scmd, DID_TIME_OUT);
>  			scsi_eh_scmd_add(scmd);
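
To make the question about BLK_EH_RESET_TIMER concrete, here is a small
user-space model of the ordering in question. It is only a sketch and not
kernel code: every model_* name, the EH_* enum and the "completed" field are
made up for illustration, C11 atomic_fetch_or() stands in for
test_and_set_bit(), and it assumes the completion path also does a
test-and-set on the same bit and returns early when it is already set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for BLK_EH_DONE / BLK_EH_RESET_TIMER. */
enum model_eh_return { EH_DONE, EH_RESET_TIMER };

#define MODEL_STATE_COMPLETE 1UL	/* models the SCMD_STATE_COMPLETE bit */

struct model_cmd {
	atomic_ulong state;	/* models scmd->state */
	bool completed;		/* true once the completion path really ran */
};

/* Like test_and_set_bit(): set the bit and report whether it was already set. */
static bool model_test_and_set(struct model_cmd *cmd)
{
	unsigned long old = atomic_fetch_or(&cmd->state, MODEL_STATE_COMPLETE);

	return (old & MODEL_STATE_COMPLETE) != 0;
}

/* Completion path (assumption: it bails out if the bit is already set). */
static void model_done(struct model_cmd *cmd)
{
	if (model_test_and_set(cmd))
		return;
	cmd->completed = true;
}

/*
 * Timeout path as in the hunk above: claim the bit first, then hand back
 * whatever the driver's timeout hook answered. The bit is never cleared.
 */
static enum model_eh_return model_times_out(struct model_cmd *cmd,
					    enum model_eh_return lld_answer)
{
	if (model_test_and_set(cmd))
		return EH_DONE;
	return lld_answer;
}

/* Timeout wins the race, the driver asks for a timer reset, then completes. */
static bool model_late_completion_is_dropped(void)
{
	struct model_cmd cmd = { 0 };
	enum model_eh_return rtn = model_times_out(&cmd, EH_RESET_TIMER);

	assert(rtn == EH_RESET_TIMER);
	model_done(&cmd);	/* the real completion arrives after the timeout */
	return !cmd.completed;
}
```

In this model, model_late_completion_is_dropped() returns true: the timer is
rearmed, but the later completion is discarded because the bit is still set.
Whether the real code behaves the same way depends on how scsi_done() and the
driver's eh_timed_out() hook treat that bit, which is what I am asking about.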