Re: [PATCH V3 1/3] scsi: ufs: Fix error handler clear ua deadlock

Adrian Hunter <adrian.hunter@xxxxxxxxx> · Mon, 13 Sep 2021 11:53:21 +0300

On 13/09/21 6:17 am, Bart Van Assche wrote:
> On 9/11/21 09:47, Adrian Hunter wrote:
>> On 8/09/21 1:36 am, Bart Van Assche wrote:
>>> --- a/drivers/scsi/ufs/ufshcd.c
>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>> @@ -2707,6 +2707,14 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>>>          }
>>>          fallthrough;
>>>      case UFSHCD_STATE_RESET:
>>> +        /*
>>> +         * The SCSI error handler only starts after all pending commands
>>> +         * have failed or timed out. Complete commands with
>>> +         * DID_IMM_RETRY to allow the error handler to start
>>> +         * if it has been scheduled.
>>> +         */
>>> +        set_host_byte(cmd, DID_IMM_RETRY);
>>> +        cmd->scsi_done(cmd);
>> 
>> Setting non-zero return value, in this case "err =
>> SCSI_MLQUEUE_HOST_BUSY" will anyway cause scsi_dec_host_busy(), so
>> does this make any difference?
> 
> The return value should be changed into 0 since returning
> SCSI_MLQUEUE_HOST_BUSY is only allowed if cmd->scsi_done(cmd) has not
> yet been called.
> 
> I expect that setting the host byte to DID_IMM_RETRY and calling
> scsi_done will make a difference, otherwise I wouldn't have suggested
> this. As explained in my previous email doing that triggers the SCSI
> command completion and resubmission paths. Resubmission only happens
> if the SCSI error handler has not yet been scheduled. The SCSI error
> handler is scheduled after scsi_done() has been called for all pending
> commands or a timeout has occurred. In other words, setting the host
> byte to DID_IMM_RETRY and calling scsi_done() makes it possible for
> the error handler to be scheduled, something that won't happen if
> ufshcd_queuecommand() systematically returns SCSI_MLQUEUE_HOST_BUSY.

Not getting it, sorry. :-(

The error handler sets UFSHCD_STATE_RESET and never leaves the state
as UFSHCD_STATE_RESET, so that case does not need to start the error
handler because it is already running.

The error handler is always scheduled after setting 
UFSHCD_STATE_EH_SCHEDULED_FATAL.

scsi_dec_host_busy() is called for any non-zero return value such as
SCSI_MLQUEUE_HOST_BUSY, i.e. in scsi_queue_rq():
	reason = scsi_dispatch_cmd(cmd);
	if (reason) {
		scsi_set_blocked(cmd, reason);
		ret = BLK_STS_RESOURCE;
		goto out_dec_host_busy;
	}

	return BLK_STS_OK;

out_dec_host_busy:
	scsi_dec_host_busy(shost, cmd);

And that will wake the error handler:

static void scsi_dec_host_busy(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
{
	unsigned long flags;

	rcu_read_lock();
	__clear_bit(SCMD_STATE_INFLIGHT, &cmd->state);
	if (unlikely(scsi_host_in_recovery(shost))) {
		spin_lock_irqsave(shost->host_lock, flags);
		if (shost->host_failed || shost->host_eh_scheduled)
			scsi_eh_wakeup(shost);
		spin_unlock_irqrestore(shost->host_lock, flags);
	}
	rcu_read_unlock();
}

Note that scsi_host_queue_ready() won't let any requests through
when scsi_host_in_recovery(), so the potential problem is with
requests that have already been successfully submitted to the
UFS driver but have not completed. The change you suggest
does not help with that.
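
To make the point about outstanding requests concrete, here is a minimal
user-space sketch. It is not kernel code: every toy_* name is hypothetical,
the busy count is a plain integer rather than per-command state, and the
wake-up condition (busy count equal to failed count) is an assumption
modelled on how scsi_eh_wakeup() is understood to behave.

```c
#include <stdbool.h>

/* Toy stand-in for the few Scsi_Host fields this discussion touches. */
struct toy_host {
	int busy;           /* commands counted as in flight */
	int failed;         /* commands already handed to the error handler */
	bool eh_scheduled;  /* the driver has asked for the error handler */
	bool in_recovery;   /* host is in a recovery state */
	bool eh_woken;      /* set once the toy error handler is woken */
};

/* Assumption: the handler is woken only when busy equals failed. */
static void toy_eh_wakeup(struct toy_host *h)
{
	if (h->busy == h->failed)
		h->eh_woken = true;
}

/* Follows the shape of scsi_dec_host_busy() quoted above. */
static void toy_dec_host_busy(struct toy_host *h)
{
	h->busy--;
	if (h->in_recovery && (h->failed || h->eh_scheduled))
		toy_eh_wakeup(h);
}

/*
 * One command is rejected with a non-zero "host busy" return while
 * `outstanding` earlier commands are still owned by the driver.
 * Returns true if the toy error handler gets woken.
 */
static bool toy_reject_one(int outstanding)
{
	struct toy_host h = {
		.busy = outstanding + 1,  /* +1 for the command being queued */
		.failed = 0,
		.eh_scheduled = true,
		.in_recovery = true,
	};

	toy_dec_host_busy(&h);        /* the non-zero return path */
	return h.eh_woken;
}
```

In this model toy_reject_one(0) returns true, because the rejected command
was the only busy one, while toy_reject_one(1) returns false: the command
still held by the driver keeps the busy count above the failed count, no
matter how the rejected command was handed back.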

That seems like another problem with the patch 
"scsi: ufs: Synchronize SCSI and UFS error handling".

> In the latter case the block layer timer is reset over and over
> again. See also the blk_mq_start_request() in scsi_queue_rq(). One
> could wonder whether this is really what the SCSI core should do if a
> SCSI LLD keeps returning the SCSI_MLQUEUE_HOST_BUSY status code ...
> 
> Bart.