Re: [PATCH v2] ata: libata: fix commands incorrectly not getting retried during NCQ error

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Fri, 18 Nov 2022 13:40:17 +0900

On 11/15/22 02:21, Niklas Cassel wrote:
> A NCQ error means that the device has aborted processing of all active
> commands.
> To get the single NCQ command that caused the NCQ error, host software has
> to read the NCQ error log, which also takes the device out of error state.
> 
> When the device encounters a NCQ error, we receive an error interrupt from
> the HBA, and call ata_do_link_abort() to mark all outstanding commands on
> the link as ATA_QCFLAG_FAILED (which means that these commands are owned
> by libata EH), and then call ata_qc_complete() on them.
> 
> ata_qc_complete() will call fill_result_tf() for all commands marked as
> ATA_QCFLAG_FAILED.
> 
> The taskfile is simply the latest status/error as seen from the device's
> perspective. The taskfile will have ATA_ERR set in the status field and
> ATA_ABORTED set in the error field.
> 
> When we fill the current taskfile values for all outstanding commands,
> that means that qc->result_tf will have ATA_ERR set for all commands
> owned by libata EH.
> 
> When ata_eh_link_autopsy() later analyzes all commands owned by libata EH,
> it will call ata_eh_analyze_tf(), which will check if qc->result_tf has
> ATA_ERR set, if it does, it will set qc->err_mask (which marks the command
> as an error).
> 
> When ata_eh_finish() later calls __ata_qc_complete() on all commands owned
> by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()),
> ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense
> data if qc->err_mask is set.
> 
> This means that we will generate sense data for commands that should not
> have any sense data set. Having sense data set for the non-failed commands
> will cause SCSI to finish these commands instead of retrying them.
> 
> While this incorrect behavior has existed for a long time, this first
> became a problem once we started reading the correct taskfile register in
> commit 4ba09d202657 ("ata: libahci: read correct status and error field
> for NCQ commands").
> 
> Before this commit, NCQ commands would read the taskfile values received
> from the last non-NCQ command completion, which most likely did not have
> ATA_ERR set, since the last non-NCQ command was most likely not an error.
> 
> Fix this by changing ata_eh_analyze_ncq_error() to mark all non-failed
> commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy()
> to skip commands marked as ATA_QCFLAG_RETRY.
> 
> While at it, make sure that we clear ATA_ERR and any error bits for all
> commands except the actual command that caused the NCQ error, so that no
> other libata code will be able to misinterpret these commands as errors.
> 
> Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands")
> Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>

Applied to for-6.1-fixes. Thanks !

> ---
> Changes since v1:
> -Squashed patch 1/2 with 2/2.
> 
>  drivers/ata/libata-eh.c   |  1 +
>  drivers/ata/libata-sata.c | 27 +++++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index bde15f855f70..34303ce67c14 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -1955,6 +1955,7 @@ static void ata_eh_link_autopsy(struct ata_link *link)
>  
>  	ata_qc_for_each_raw(ap, qc, tag) {
>  		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
> +		    qc->flags & ATA_QCFLAG_RETRY ||
>  		    ata_dev_phys_link(qc->dev) != link)
>  			continue;
>  
> diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
> index 283ce1ab29cf..18ef14e749a0 100644
> --- a/drivers/ata/libata-sata.c
> +++ b/drivers/ata/libata-sata.c
> @@ -1476,6 +1476,33 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
>  		}
>  	}
>  
> +	ata_qc_for_each_raw(ap, qc, tag) {
> +		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
> +		    ata_dev_phys_link(qc->dev) != link)
> +			continue;
> +
> +		/* Skip the single QC which caused the NCQ error. */
> +		if (qc->err_mask)
> +			continue;
> +
> +		/*
> +		 * For SATA, the STATUS and ERROR fields are shared for all NCQ
> +		 * commands that were completed with the same SDB FIS.
> +		 * Therefore, we have to clear the ATA_ERR bit for all QCs
> +		 * except the one that caused the NCQ error.
> +		 */
> +		qc->result_tf.status &= ~ATA_ERR;
> +		qc->result_tf.error = 0;
> +
> +		/*
> +		 * If we get a NCQ error, that means that a single command was
> +		 * aborted. All other failed commands for our link should be
> +		 * retried and has no business of going though further scrutiny
> +		 * by ata_eh_link_autopsy().
> +		 */
> +		qc->flags |= ATA_QCFLAG_RETRY;
> +	}
> +
>  	ehc->i.err_mask &= ~AC_ERR_DEV;
>  }
>  EXPORT_SYMBOL_GPL(ata_eh_analyze_ncq_error);

-- 
Damien Le Moal
Western Digital Research