On 11/15/22 02:21, Niklas Cassel wrote: > A NCQ error means that the device has aborted processing of all active > commands. > To get the single NCQ command that caused the NCQ error, host software has > to read the NCQ error log, which also takes the device out of error state. > > When the device encounters a NCQ error, we receive an error interrupt from > the HBA, and call ata_do_link_abort() to mark all outstanding commands on > the link as ATA_QCFLAG_FAILED (which means that these commands are owned > by libata EH), and then call ata_qc_complete() on them. > > ata_qc_complete() will call fill_result_tf() for all commands marked as > ATA_QCFLAG_FAILED. > > The taskfile is simply the latest status/error as seen from the device's > perspective. The taskfile will have ATA_ERR set in the status field and > ATA_ABORTED set in the error field. > > When we fill the current taskfile values for all outstanding commands, > that means that qc->result_tf will have ATA_ERR set for all commands > owned by libata EH. > > When ata_eh_link_autopsy() later analyzes all commands owned by libata EH, > it will call ata_eh_analyze_tf(), which will check if qc->result_tf has > ATA_ERR set, if it does, it will set qc->err_mask (which marks the command > as an error). > > When ata_eh_finish() later calls __ata_qc_complete() on all commands owned > by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()), > ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense > data if qc->err_mask is set. > > This means that we will generate sense data for commands that should not > have any sense data set. Having sense data set for the non-failed commands > will cause SCSI to finish these commands instead of retrying them. > > While this incorrect behavior has existed for a long time, this first > became a problem once we started reading the correct taskfile register in > commit 4ba09d202657 ("ata: libahci: read correct status and error field > for NCQ commands"). > > Before this commit, NCQ commands would read the taskfile values received > from the last non-NCQ command completion, which most likely did not have > ATA_ERR set, since the last non-NCQ command was most likely not an error. > > Fix this by changing ata_eh_analyze_ncq_error() to mark all non-failed > commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy() > to skip commands marked as ATA_QCFLAG_RETRY. > > While at it, make sure that we clear ATA_ERR and any error bits for all > commands except the actual command that caused the NCQ error, so that no > other libata code will be able to misinterpret these commands as errors. > > Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands") > Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx> Applied to for-6.1-fixes. Thanks ! > --- > Changes since v1: > -Squashed patch 1/2 with 2/2. > > drivers/ata/libata-eh.c | 1 + > drivers/ata/libata-sata.c | 27 +++++++++++++++++++++++++++ > 2 files changed, 28 insertions(+) > > diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c > index bde15f855f70..34303ce67c14 100644 > --- a/drivers/ata/libata-eh.c > +++ b/drivers/ata/libata-eh.c > @@ -1955,6 +1955,7 @@ static void ata_eh_link_autopsy(struct ata_link *link) > > ata_qc_for_each_raw(ap, qc, tag) { > if (!(qc->flags & ATA_QCFLAG_FAILED) || > + qc->flags & ATA_QCFLAG_RETRY || > ata_dev_phys_link(qc->dev) != link) > continue; > > diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c > index 283ce1ab29cf..18ef14e749a0 100644 > --- a/drivers/ata/libata-sata.c > +++ b/drivers/ata/libata-sata.c > @@ -1476,6 +1476,33 @@ void ata_eh_analyze_ncq_error(struct ata_link *link) > } > } > > + ata_qc_for_each_raw(ap, qc, tag) { > + if (!(qc->flags & ATA_QCFLAG_FAILED) || > + ata_dev_phys_link(qc->dev) != link) > + continue; > + > + /* Skip the single QC which caused the NCQ error. */ > + if (qc->err_mask) > + continue; > + > + /* > + * For SATA, the STATUS and ERROR fields are shared for all NCQ > + * commands that were completed with the same SDB FIS. > + * Therefore, we have to clear the ATA_ERR bit for all QCs > + * except the one that caused the NCQ error. > + */ > + qc->result_tf.status &= ~ATA_ERR; > + qc->result_tf.error = 0; > + > + /* > + * If we get a NCQ error, that means that a single command was > + * aborted. All other failed commands for our link should be > + * retried and has no business of going though further scrutiny > + * by ata_eh_link_autopsy(). > + */ > + qc->flags |= ATA_QCFLAG_RETRY; > + } > + > ehc->i.err_mask &= ~AC_ERR_DEV; > } > EXPORT_SYMBOL_GPL(ata_eh_analyze_ncq_error); -- Damien Le Moal Western Digital Research