On Mon, Nov 14, 2022 at 11:11:26AM +0900, Damien Le Moal wrote: > On 11/11/22 20:09, Niklas Cassel wrote: > > A NCQ error means that the device has aborted a single NCQ command and > > halted further processing of queued commands. > > Nit: To be strict with wording, this should say something like "an NCQ > command failure results in the device aborting the execution of all active > commands." > > > To get the single NCQ command that caused the NCQ error, host software has > > to read the NCQ error log, which also takes the device out of error state. > > > > When the device encounters a NCQ error, we receive an error interrupt from > > the HBA, and call ata_do_link_abort() to mark all outstanding commands on > > the link as ATA_QCFLAG_FAILED (which means that these commands are owned > > by libata EH), and then call ata_qc_complete() on them. > > > > ata_qc_complete() will call fill_result_tf() for all commands marked as > > ATA_QCFLAG_FAILED. > > > > The taskfile is simply the latest status/error as seen from the device's > > perspective. The taskfile will have ATA_ERR set in the status field and > > ATA_ABORTED set in the error field. > > > > When we fill the current taskfile values for all outstanding commands, > > that means that qc->result_tf will have ATA_ERR set for all commands > > owned by libata EH. > > > > When ata_eh_link_autopsy() later analyzes all commands owned by libata EH, > > it will call ata_eh_analyze_tf(), which will check if qc->result_tf has > > ATA_ERR set, if it does, it will set qc->err_mask (which marks the command > > as an error). > > > > When ata_eh_finish() later calls __ata_qc_complete() on all commands owned > > by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()), > > ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense > > data if qc->err_mask is set. > > > > This means that we will generate sense data for commands that really > > should not have any sense data set. Having sense data set might cause SCSI > > to finish these commands instead of retrying them. > > > > While this incorrect behavior has existed for a long time, this first > > became a problem once we started reading the correct taskfile register in > > commit 4ba09d202657 ("ata: libahci: read correct status and error field > > for NCQ commands"). > > > > Before this commit, NCQ commands would read the taskfile values received > > from the last non-NCQ command completion, which most likely did not have > > ATA_ERR set, since the last non-NCQ command was most likely not an error. > > > > Fix this by clearing ATA_ERR and any error bits for all commands except > > the actual command that caused the NCQ error, since the error bits in the > > taskfile are not applicable to them. > > > > Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands") > > Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx> > > --- > > drivers/ata/libata-sata.c | 19 +++++++++++++++++++ > > 1 file changed, 19 insertions(+) > > > > diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c > > index 283ce1ab29cf..6b2dcf3eb2fb 100644 > > --- a/drivers/ata/libata-sata.c > > +++ b/drivers/ata/libata-sata.c > > @@ -1476,6 +1476,25 @@ void ata_eh_analyze_ncq_error(struct ata_link *link) > > } > > } > > > > + ata_qc_for_each_raw(ap, qc, tag) { > > + if (!(qc->flags & ATA_QCFLAG_FAILED) || > > + ata_dev_phys_link(qc->dev) != link) > > + continue; > > + > > + /* Skip the single QC which caused the NCQ error. */ > > + if (qc->err_mask) > > + continue; > > Before the continue, should we check that this qc tag is the one we got > from ata_eh_read_log_10h() ? We should at least warn if there is a mismatch. I really see no point of displaying a warning in case of mismatch here. At this point, there will be exactly one command that has AC_ERR_NCQ set. If ata_eh_read_log_10h() reported an invalid tag, we would have returned in the check performed directly after ata_eh_read_log_10h() was called: if (!(link->sactive & (1 << tag))) { ata_link_err(link, "log page 10h reported inactive tag %d\n", tag); return; } Which means that we would never have reached this new code. However, there could theoretically be another command that has e.g. AC_ERR_TIMEOUT set, if there was a command that timed out just before the NCQ error, so EH did not yet have a change to run before handling both errors at the same time. Therefore I wrote: + if (qc->err_mask) + continue; Instead of: + if (qc->err_mask & AC_ERR_NCQ) + continue; Kind regards, Niklas