Re: [PATCH v2] ata: libata: fix commands incorrectly not getting retried during NCQ error

Niklas Cassel <Niklas.Cassel@xxxxxxx> · Fri, 18 Nov 2022 09:41:58 +0000

On Fri, Nov 18, 2022 at 01:40:17PM +0900, Damien Le Moal wrote:
> On 11/15/22 02:21, Niklas Cassel wrote:
> > A NCQ error means that the device has aborted processing of all active
> > commands.
> > To get the single NCQ command that caused the NCQ error, host software has
> > to read the NCQ error log, which also takes the device out of error state.
> > 
> > When the device encounters a NCQ error, we receive an error interrupt from
> > the HBA, and call ata_do_link_abort() to mark all outstanding commands on
> > the link as ATA_QCFLAG_FAILED (which means that these commands are owned
> > by libata EH), and then call ata_qc_complete() on them.
> > 
> > ata_qc_complete() will call fill_result_tf() for all commands marked as
> > ATA_QCFLAG_FAILED.
> > 
> > The taskfile is simply the latest status/error as seen from the device's
> > perspective. The taskfile will have ATA_ERR set in the status field and
> > ATA_ABORTED set in the error field.
> > 
> > When we fill the current taskfile values for all outstanding commands,
> > that means that qc->result_tf will have ATA_ERR set for all commands
> > owned by libata EH.
> > 
> > When ata_eh_link_autopsy() later analyzes all commands owned by libata EH,
> > it will call ata_eh_analyze_tf(), which will check if qc->result_tf has
> > ATA_ERR set, if it does, it will set qc->err_mask (which marks the command
> > as an error).
> > 
> > When ata_eh_finish() later calls __ata_qc_complete() on all commands owned
> > by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()),
> > ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense
> > data if qc->err_mask is set.
> > 
> > This means that we will generate sense data for commands that should not
> > have any sense data set. Having sense data set for the non-failed commands
> > will cause SCSI to finish these commands instead of retrying them.
> > 
> > While this incorrect behavior has existed for a long time, this first
> > became a problem once we started reading the correct taskfile register in
> > commit 4ba09d202657 ("ata: libahci: read correct status and error field
> > for NCQ commands").
> > 
> > Before this commit, NCQ commands would read the taskfile values received
> > from the last non-NCQ command completion, which most likely did not have
> > ATA_ERR set, since the last non-NCQ command was most likely not an error.
> > 
> > Fix this by changing ata_eh_analyze_ncq_error() to mark all non-failed
> > commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy()
> > to skip commands marked as ATA_QCFLAG_RETRY.
> > 
> > While at it, make sure that we clear ATA_ERR and any error bits for all
> > commands except the actual command that caused the NCQ error, so that no
> > other libata code will be able to misinterpret these commands as errors.
> > 
> > Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands")
> > Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
> 
> Applied to for-6.1-fixes. Thanks !

So, the Fixes-tag points to a commit that is only in your for-next branch.

If you apply this patch to 6.1-fixes, then the Fixes-tag points to a commit
that does not (yet) exist in the tree.

If you prefer to apply this patch to 6.1-fixes, then we should probably
change the Fixes-tag to point to:
e8ee84518c15 ("[PATCH] libata-ncq: update EH to handle NCQ")

While the problem could happen even on v6.1-rc5, it is highly unlikely,
as v6.1-rc5 is reading the wrong status register for NCQ commands,
which means that during an NCQ error, ata_eh_analyze_tf() will read the
status from the last non-NCQ command, which most likely does not have
ATA_ERR set in status.

I think the only way it is a problem on v6.1-rc5 is if:
1) A non-NCQ command fails
2a) No D2H FIS (non-NCQ command) is received with ATA_ERR bit cleared,
before 3) happens
2b) The device is not reset, before 3) happens
3) A NCQ error occurs
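
To make the sequence above concrete, here is a minimal sketch of it as a toy
model. This is not libata code: struct toy_port and the toy_*() helpers are
hypothetical names made up for illustration, and modelling a reset as simply
clearing ATA_ERR in the latched status is an assumption. Only the ATA_ERR and
ATA_DRDY bit values are taken from include/linux/ata.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Status register bits, same values as in include/linux/ata.h. */
#define ATA_ERR  (1 << 0)	/* an error occurred */
#define ATA_DRDY (1 << 6)	/* device ready */

/* Hypothetical model: the status last latched from a D2H Register FIS. */
struct toy_port {
	uint8_t d2h_status;
};

/* A non-NCQ command completes; its D2H FIS overwrites the latched status. */
static void toy_non_ncq_complete(struct toy_port *p, bool failed)
{
	p->d2h_status = ATA_DRDY | (failed ? ATA_ERR : 0);
}

/* Assumption: after a reset the latched status no longer has ATA_ERR set. */
static void toy_reset(struct toy_port *p)
{
	p->d2h_status = ATA_DRDY;
}

/*
 * v6.1-rc5 behaviour as described in this mail: on an NCQ error, the status
 * used for every outstanding command is the stale value left behind by the
 * last non-NCQ command. Returns true if that stale value has ATA_ERR set,
 * i.e. if the non-failed NCQ commands would be misinterpreted as errors.
 */
static bool toy_ncq_error_sees_ata_err(const struct toy_port *p)
{
	return (p->d2h_status & ATA_ERR) != 0;
}
```

In this model, toy_ncq_error_sees_ata_err() only returns true when a failed
toy_non_ncq_complete() is the most recent event on the port, which is why I
consider the problem highly unlikely on v6.1-rc5.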

Perhaps just queue this up for 6.2 instead?

Kind regards,
Niklas