On Fri, Nov 18, 2022 at 01:40:17PM +0900, Damien Le Moal wrote: > On 11/15/22 02:21, Niklas Cassel wrote: > > A NCQ error means that the device has aborted processing of all active > > commands. > > To get the single NCQ command that caused the NCQ error, host software has > > to read the NCQ error log, which also takes the device out of error state. > > > > When the device encounters a NCQ error, we receive an error interrupt from > > the HBA, and call ata_do_link_abort() to mark all outstanding commands on > > the link as ATA_QCFLAG_FAILED (which means that these commands are owned > > by libata EH), and then call ata_qc_complete() on them. > > > > ata_qc_complete() will call fill_result_tf() for all commands marked as > > ATA_QCFLAG_FAILED. > > > > The taskfile is simply the latest status/error as seen from the device's > > perspective. The taskfile will have ATA_ERR set in the status field and > > ATA_ABORTED set in the error field. > > > > When we fill the current taskfile values for all outstanding commands, > > that means that qc->result_tf will have ATA_ERR set for all commands > > owned by libata EH. > > > > When ata_eh_link_autopsy() later analyzes all commands owned by libata EH, > > it will call ata_eh_analyze_tf(), which will check if qc->result_tf has > > ATA_ERR set, if it does, it will set qc->err_mask (which marks the command > > as an error). > > > > When ata_eh_finish() later calls __ata_qc_complete() on all commands owned > > by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()), > > ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense > > data if qc->err_mask is set. > > > > This means that we will generate sense data for commands that should not > > have any sense data set. Having sense data set for the non-failed commands > > will cause SCSI to finish these commands instead of retrying them. > > > > While this incorrect behavior has existed for a long time, this first > > became a problem once we started reading the correct taskfile register in > > commit 4ba09d202657 ("ata: libahci: read correct status and error field > > for NCQ commands"). > > > > Before this commit, NCQ commands would read the taskfile values received > > from the last non-NCQ command completion, which most likely did not have > > ATA_ERR set, since the last non-NCQ command was most likely not an error. > > > > Fix this by changing ata_eh_analyze_ncq_error() to mark all non-failed > > commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy() > > to skip commands marked as ATA_QCFLAG_RETRY. > > > > While at it, make sure that we clear ATA_ERR and any error bits for all > > commands except the actual command that caused the NCQ error, so that no > > other libata code will be able to misinterpret these commands as errors. > > > > Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands") > > Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx> > > Applied to for-6.1-fixes. Thanks ! So, the Fixes-tag points to a commit that is only in your for-next branch. If you this patch to 6.1-fixes, then the Fixes-tag points to a commit that does not (yet) exist in the tree. If you prefer to this patch to 6.1-fixes, then we should probably change the Fixes-tag to point to: e8ee84518c15 ("[PATCH] libata-ncq: update EH to handle NCQ") While the problem could happen even on v6.1-rc5, it is highly unlikely, as v6.1-rc5 is reading the wrong status register for NCQ commands, which means that during an NCQ error, analyze_tf() will read the status from the last non-NCQ command, which is most likely does not have ATA_ERR set in status. I think the only way it is a problem on v6.1-rc5 is if: 1) A non-NCQ command fails 2a) No D2H FIS (non-NCQ command) is received with ATA_ERR bit cleared, before 3) happens 2b) The device is not reset, before 3) happens 3) A NCQ error occurs Perhaps just queue this up for 6.2 instead? Kind regards, Niklas