On 11/18/22 18:41, Niklas Cassel wrote: > On Fri, Nov 18, 2022 at 01:40:17PM +0900, Damien Le Moal wrote: >> On 11/15/22 02:21, Niklas Cassel wrote: >>> A NCQ error means that the device has aborted processing of all active >>> commands. >>> To get the single NCQ command that caused the NCQ error, host software has >>> to read the NCQ error log, which also takes the device out of error state. >>> >>> When the device encounters a NCQ error, we receive an error interrupt from >>> the HBA, and call ata_do_link_abort() to mark all outstanding commands on >>> the link as ATA_QCFLAG_FAILED (which means that these commands are owned >>> by libata EH), and then call ata_qc_complete() on them. >>> >>> ata_qc_complete() will call fill_result_tf() for all commands marked as >>> ATA_QCFLAG_FAILED. >>> >>> The taskfile is simply the latest status/error as seen from the device's >>> perspective. The taskfile will have ATA_ERR set in the status field and >>> ATA_ABORTED set in the error field. >>> >>> When we fill the current taskfile values for all outstanding commands, >>> that means that qc->result_tf will have ATA_ERR set for all commands >>> owned by libata EH. >>> >>> When ata_eh_link_autopsy() later analyzes all commands owned by libata EH, >>> it will call ata_eh_analyze_tf(), which will check if qc->result_tf has >>> ATA_ERR set, if it does, it will set qc->err_mask (which marks the command >>> as an error). >>> >>> When ata_eh_finish() later calls __ata_qc_complete() on all commands owned >>> by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()), >>> ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense >>> data if qc->err_mask is set. >>> >>> This means that we will generate sense data for commands that should not >>> have any sense data set. Having sense data set for the non-failed commands >>> will cause SCSI to finish these commands instead of retrying them. >>> >>> While this incorrect behavior has existed for a long time, this first >>> became a problem once we started reading the correct taskfile register in >>> commit 4ba09d202657 ("ata: libahci: read correct status and error field >>> for NCQ commands"). >>> >>> Before this commit, NCQ commands would read the taskfile values received >>> from the last non-NCQ command completion, which most likely did not have >>> ATA_ERR set, since the last non-NCQ command was most likely not an error. >>> >>> Fix this by changing ata_eh_analyze_ncq_error() to mark all non-failed >>> commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy() >>> to skip commands marked as ATA_QCFLAG_RETRY. >>> >>> While at it, make sure that we clear ATA_ERR and any error bits for all >>> commands except the actual command that caused the NCQ error, so that no >>> other libata code will be able to misinterpret these commands as errors. >>> >>> Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands") >>> Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx> >> >> Applied to for-6.1-fixes. Thanks ! > > So, the Fixes-tag points to a commit that is only in your for-next branch. Arg. I got confused with last week fix... > > If you this patch to 6.1-fixes, then the Fixes-tag points to a commit that > does not (yet) exist in the tree. > > If you prefer to this patch to 6.1-fixes, then we should probably change > the Fixes-tag to point to: > e8ee84518c15 ("[PATCH] libata-ncq: update EH to handle NCQ") > > While the problem could happen even on v6.1-rc5, it is highly unlikely, > as v6.1-rc5 is reading the wrong status register for NCQ commands, > which means that during an NCQ error, analyze_tf() will read the status > from the last non-NCQ command, which is most likely does not have ATA_ERR > set in status. > > I think the only way it is a problem on v6.1-rc5 is if: > 1) A non-NCQ command fails > 2a) No D2H FIS (non-NCQ command) is received with ATA_ERR bit cleared, > before 3) happens > 2b) The device is not reset, before 3) happens > 3) A NCQ error occurs > > Perhaps just queue this up for 6.2 instead? Indeed. It needs to be there. Moving it. Thanks. > > > Kind regards, > Niklas -- Damien Le Moal Western Digital Research