Re: [PATCH v2] ata: libata: fix commands incorrectly not getting retried during NCQ error

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Sat, 19 Nov 2022 09:39:32 +0900

On 11/18/22 18:41, Niklas Cassel wrote:
> On Fri, Nov 18, 2022 at 01:40:17PM +0900, Damien Le Moal wrote:
>> On 11/15/22 02:21, Niklas Cassel wrote:
>>> A NCQ error means that the device has aborted processing of all active
>>> commands.
>>> To get the single NCQ command that caused the NCQ error, host software has
>>> to read the NCQ error log, which also takes the device out of error state.
>>>
>>> When the device encounters a NCQ error, we receive an error interrupt from
>>> the HBA, and call ata_do_link_abort() to mark all outstanding commands on
>>> the link as ATA_QCFLAG_FAILED (which means that these commands are owned
>>> by libata EH), and then call ata_qc_complete() on them.
>>>
>>> ata_qc_complete() will call fill_result_tf() for all commands marked as
>>> ATA_QCFLAG_FAILED.
>>>
>>> The taskfile is simply the latest status/error as seen from the device's
>>> perspective. The taskfile will have ATA_ERR set in the status field and
>>> ATA_ABORTED set in the error field.
>>>
>>> When we fill the current taskfile values for all outstanding commands,
>>> that means that qc->result_tf will have ATA_ERR set for all commands
>>> owned by libata EH.
>>>
>>> When ata_eh_link_autopsy() later analyzes all commands owned by libata EH,
>>> it will call ata_eh_analyze_tf(), which will check if qc->result_tf has
>>> ATA_ERR set, if it does, it will set qc->err_mask (which marks the command
>>> as an error).
>>>
>>> When ata_eh_finish() later calls __ata_qc_complete() on all commands owned
>>> by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()),
>>> ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense
>>> data if qc->err_mask is set.
>>>
>>> This means that we will generate sense data for commands that should not
>>> have any sense data set. Having sense data set for the non-failed commands
>>> will cause SCSI to finish these commands instead of retrying them.
>>>
>>> While this incorrect behavior has existed for a long time, this first
>>> became a problem once we started reading the correct taskfile register in
>>> commit 4ba09d202657 ("ata: libahci: read correct status and error field
>>> for NCQ commands").
>>>
>>> Before this commit, NCQ commands would read the taskfile values received
>>> from the last non-NCQ command completion, which most likely did not have
>>> ATA_ERR set, since the last non-NCQ command was most likely not an error.
>>>
>>> Fix this by changing ata_eh_analyze_ncq_error() to mark all non-failed
>>> commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy()
>>> to skip commands marked as ATA_QCFLAG_RETRY.
>>>
>>> While at it, make sure that we clear ATA_ERR and any error bits for all
>>> commands except the actual command that caused the NCQ error, so that no
>>> other libata code will be able to misinterpret these commands as errors.
>>>
>>> Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands")
>>> Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
>>
>> Applied to for-6.1-fixes. Thanks !
> 
> So, the Fixes-tag points to a commit that is only in your for-next branch.

Arg. I got confused with last week fix...

> 
> If you this patch to 6.1-fixes, then the Fixes-tag points to a commit that
> does not (yet) exist in the tree.
> 
> If you prefer to this patch to 6.1-fixes, then we should probably change
> the Fixes-tag to point to:
> e8ee84518c15 ("[PATCH] libata-ncq: update EH to handle NCQ")
> 
> While the problem could happen even on v6.1-rc5, it is highly unlikely,
> as v6.1-rc5 is reading the wrong status register for NCQ commands,
> which means that during an NCQ error, analyze_tf() will read the status
> from the last non-NCQ command, which is most likely does not have ATA_ERR
> set in status.
> 
> I think the only way it is a problem on v6.1-rc5 is if:
> 1) A non-NCQ command fails
> 2a) No D2H FIS (non-NCQ command) is received with ATA_ERR bit cleared,
> before 3) happens
> 2b) The device is not reset, before 3) happens
> 3) A NCQ error occurs
> 
> Perhaps just queue this up for 6.2 instead?

Indeed. It needs to be there. Moving it.
Thanks.

> 
> 
> Kind regards,
> Niklas

-- 
Damien Le Moal
Western Digital Research