On 28/01/2022 09:09, John Garry wrote:
I ran some more tests. In particular, I ran libzbc compliance tests on a
20TB SMR drive. All tests pass with 5.17-rc1, but after applying your
series, I see command timeouts that take forever to recover from, with
the drive revalidation failing after that.
[ 385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[ 385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
[ 405.561099] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
[ 405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is aborted
[ 405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is aborted
[ 411.192602] ata21.00: qc timeout (cmd 0xec)
[ 431.672122] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
[ 431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 431.685544] ata21.00: revalidation failed (errno=-5)
[ 441.911948] ata21.00: qc timeout (cmd 0xec)
[ 462.391545] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
[ 462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 462.404992] ata21.00: revalidation failed (errno=-5)
[ 492.598769] ata21.00: qc timeout (cmd 0xec)
...
So there is a problem. Need to dig into this. I see this issue only with
libzbc passthrough tests. fio runs with libaio are fine.
Thanks for the notice. I think that I also saw a hang but, IIRC, it
happened on mainline for me. It's hard to know whether I broke something
if it is already broken in another way. That is why I wanted this card
working properly...
Hi Damien,
From testing mainline, I can see a hang on my arm64 system for SAS
disks. I think that the reason is that we don't finish some commands in
EH properly for pm8001:
- In EH, we attempt to abort the task in sas_scsi_find_task() ->
lldd_abort_task().
The default return value of pm8001_exec_internal_tmf_task() is
-TMF_RESP_FUNC_FAILED, so if the TMF does not execute properly we return
this value.
- sas_scsi_find_task() cannot handle -TMF_RESP_FUNC_FAILED, and returns
-TMF_RESP_FUNC_FAILED directly to sas_eh_handle_sas_errors(), which,
again, does not handle -TMF_RESP_FUNC_FAILED. So we never progress to
finishing the command.
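To illustrate the sign mismatch described above, here is a rough userspace
sketch. It is not the libsas code: eh_classify(), eh_demo() and enum
eh_outcome are hypothetical names made up for this example, and the two
TMF_RESP_* values are quoted from memory of include/scsi/sas.h, so treat
them as assumptions. It only models a caller that switches on the positive
TMF response codes:

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values, quoted from memory of include/scsi/sas.h. */
#define TMF_RESP_FUNC_COMPLETE 0x00
#define TMF_RESP_FUNC_FAILED   0x05

/* Hypothetical, simplified stand-ins for the EH task dispositions. */
enum eh_outcome {
	EH_TASK_ABORTED,      /* abort worked, the command can be finished */
	EH_TASK_ABORT_FAILED, /* abort failed, recovery can escalate */
	EH_UNHANDLED,         /* no case matched, the command is left pending */
};

/* Hypothetical model of a caller that switches on the LLDD return value. */
enum eh_outcome eh_classify(int res)
{
	switch (res) {
	case TMF_RESP_FUNC_COMPLETE:
		return EH_TASK_ABORTED;
	case TMF_RESP_FUNC_FAILED:
		return EH_TASK_ABORT_FAILED;
	default:
		/* A negated code such as -TMF_RESP_FUNC_FAILED ends up here. */
		return EH_UNHANDLED;
	}
}

/* Prints the numeric outcome for the negated and the positive code. */
void eh_demo(void)
{
	assert(eh_classify(TMF_RESP_FUNC_COMPLETE) == EH_TASK_ABORTED);
	printf("negated code:  outcome %d\n",
	       eh_classify(-TMF_RESP_FUNC_FAILED));
	printf("positive code: outcome %d\n",
	       eh_classify(TMF_RESP_FUNC_FAILED));
}
```

In this model only the positive code reaches the "abort failed" case. The
negated value falls through to the default, which corresponds to the
command never being finished.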
This looks like the correct fix for mainline:
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -766,7 +766,7 @@ static int pm8001_exec_internal_tmf_task(struct domain_device *dev,
pm8001_dev, DS_OPERATIONAL);
wait_for_completion(&completion_setstate);
}
- res = -TMF_RESP_FUNC_FAILED;
+ res = TMF_RESP_FUNC_FAILED;
That's effectively the same as what I have in this series in
sas_execute_tmf().
However your testing is a SATA device, which I'll check further.
Thanks,
John