Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 28/01/2022 09:09, John Garry wrote:
I ran some more tests. In particular, I ran libzbc compliance tests on a
20TB SMR drives. All tests pass with 5.17-rc1, but after applying your
series, I see command timeout that take forever to recover from, with
the drive revalidation failing after that.

[  385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[  385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
[  405.561099] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task
timeout.
[  405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is aborted
[  405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is
aborted
[  411.192602] ata21.00: qc timeout (cmd 0xec)
[  431.672122] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task
timeout.
[  431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  431.685544] ata21.00: revalidation failed (errno=-5)
[  441.911948] ata21.00: qc timeout (cmd 0xec)
[  462.391545] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task
timeout.
[  462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  462.404992] ata21.00: revalidation failed (errno=-5)
[  492.598769] ata21.00: qc timeout (cmd 0xec)
...

So there is a problem. Need to dig into this. I see this issue only with
libzbc passthrough tests. fio runs with libaio are fine.

Thanks for the notice. I think that I also saw a hang, but, IIRC, it happened on mainline for me - but it's hard to know if I broke something if it is already broke in another way. That is why I wanted this card working properly...

Hi Damien,

From testing mainline, I can see a hang on my arm64 system for SAS disks. I think that the reason is the we don't finish some commands in EH properly for pm8001: - In EH, we attempt to abort the task in sas_scsi_find_task() -> lldd_abort_task() The default return from pm8001_exec_internal_tmf_task() is -TMF_RESP_FUNC_FAILED, so if the TMF does not execute properly we return this value - sas_scsi_find_task() cannot handle -TMF_RESP_FUNC_FAILED, and returns -TMF_RESP_FUNC_FAILED directly to sas_eh_handle_sas_errors(), which, again, does not handle -TMF_RESP_FUNC_FAILED. So we don't progress to ever finish the comand.

This looks like the correct fix for mainline:

--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -766,7 +766,7 @@ static int pm8001_exec_internal_tmf_task(struct domain_device *dev,
pm8001_dev, DS_OPERATIONAL);
wait_for_completion(&completion_setstate);
}
- res = -TMF_RESP_FUNC_FAILED;
+ res = TMF_RESP_FUNC_FAILED;

That's effectively the same as what I have in this series in sas_execute_tmf().

However your testing is a SATA device, which I'll check further.

Thanks,
John



[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux