Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

John Garry <john.garry@xxxxxxxxxx> · Thu, 3 Feb 2022 15:55:22 +0000

On 03/02/2022 09:44, Damien Le Moal wrote:

Hi Damien,

[  385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[  385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
[  405.561099] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task

Contrary to mentioning TMF in the log, this is not a TMF but rather an 
internal abort timing out. I don't think that this should ever happen. 
This command should just abort pending IO commands in the controller and 
not send anything to the target. So for this to timeout means a HW fault 
or driver bug. And I did not touch this code for pm8001.

timeout.
[  405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is
aborted
[  405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is
aborted
[  411.192602] ata21.00: qc timeout (cmd 0xec)
[  431.672122] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task
timeout.
[  431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  431.685544] ata21.00: revalidation failed (errno=-5)
[  441.911948] ata21.00: qc timeout (cmd 0xec)
[  462.391545] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task
timeout.
[  462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  462.404992] ata21.00: revalidation failed (errno=-5)
[  492.598769] ata21.00: qc timeout (cmd 0xec)
...


Do you have a fuller dmesg with my series?

...

}
- res = -TMF_RESP_FUNC_FAILED;
+ res = TMF_RESP_FUNC_FAILED;

That's effectively the same as what I have in this series in
sas_execute_tmf().

However your testing is a SATA device, which I'll check further.
This did not help. Still seeing 100% reproducible hangs.

OK, but I think that we should also have this change as the mainline 
codes looks broken to be begin with:

--->8 ---

[PATCH] scsi: libsas: Handle all errors in sas_scsi_find_task()

LLDD TMFs callbacks may return linux or other error codes instead of TMF
codes. This may cause problems in sas_scsi_find_task() ->
.lldd_query_task(), as only TMF codes are handled there. As such, we may
not return a task_disposition type. Function sas_eh_handle_sas_errors() 
only handles that type, and may exit error
handling early for unrecognised types.

So use TASK_ABORT_FAILED for non-TMF types returned from
.lldd_query_task(), on the assumption that the command may still be 
alive and error handling should be escalated.

Signed-off-by: John Garry <john.garry@xxxxxxxxxx>

diff --git a/drivers/scsi/libsas/sas_scsi_host.c 
b/drivers/scsi/libsas/sas_scsi_host.c
index 53d8b7ede0cd..02274f471308 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -316,8 +316,11 @@ static enum task_disposition 
sas_scsi_find_task(struct sas_task *task)
 				pr_notice("%s: task 0x%p failed to abort\n",
 					  __func__, task);
 				return TASK_ABORT_FAILED;
+			default:
+				pr_notice("%s: task 0x%p result code %d not handled, assuming 
failed\n",
+					  __func__, task, res);
+				return TASK_ABORT_FAILED;
 			}
-
 		}
 	}
 	return res;

---8< ----


I did a lot of testing/digging today, 

Thanks for the effort!

> and the hang cause seems to be
> missing task completions.
At random, a task times out as its completion

That sounds fimilar to my general issue running this driver on an arm64 
host...

does not come, and subsequent abort trial for the task fail, revalidate
fails

I assume SMP IOs fail if revalidation fails - if this is the case, then 
the controller seems to be in bad state.

and the device is dropped (capacity goes to 0). But at that point,
doing rmmod/modprobe to reset the device does not work. sync cache
command issued at rmmod time never completes. I end up needing to power
cycle the machine every time...

No clue about the root cause yet, but it definitely seem to be related
to NCQ/high QD operation. If I force my tests to use non-NCQ commands,
everything is fine and the tests run to completion without any issue.

I wonder if their is a tag management bug somewhere...

Maybe. Not sure.

On a related point, Hannes' change here could avoid it:

https://lore.kernel.org/linux-scsi/20210222132405.91369-32-hare@xxxxxxx/


I did stumble on something very ugly in libsas too: sas_ata_qc_issue()
drops and retake the ata port lock. No other ATA driver do that since
the ata completion also take that lock. The ata port lock is taken
before ata_qc_issue() is called with IRQ disabled (spin_lock_irqsave()).
So doing a spin_unlock()/spin_lock() in sas_ata_qc_issue() (called from
ata_qc_issue()) seems like a very bad idea. I removed that and
everything work the same way (the lld execute does not sleep). But that
did not solve the hang problem.

I would need to check why this is done again. Before my time...


Of note is this is all with your libsas patches applied. Without the
patches, I have KASAN screaming at me about use-after-free in completion
context. With your patches, KASAN is silent.

Another thing: this driver does not allow changing the max qd... Very
annoying.

echo 1 > /sys/block/sdX/device/queue_depth

has no effect. QD stays at 32 for an ATA drive. Need to look into that too.

I had a look at this. It seems that we fail in 
__ata_change_queue_depth() -> ata_scsi_find_dev() returning NULL.

Thanks again for your effort, I will continue to look.

john