I figured out what is happening here and it does not help solve the
mystery of my hang.
Here's the steps:
a. scsi_cmnd times out
b. scsi error handling kicks in
c. libsas attempts to abort the task, which fails
d. libsas then tries IT nexus reset, which passes
- libsas assumes the scsi_cmnd has completed with failure
e. error handling concludes
f. scsi midlayer then retries the same scsi_cmnd
g. since we did not "free" associated ccb earlier or dma unmap at d.,
the dma unmap on the same scsi_cmnd causes the warn
So the LLD should really free resources and dma unmap at point IT nexus
reset completes, but it doesn't. I think in certain conditions dma map
should not be done twice.
Anyway, that can be fixed, but I still have the hang :(
I guess (a) (cmd timeout) is only the symptom of the hang ? That is, the
hang is causing the timeout ?
Right
It may be good to turn on scsi trace to see if the command was only
partially done, or not at all, or if it is a non-data command.
I could do that. But I think that the command just does not complete. Or
maybe it is missed.
And speaking of errors, I am currently testing v4 of my series and
noticed some weird things in the error handling. E.g., with one of the
test executing a report zones command with an LBA out of range, I see this:
[23962.027105] pm80xx0:: mpi_sata_event 2788:SATA EVENT 0x23
[23962.036099] pm80xx0:: pm80xx_send_read_log 1863:Executing read log end
I don't know why the driver even does this, but the implementation of
pm80xx_send_read_log() is questionable. It would be nice to not see ATA
code in the driver like this.
All good: this is IO_XFER_ERROR_ABORTED_NCQ_MODE. And the read log is to
get the drive queue out of error state.
[23962.046101] pm80xx0:: mpi_sata_event 2788:SATA EVENT 0x26
This is IO_XFER_ERROR_UNEXPECTED_PHASE. No clue what this is doing.
[23962.054947] pm80xx0:: mpi_sata_event 2805:task or dev null
Why ?
[23962.063865] pm80xx0:: pm80xx_send_abort_all 1796:Executing abort
task end
All queued commands are aborted when the read log completes. Again, per
ATA NCQ specs, all good. After that, normal (some useless) messages.
[23962.063964] pm80xx0:: mpi_sata_completion 2395:IO failed device_id
16388 status 0x1 tag 1
[23962.084587] pm80xx0:: mpi_sata_completion 2430:SAS Address of IO
Failure Drive:50010860002f5657
[23962.095526] sas: sas_ata_task_done: SAS error 0x8d
[23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED
[23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11
ncq dma 512 in
[23962.163307] res 43/10:00:00:00:00/00:00:00:00:00/00 Emask
0x480 (invalid argument) <F>
Good here, correct error...
[23962.182879] ata24.00: status: { DRDY SENSE ERR }
[23962.189100] ata24.00: error: { IDNF }
... but I need to look at this in libata. Getting the same with AHCI.
ok, in a way that is a relief.
[23962.215456] ata24.00: configured for UDMA/133
[23962.221469] ata24: EH complete
[23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
tries: 1
That event 0x26 and the "task or dev null" are obscure. No clue, but
they look wrong.
As above, I doubt the implemenation of that code. It alloc's a domain
device itself, which is not how things should be done.
Overall, this driver is by default way too verbose I think. I would
prefer to reduce the above to something like:
[23962.095526] sas: sas_ata_task_done: SAS error 0x8d
[23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED
[23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11
ncq dma 512 in
[23962.163307] res 43/10:00:00:00:00/00:00:00:00:00/00 Emask
0x480 (invalid argument) <F>
[23962.182879] ata24.00: status: { DRDY SENSE ERR }
[23962.189100] ata24.00: error: { IDNF }
[23962.215456] ata24.00: configured for UDMA/133
[23962.221469] ata24: EH complete
[23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
tries: 1
That is, by default, remove most FAIL messages, changing them to DEV
level. The same test with a SAS drive is totally silent, as it should
be, since the command error is not a fatal one.
But not touching this for now. I first want the series to be queued.
In response to later message:
On 17/02/2022 11:47, Damien Le Moal wrote:
>> Anyway, that can be fixed, but I still have the hang:(
> One thought: could it be bug with the DMA engine of your platform ?
> What if you simply run an fio workload on the disk directly (no FS),
> hang happens too ?
I did try fio a while ago and it worked ok. Strange.
> For the bugs I fixed with my series, it was the reverse: fio worked
> great but everything broke down when I ran libzbc tests...
>
I got a few more things to try but need to make progress on my libsas
series now ...
Thanks,
John