Re: [PATCH v3 27/31] scsi: pm8001: Cleanup pm8001_queue_command()

John Garry <john.garry@xxxxxxxxxx> · Thu, 17 Feb 2022 12:49:18 +0000

I figured out what is happening here and it does not help solve the
mystery of my hang.

Here's the steps:
a. scsi_cmnd times out
b. scsi error handling kicks in
c. libsas attempts to abort the task, which fails
d. libsas then tries IT nexus reset, which passes
   - libsas assumes the scsi_cmnd has completed with failure
e. error handling concludes
f. scsi midlayer then retries the same scsi_cmnd
g. since we did not "free" associated ccb earlier or dma unmap at d.,
the dma unmap on the same scsi_cmnd causes the warn

So the LLD should really free resources and dma unmap at point IT nexus
reset completes, but it doesn't. I think in certain conditions dma map
should not be done twice.

Anyway, that can be fixed, but I still have the hang :(

I guess (a) (cmd timeout) is only the symptom of the hang ? That is, the
hang is causing the timeout ?

Right

It may be good to turn on scsi trace to see if the command was only
partially done, or not at all, or if it is a non-data command.

I could do that. But I think that the command just does not complete. Or 
maybe it is missed.

And speaking of errors, I am currently testing v4 of my series and
noticed some weird things in the error handling. E.g., with one of the
test executing a report zones command with an LBA out of range, I see this:

[23962.027105] pm80xx0:: mpi_sata_event  2788:SATA EVENT 0x23
[23962.036099] pm80xx0:: pm80xx_send_read_log  1863:Executing read log end

I don't know why the driver even does this, but the implementation of 
pm80xx_send_read_log() is questionable. It would be nice to not see ATA 
code in the driver like this.

All good: this is IO_XFER_ERROR_ABORTED_NCQ_MODE. And the read log is to
get the drive queue out of error state.

[23962.046101] pm80xx0:: mpi_sata_event  2788:SATA EVENT 0x26

This is IO_XFER_ERROR_UNEXPECTED_PHASE. No clue what this is doing.

[23962.054947] pm80xx0:: mpi_sata_event  2805:task or dev null

Why ?

[23962.063865] pm80xx0:: pm80xx_send_abort_all  1796:Executing abort
task end

All queued commands are aborted when the read log completes. Again, per
ATA NCQ specs, all good. After that, normal (some useless) messages.

[23962.063964] pm80xx0:: mpi_sata_completion  2395:IO failed device_id
16388 status 0x1 tag 1
[23962.084587] pm80xx0:: mpi_sata_completion  2430:SAS Address of IO
Failure Drive:50010860002f5657
[23962.095526] sas: sas_ata_task_done: SAS error 0x8d
[23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED
[23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11
ncq dma 512 in
[23962.163307]          res 43/10:00:00:00:00/00:00:00:00:00/00 Emask
0x480 (invalid argument) <F>

Good here, correct error...

[23962.182879] ata24.00: status: { DRDY SENSE ERR }
[23962.189100] ata24.00: error: { IDNF }

... but I need to look at this in libata. Getting the same with AHCI.

ok, in a way that is a relief.

[23962.215456] ata24.00: configured for UDMA/133
[23962.221469] ata24: EH complete
[23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
tries: 1

That event 0x26 and the "task or dev null" are obscure. No clue, but
they look wrong.

As above, I doubt the implemenation of that code. It alloc's a domain 
device itself, which is not how things should be done.

Overall, this driver is by default way too verbose I think. I would
prefer to reduce the above to something like:

[23962.095526] sas: sas_ata_task_done: SAS error 0x8d
[23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED
[23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11
ncq dma 512 in
[23962.163307]          res 43/10:00:00:00:00/00:00:00:00:00/00 Emask
0x480 (invalid argument) <F>
[23962.182879] ata24.00: status: { DRDY SENSE ERR }
[23962.189100] ata24.00: error: { IDNF }
[23962.215456] ata24.00: configured for UDMA/133
[23962.221469] ata24: EH complete
[23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
tries: 1

That is, by default, remove most FAIL messages, changing them to DEV
level. The same test with a SAS drive is totally silent, as it should
be, since the command error is not a fatal one.

But not touching this for now. I first want the series to be queued.

In response to later message:

On 17/02/2022 11:47, Damien Le Moal wrote:
>> Anyway, that can be fixed, but I still have the hang:(
> One thought: could it be bug with the DMA engine of your platform ?
> What if you simply run an fio workload on the disk directly (no FS),
> hang happens too ?

I did try fio a while ago and it worked ok. Strange.

> For the bugs I fixed with my series, it was the reverse: fio worked
> great but everything broke down when I ran libzbc tests...
>

I got a few more things to try but need to make progress on my libsas 
series now ...

Thanks,
John