Re: [PATCH v3 27/31] scsi: pm8001: Cleanup pm8001_queue_command()

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Thu, 17 Feb 2022 19:47:15 +0900

On 2/17/22 18:23, John Garry wrote:
> On 17/02/2022 00:12, Damien Le Moal wrote:
>>>>> I'll have a look at it. And that is on mainline or mkp-scsi staging, and
>>>>> not your patchset.
>>>> Are you saying that my patches suppresses the above ? This is submission
>>>> path and the dma code seems to complain about alignment... So bad buffer
>>>> addresses ?
>>> Your series does not suppress it. It doesn't occur often, so I need to
>>> check more.
>>>
>>> I think the issue is that we call dma_map_sg() twice, i.e. ccb never
>>> unmapped.
>> That would be a big issue indeed. We could add a flag to CCBs to track
>> the buf_prd DMA mapping state and BUG_ON() when ccb free function is
>> called with the buffer still mapped. That should allow catching this
>> infrequent problem ?
>>
> 
> I figured out what is happening here and it does not help solve the 
> mystery of my hang.
> 
> Here's the steps:
> a. scsi_cmnd times out
> b. scsi error handling kicks in
> c. libsas attempts to abort the task, which fails
> d. libsas then tries IT nexus reset, which passes
>   - libsas assumes the scsi_cmnd has completed with failure
> e. error handling concludes
> f. scsi midlayer then retries the same scsi_cmnd
> g. since we did not "free" associated ccb earlier or dma unmap at d., 
> the dma unmap on the same scsi_cmnd causes the warn
> 
> So the LLD should really free resources and dma unmap at point IT nexus 
> reset completes, but it doesn't. I think in certain conditions dma map 
> should not be done twice.
> 
> Anyway, that can be fixed, but I still have the hang :(

I guess (a) (cmd timeout) is only the symptom of the hang ? That is, the
hang is causing the timeout ?
It may be good to turn on scsi trace to see if the command was only
partially done, or not at all, or if it is a non-data command.

And speaking of errors, I am currently testing v4 of my series and
noticed some weird things in the error handling. E.g., with one of the
test executing a report zones command with an LBA out of range, I see this:

[23962.027105] pm80xx0:: mpi_sata_event  2788:SATA EVENT 0x23
[23962.036099] pm80xx0:: pm80xx_send_read_log  1863:Executing read log end

All good: this is IO_XFER_ERROR_ABORTED_NCQ_MODE. And the read log is to
get the drive queue out of error state.

[23962.046101] pm80xx0:: mpi_sata_event  2788:SATA EVENT 0x26

This is IO_XFER_ERROR_UNEXPECTED_PHASE. No clue what this is doing.

[23962.054947] pm80xx0:: mpi_sata_event  2805:task or dev null

Why ?

[23962.063865] pm80xx0:: pm80xx_send_abort_all  1796:Executing abort
task end

All queued commands are aborted when the read log completes. Again, per
ATA NCQ specs, all good. After that, normal (some useless) messages.

[23962.063964] pm80xx0:: mpi_sata_completion  2395:IO failed device_id
16388 status 0x1 tag 1
[23962.084587] pm80xx0:: mpi_sata_completion  2430:SAS Address of IO
Failure Drive:50010860002f5657
[23962.095526] sas: sas_ata_task_done: SAS error 0x8d
[23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED
[23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11
ncq dma 512 in
[23962.163307]          res 43/10:00:00:00:00/00:00:00:00:00/00 Emask
0x480 (invalid argument) <F>

Good here, correct error...

[23962.182879] ata24.00: status: { DRDY SENSE ERR }
[23962.189100] ata24.00: error: { IDNF }

... but I need to look at this in libata. Getting the same with AHCI.

[23962.215456] ata24.00: configured for UDMA/133
[23962.221469] ata24: EH complete
[23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
tries: 1

That event 0x26 and the "task or dev null" are obscure. No clue, but
they look wrong.

Overall, this driver is by default way too verbose I think. I would
prefer to reduce the above to something like:

[23962.095526] sas: sas_ata_task_done: SAS error 0x8d
[23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED
[23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11
ncq dma 512 in
[23962.163307]          res 43/10:00:00:00:00/00:00:00:00:00/00 Emask
0x480 (invalid argument) <F>
[23962.182879] ata24.00: status: { DRDY SENSE ERR }
[23962.189100] ata24.00: error: { IDNF }
[23962.215456] ata24.00: configured for UDMA/133
[23962.221469] ata24: EH complete
[23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
tries: 1

That is, by default, remove most FAIL messages, changing them to DEV
level. The same test with a SAS drive is totally silent, as it should
be, since the command error is not a fatal one.

But not touching this for now. I first want the series to be queued.

-- 
Damien Le Moal
Western Digital Research