Re: [PATCH v3 27/31] scsi: pm8001: Cleanup pm8001_queue_command()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





I figured out what is happening here and it does not help solve the
mystery of my hang.

Here's the steps:
a. scsi_cmnd times out
b. scsi error handling kicks in
c. libsas attempts to abort the task, which fails
d. libsas then tries IT nexus reset, which passes
   - libsas assumes the scsi_cmnd has completed with failure
e. error handling concludes
f. scsi midlayer then retries the same scsi_cmnd
g. since we did not "free" associated ccb earlier or dma unmap at d.,
the dma unmap on the same scsi_cmnd causes the warn

So the LLD should really free resources and dma unmap at point IT nexus
reset completes, but it doesn't. I think in certain conditions dma map
should not be done twice.

Anyway, that can be fixed, but I still have the hang :(

I guess (a) (cmd timeout) is only the symptom of the hang ? That is, the
hang is causing the timeout ?

Right

It may be good to turn on scsi trace to see if the command was only
partially done, or not at all, or if it is a non-data command.


I could do that. But I think that the command just does not complete. Or maybe it is missed.

And speaking of errors, I am currently testing v4 of my series and
noticed some weird things in the error handling. E.g., with one of the
test executing a report zones command with an LBA out of range, I see this:

[23962.027105] pm80xx0:: mpi_sata_event  2788:SATA EVENT 0x23
[23962.036099] pm80xx0:: pm80xx_send_read_log  1863:Executing read log end


I don't know why the driver even does this, but the implementation of pm80xx_send_read_log() is questionable. It would be nice to not see ATA code in the driver like this.

All good: this is IO_XFER_ERROR_ABORTED_NCQ_MODE. And the read log is to
get the drive queue out of error state.

[23962.046101] pm80xx0:: mpi_sata_event  2788:SATA EVENT 0x26

This is IO_XFER_ERROR_UNEXPECTED_PHASE. No clue what this is doing.

[23962.054947] pm80xx0:: mpi_sata_event  2805:task or dev null

Why ?

[23962.063865] pm80xx0:: pm80xx_send_abort_all  1796:Executing abort
task end

All queued commands are aborted when the read log completes. Again, per
ATA NCQ specs, all good. After that, normal (some useless) messages.

[23962.063964] pm80xx0:: mpi_sata_completion  2395:IO failed device_id
16388 status 0x1 tag 1
[23962.084587] pm80xx0:: mpi_sata_completion  2430:SAS Address of IO
Failure Drive:50010860002f5657
[23962.095526] sas: sas_ata_task_done: SAS error 0x8d
[23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED
[23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11
ncq dma 512 in
[23962.163307]          res 43/10:00:00:00:00/00:00:00:00:00/00 Emask
0x480 (invalid argument) <F>

Good here, correct error...

[23962.182879] ata24.00: status: { DRDY SENSE ERR }
[23962.189100] ata24.00: error: { IDNF }

... but I need to look at this in libata. Getting the same with AHCI.

ok, in a way that is a relief.


[23962.215456] ata24.00: configured for UDMA/133
[23962.221469] ata24: EH complete
[23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
tries: 1

That event 0x26 and the "task or dev null" are obscure. No clue, but
they look wrong.

As above, I doubt the implemenation of that code. It alloc's a domain device itself, which is not how things should be done.


Overall, this driver is by default way too verbose I think. I would
prefer to reduce the above to something like:

[23962.095526] sas: sas_ata_task_done: SAS error 0x8d
[23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED
[23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11
ncq dma 512 in
[23962.163307]          res 43/10:00:00:00:00/00:00:00:00:00/00 Emask
0x480 (invalid argument) <F>
[23962.182879] ata24.00: status: { DRDY SENSE ERR }
[23962.189100] ata24.00: error: { IDNF }
[23962.215456] ata24.00: configured for UDMA/133
[23962.221469] ata24: EH complete
[23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
tries: 1

That is, by default, remove most FAIL messages, changing them to DEV
level. The same test with a SAS drive is totally silent, as it should
be, since the command error is not a fatal one.

But not touching this for now. I first want the series to be queued.


In response to later message:

On 17/02/2022 11:47, Damien Le Moal wrote:
>> Anyway, that can be fixed, but I still have the hang:(
> One thought: could it be bug with the DMA engine of your platform ?
> What if you simply run an fio workload on the disk directly (no FS),
> hang happens too ?

I did try fio a while ago and it worked ok. Strange.

> For the bugs I fixed with my series, it was the reverse: fio worked
> great but everything broke down when I ran libzbc tests...
>

I got a few more things to try but need to make progress on my libsas series now ...

Thanks,
John







[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux