On 2/17/22 18:23, John Garry wrote: > On 17/02/2022 00:12, Damien Le Moal wrote: >>>>> I'll have a look at it. And that is on mainline or mkp-scsi staging, and >>>>> not your patchset. >>>> Are you saying that my patches suppresses the above ? This is submission >>>> path and the dma code seems to complain about alignment... So bad buffer >>>> addresses ? >>> Your series does not suppress it. It doesn't occur often, so I need to >>> check more. >>> >>> I think the issue is that we call dma_map_sg() twice, i.e. ccb never >>> unmapped. >> That would be a big issue indeed. We could add a flag to CCBs to track >> the buf_prd DMA mapping state and BUG_ON() when ccb free function is >> called with the buffer still mapped. That should allow catching this >> infrequent problem ? >> > > I figured out what is happening here and it does not help solve the > mystery of my hang. > > Here's the steps: > a. scsi_cmnd times out > b. scsi error handling kicks in > c. libsas attempts to abort the task, which fails > d. libsas then tries IT nexus reset, which passes > - libsas assumes the scsi_cmnd has completed with failure > e. error handling concludes > f. scsi midlayer then retries the same scsi_cmnd > g. since we did not "free" associated ccb earlier or dma unmap at d., > the dma unmap on the same scsi_cmnd causes the warn > > So the LLD should really free resources and dma unmap at point IT nexus > reset completes, but it doesn't. I think in certain conditions dma map > should not be done twice. > > Anyway, that can be fixed, but I still have the hang :( I guess (a) (cmd timeout) is only the symptom of the hang ? That is, the hang is causing the timeout ? It may be good to turn on scsi trace to see if the command was only partially done, or not at all, or if it is a non-data command. And speaking of errors, I am currently testing v4 of my series and noticed some weird things in the error handling. E.g., with one of the test executing a report zones command with an LBA out of range, I see this: [23962.027105] pm80xx0:: mpi_sata_event 2788:SATA EVENT 0x23 [23962.036099] pm80xx0:: pm80xx_send_read_log 1863:Executing read log end All good: this is IO_XFER_ERROR_ABORTED_NCQ_MODE. And the read log is to get the drive queue out of error state. [23962.046101] pm80xx0:: mpi_sata_event 2788:SATA EVENT 0x26 This is IO_XFER_ERROR_UNEXPECTED_PHASE. No clue what this is doing. [23962.054947] pm80xx0:: mpi_sata_event 2805:task or dev null Why ? [23962.063865] pm80xx0:: pm80xx_send_abort_all 1796:Executing abort task end All queued commands are aborted when the read log completes. Again, per ATA NCQ specs, all good. After that, normal (some useless) messages. [23962.063964] pm80xx0:: mpi_sata_completion 2395:IO failed device_id 16388 status 0x1 tag 1 [23962.084587] pm80xx0:: mpi_sata_completion 2430:SAS Address of IO Failure Drive:50010860002f5657 [23962.095526] sas: sas_ata_task_done: SAS error 0x8d [23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1 [23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0 [23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED [23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11 ncq dma 512 in [23962.163307] res 43/10:00:00:00:00/00:00:00:00:00/00 Emask 0x480 (invalid argument) <F> Good here, correct error... [23962.182879] ata24.00: status: { DRDY SENSE ERR } [23962.189100] ata24.00: error: { IDNF } ... but I need to look at this in libata. Getting the same with AHCI. [23962.215456] ata24.00: configured for UDMA/133 [23962.221469] ata24: EH complete [23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1 That event 0x26 and the "task or dev null" are obscure. No clue, but they look wrong. Overall, this driver is by default way too verbose I think. I would prefer to reduce the above to something like: [23962.095526] sas: sas_ata_task_done: SAS error 0x8d [23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1 [23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0 [23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED [23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11 ncq dma 512 in [23962.163307] res 43/10:00:00:00:00/00:00:00:00:00/00 Emask 0x480 (invalid argument) <F> [23962.182879] ata24.00: status: { DRDY SENSE ERR } [23962.189100] ata24.00: error: { IDNF } [23962.215456] ata24.00: configured for UDMA/133 [23962.221469] ata24: EH complete [23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1 That is, by default, remove most FAIL messages, changing them to DEV level. The same test with a SAS drive is totally silent, as it should be, since the command error is not a fatal one. But not touching this for now. I first want the series to be queued. -- Damien Le Moal Western Digital Research