On 2/17/22 18:23, John Garry wrote:
> On 17/02/2022 00:12, Damien Le Moal wrote:
>>>>> I'll have a look at it. And that is on mainline or mkp-scsi staging, and
>>>>> not your patchset.
>>>> Are you saying that my patches suppresses the above ? This is submission
>>>> path and the dma code seems to complain about alignment... So bad buffer
>>>> addresses ?
>>> Your series does not suppress it. It doesn't occur often, so I need to
>>> check more.
>>>
>>> I think the issue is that we call dma_map_sg() twice, i.e. ccb never
>>> unmapped.
>> That would be a big issue indeed. We could add a flag to CCBs to track
>> the buf_prd DMA mapping state and BUG_ON() when ccb free function is
>> called with the buffer still mapped. That should allow catching this
>> infrequent problem ?
>>
>
> I figured out what is happening here and it does not help solve the
> mystery of my hang.
>
> Here's the steps:
> a. scsi_cmnd times out
> b. scsi error handling kicks in
> c. libsas attempts to abort the task, which fails
> d. libsas then tries IT nexus reset, which passes
>    - libsas assumes the scsi_cmnd has completed with failure
> e. error handling concludes
> f. scsi midlayer then retries the same scsi_cmnd
> g. since we did not "free" associated ccb earlier or dma unmap at d.,
>    the dma unmap on the same scsi_cmnd causes the warn
>
> So the LLD should really free resources and dma unmap at point IT nexus
> reset completes, but it doesn't. I think in certain conditions dma map
> should not be done twice.
>
> Anyway, that can be fixed, but I still have the hang :(

One thought: could it be a bug with the DMA engine of your platform?
If you simply run an fio workload on the disk directly (no FS), does the
hang happen too? For the bugs I fixed with my series, it was the reverse:
fio worked great but everything broke down when I ran libzbc tests...

>
> Thanks,
> John

--
Damien Le Moal
Western Digital Research