Re: [PATCH v3 27/31] scsi: pm8001: Cleanup pm8001_queue_command()

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Thu, 17 Feb 2022 09:12:02 +0900

On 2/16/22 21:21, John Garry wrote:
> 
>>> JFYI, turning on DMA debug sometimes gives this even after fdisk -l:
>>>
>>> [   45.080945] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
>>> [   45.087582] pm80xx0:: mpi_ssp_completion  1936:sas IO status 0x3b
>>
>> What is status 0x3b ? Is this a driver thing or sas thing ? Have not
>> checked.
> 
> This is a driver thing. I'd need the manual to check.
> 
>>
>>> [   45.093681] pm80xx0:: mpi_ssp_completion  1947:SAS Address of IO
>>> Failure Drive:5000c50085ff5559
>>> [   45.102641] pm80xx0:: mpi_ssp_completion  1936:sas IO status 0x3b
>>> [   45.108739] pm80xx0:: mpi_ssp_completion  1947:SAS Address of IO
>>> Failure Drive:5000c50085ff5559
>>> [   45.117694] pm80xx0:: mpi_ssp_completion  1936:sas IO status 0x3b
>>> [   45.123792] pm80xx0:: mpi_ssp_completion  1947:SAS Address of IO
>>> Failure Drive:5000c50085ff5559
>>> [   45.132652] pm80xx: rc= -5
>>
>> This comes from pm8001_query_task(), pm8001_abort_task() or
>> pm8001_chip_abort_task()...
>>
>>> [   45.135370] sas: sas_scsi_find_task: task 0x(____ptrval____) result
>>> code -5 not handled
>>
>> Missing error handling ?
> 
> This is something I added. So the driver does not return a valid TMF 
> code - it returns -5, which I think comes from pm8001_query_task() -> 
> pm8001_exec_internal_tmf_task(). And sas_scsi_find_task() does not 
> recognise -5 and just assumes that the TMF failed, so ...
> 
>>
>>> [   45.143466] sas: task 0x(____ptrval____) is not at LU: I_T recover
>>> [   45.149741] sas: I_T nexus reset for dev 5000c50085ff5559
>>> [   47.183916] sas: I_T 5000c50085ff5559 recovered
>>
>> Weird... Losing your drive ? Bad cable ?
> 
> .. we escalate the error handling and call sas_eh_handle_sas_errors() -> 
> sas_recover_I_T(), which resets the PHY - see pm8001_I_T_nexus_reset().
> 
>>
>>> [   47.189034] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1
>>> tries: 1
>>> [   47.204168] ------------[ cut here ]------------
>>> [   47.208829] DMA-API: pm80xx 0000:04:00.0: cacheline tracking EEXIST,
>>> overlapping mappings aren't supported
>>> [   47.218502] WARNING: CPU: 3 PID: 641 at kernel/dma/debug.c:570
>>> add_dma_entry+0x308/0x3f0
>>> [   47.226607] Modules linked in:
>>> [   47.229678] CPU: 3 PID: 641 Comm: kworker/3:1H Not tainted
>>> 5.17.0-rc1-11918-gd9d909a8c666 #407
>>> [   47.238298] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI
>>> RC0 - V1.16.01 03/15/2019
>>> [   47.246829] Workqueue: kblockd blk_mq_run_work_fn
>>> [   47.251552] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS
>>> BTYPE=--)
>>> [   47.258522] pc : add_dma_entry+0x308/0x3f0
>>> [   47.262626] lr : add_dma_entry+0x308/0x3f0
>>> [   47.266730] sp : ffff80002e5c75f0
>>> [   47.270049] x29: ffff80002e5c75f0 x28: 0000002880a908c0 x27:
>>> ffff80000cc95440
>>> [   47.277216] x26: ffff80000cc94000 x25: ffff80000cc94e20 x24:
>>> ffff00208e4660c8
>>> [   47.284382] x23: ffff800009d16b40 x22: ffff80000a5b8700 x21:
>>> 1ffff00005cb8eca
>>> [   47.291548] x20: ffff80000caf4c90 x19: ffff0a2009726100 x18:
>>> 0000000000000000
>>> [   47.298713] x17: 70616c7265766f20 x16: 2c54534958454520 x15:
>>> 676e696b63617274
>>> [   47.305879] x14: 1ffff00005cb8df4 x13: 0000000041b58ab3 x12:
>>> ffff700005cb8e27
>>> [   47.313044] x11: 1ffff00005cb8e26 x10: ffff700005cb8e26 x9 :
>>> dfff800000000000
>>> [   47.320210] x8 : ffff80002e5c7137 x7 : 0000000000000001 x6 :
>>> 00008ffffa3471da
>>> [   47.327375] x5 : ffff80002e5c7130 x4 : dfff800000000000 x3 :
>>> ffff8000083a1f48
>>> [   47.334540] x2 : 0000000000000000 x1 : 0000000000000000 x0 :
>>> ffff00208f7ab200
>>> [   47.341706] Call trace:
>>> [   47.344157]  add_dma_entry+0x308/0x3f0
>>> [   47.347914]  debug_dma_map_sg+0x3ac/0x500
>>> [   47.351931]  __dma_map_sg_attrs+0xac/0x130
>>> [   47.356037]  dma_map_sg_attrs+0x14/0x2c
>>> [   47.359883]  pm8001_task_exec.constprop.0+0x5e0/0x800
>>> [   47.364945]  pm8001_queue_command+0x1c/0x2c
>>> [   47.369136]  sas_queuecommand+0x2c4/0x360
>>> [   47.373153]  scsi_queue_rq+0x810/0x1334
>>> [   47.377000]  blk_mq_dispatch_rq_list+0x340/0xda0
>>> [   47.381625]  __blk_mq_sched_dispatch_requests+0x14c/0x22c
>>> [   47.387034]  blk_mq_sched_dispatch_requests+0x60/0x9c
>>> [   47.392095]  __blk_mq_run_hw_queue+0xc8/0x274
>>> [   47.396460]  blk_mq_run_work_fn+0x30/0x40
>>> [   47.400476]  process_one_work+0x494/0xbac
>>> [   47.404494]  worker_thread+0xac/0x6d0
>>> [   47.408164]  kthread+0x174/0x184
>>> [   47.411401]  ret_from_fork+0x10/0x2
>>>
>>> I'll have a look at it. And that is on mainline or mkp-scsi staging, and
>>> not your patchset.
>>
>> Are you saying that my patches suppress the above ? This is submission
>> path and the dma code seems to complain about alignment... So bad buffer
>> addresses ?
> 
> Your series does not suppress it. It doesn't occur often, so I need to 
> check more.
> 
> I think the issue is that we call dma_map_sg() twice, i.e. ccb never 
> unmapped.

That would be a big issue indeed. We could add a flag to the CCBs to
track the buf_prd DMA mapping state, and BUG_ON() when the ccb free
function is called with the buffer still mapped. That should allow
catching this infrequent problem ?
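
To make the idea concrete, here is a rough userspace sketch of the flag
tracking. All names in it (struct ccb_sketch, buf_prd_mapped, the
ccb_sketch_*() helpers) are made up for illustration and are not the
actual pm8001 structures or functions. In the driver, the map/unmap
helpers would wrap the real dma_map_sg()/dma_unmap_sg() calls, and the
free check would be a BUG_ON() or WARN_ON() rather than a return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver CCB; only the fields needed here. */
struct ccb_sketch {
	bool buf_prd_mapped;	/* assumed new flag: set while the sg list is mapped */
	int n_elem;		/* number of sg elements currently mapped */
};

/*
 * Stand-in for the mapping done in the submission path.
 * Refuses to map a CCB that is already mapped, which is the
 * suspected "dma_map_sg() called twice" case.
 */
int ccb_sketch_map(struct ccb_sketch *ccb, int nents)
{
	if (ccb->buf_prd_mapped)
		return -1;	/* double map detected */
	ccb->n_elem = nents;
	ccb->buf_prd_mapped = true;
	return 0;
}

/* Stand-in for the unmap done on command completion. */
void ccb_sketch_unmap(struct ccb_sketch *ccb)
{
	ccb->n_elem = 0;
	ccb->buf_prd_mapped = false;
}

/*
 * Check to run in the ccb free function. Returns false if the
 * buffer is still mapped; the kernel version would BUG_ON() here.
 */
bool ccb_sketch_free_ok(const struct ccb_sketch *ccb)
{
	return !ccb->buf_prd_mapped;
}
```

With something like this, a second map on the same CCB or a free with
the buffer still mapped is caught at the point where it happens, instead
of showing up later as a DMA debug warning.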

> 
> Thanks,
> John
> 

-- 
Damien Le Moal
Western Digital Research