On 7/24/24 8:42 AM, Gao Xiang wrote:
> Hi all,
>
> On 2022/12/8 11:45, Mike Christie wrote:
>> On 12/1/22 8:15 AM, Bodo Stroesser wrote:
>>> Are we sure qla, loop and xen are the only drivers that handle aborted
>>> TMRs incorrectly?
>>
>> I'm not sure now. When we looked at this before, I was only checking
>> for crashes; I didn't check whether there could be issues like a driver
>> needing to do some cleanup in its aborted_task callout that it hasn't
>> been doing.
>>
>> For example, ibmvscsi's aborted_task callout won't crash, because the
>> fields it references are valid for both I/O and TMR se_cmds. It doesn't
>> touch vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the
>> aborted_task callout and only accesses the se_cmd and ibmvscsis_cmd, so
>> we are OK there. However, I didn't review the driver to see whether it
>> needs to do some cleanup in the aborted_task callout that we just
>> haven't been doing.
>>
>> The same goes for the other drivers: I only checked whether
>> aborted_task would crash. We also have a new driver, efct, so we need
>> to review that as well.
>
> Sorry, I have very little knowledge of TCMU, but we currently have some
> tasks stuck with call traces like the ones below:
>
> [811824.868078] task:kworker/u256:1  state:D stack:    0 pid:213661 ppid:     2 flags:0x00004000
> [811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler
> [811824.868085] Call Trace:
> [811824.868091]  __schedule+0x1ac/0x480
> [811824.868092]  schedule+0x46/0xb0
> [811824.868095]  schedule_timeout+0xe5/0x130
> [811824.868110]  ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod]
> [811824.868112]  ? __prepare_to_swait+0x4f/0x70
> [811824.868114]  wait_for_completion+0x71/0xc0
> [811824.868118]  tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop]
> [811824.868120]  tcm_loop_abort_task+0x3d/0x50 [tcm_loop]
> [811824.868121]  scmd_eh_abort_handler+0x7b/0x210
> [811824.868124]  process_one_work+0x1a8/0x340
> [811824.868125]  worker_thread+0x49/0x2f0
> [811824.868126]  ? rescuer_thread+0x350/0x350
> [811824.868127]  kthread+0x118/0x140
> [811824.868129]  ? __kthread_bind_mask+0x60/0x60
> [811824.868131]  ret_from_fork+0x1f/0x30
>
> [811824.868166] task:kworker/121:2   state:D stack:    0 pid:242954 ppid:     2 flags:0x00004000
> [811824.868172] Workqueue: events target_tmr_work [target_core_mod]
> [811824.868172] Call Trace:
> [811824.868174]  __schedule+0x1ac/0x480
> [811824.868175]  schedule+0x46/0xb0
> [811824.868176]  schedule_timeout+0xe5/0x130
> [811824.868177]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
> [811824.868178]  ? __prepare_to_swait+0x4f/0x70
> [811824.868179]  wait_for_completion+0x71/0xc0
> [811824.868184]  target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod]
> [811824.868192]  core_tmr_abort_task.cold+0x187/0x21a [target_core_mod]
> [811824.868198]  target_tmr_work+0xa3/0xf0 [target_core_mod]
> [811824.868200]  process_one_work+0x1a8/0x340
> [811824.868201]  worker_thread+0x49/0x2f0
> [811824.868202]  ? rescuer_thread+0x350/0x350
> [811824.868202]  kthread+0x118/0x140
> [811824.868203]  ? __kthread_bind_mask+0x60/0x60
> [811824.868204]  ret_from_fork+0x1f/0x30
>
> I'm not sure how to recover from this state. Is it resolved upstream?

It's not. I think the safest thing is to revert the patch that caused this, commit db5b21a24e01d354 ("scsi: target/core: Use system workqueues for TMF"), because we don't currently have the resources to fix qla. Do you want to send the patch? If not, I can send it.
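For anyone preparing that patch, the mechanical step is a plain `git revert` of the commit named above. The sketch below only illustrates the workflow in a throwaway repository (the temp repo, file name, and identity settings are stand-ins, not part of the kernel tree); in an actual checkout the single equivalent command is `git revert db5b21a24e01d354`:

```shell
#!/bin/sh
# Illustration only: the git-revert workflow, demonstrated in a throwaway
# repository. In a real kernel tree the equivalent step would be:
#   git revert db5b21a24e01d354
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "base"
# Stand-in for the commit being reverted:
echo "use system workqueues for TMF" > tmf.txt
git add tmf.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "scsi: target/core: Use system workqueues for TMF"
# Revert it (HEAD here stands in for db5b21a24e01d354); --no-edit keeps
# the auto-generated 'Revert "..."' commit message:
git -c user.name=demo -c user.email=demo@example.com revert --no-edit HEAD
test ! -e tmf.txt          # the reverted change is gone again
git log --oneline -1       # shows the new Revert commit
```

The generated revert commit then gets a changelog explaining the hang (the traces above) before it is posted.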