Hi Michael,
On 2024/7/31 05:40, michael.christie@xxxxxxxxxx wrote:
On 7/24/24 8:42 AM, Gao Xiang wrote:
Hi all,
On 2022/12/8 11:45, Mike Christie wrote:
On 12/1/22 8:15 AM, Bodo Stroesser wrote:
Are we sure qla, loop and xen are the only drivers that handle aborted
TMRs incorrectly?
I'm not sure now. When we looked at this before, I was only checking
for crashes; I didn't check whether there could be issues where a
driver needs to do some cleanup in its aborted_task callout but
hasn't been doing it.
For example, ibmvscsi's aborted_task callout won't crash, because the
fields it references are valid for both I/O and TMR se_cmds. It doesn't
touch vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the
aborted_task callout and only accesses the se_cmd and the ibmvscsis_cmd,
so we are ok there. However, I didn't look at the driver to see whether
it might need to do some cleanup in the aborted_task callout that we
just haven't been doing.
The same goes for the other drivers: I only checked whether aborted_task would crash.
We also have a new driver, efct, so we need to review that as well.
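For reference, here is a minimal sketch of the "safe" aborted_task pattern described above. The struct, field, and function names are hypothetical (this is not the actual ibmvscsis code); the point is that the callout only dereferences the se_cmd and the driver's embedding per-command structure, which are valid for both I/O and TMR se_cmds:

#include <linux/kernel.h>
#include <target/target_core_base.h>

/* Hypothetical per-command structure embedding the core's se_cmd. */
struct demo_fabric_cmd {
	struct se_cmd se_cmd;
	bool rsp_queued;		/* made-up driver state */
};

/*
 * Hypothetical .aborted_task callout.  It never decodes a
 * request-format-specific payload (the vio_iu(iue)->srp.cmd case
 * above), so it cannot crash on a TMR se_cmd; whether it should also
 * be doing extra cleanup is the open question.
 */
static void demo_fabric_aborted_task(struct se_cmd *se_cmd)
{
	struct demo_fabric_cmd *cmd =
		container_of(se_cmd, struct demo_fabric_cmd, se_cmd);

	cmd->rsp_queued = false;
}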
Sorry, I have very little knowledge of TCMU, but we currently have
tasks stuck with the call traces below (a simplified sketch of the
blocking pattern follows the traces):
[811824.868078] task:kworker/u256:1 state:D stack: 0 pid:213661 ppid: 2 flags:0x00004000
[811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler
[811824.868085] Call Trace:
[811824.868091] __schedule+0x1ac/0x480
[811824.868092] schedule+0x46/0xb0
[811824.868095] schedule_timeout+0xe5/0x130
[811824.868110] ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod]
[811824.868112] ? __prepare_to_swait+0x4f/0x70
[811824.868114] wait_for_completion+0x71/0xc0
[811824.868118] tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop]
[811824.868120] tcm_loop_abort_task+0x3d/0x50 [tcm_loop]
[811824.868121] scmd_eh_abort_handler+0x7b/0x210
[811824.868124] process_one_work+0x1a8/0x340
[811824.868125] worker_thread+0x49/0x2f0
[811824.868126] ? rescuer_thread+0x350/0x350
[811824.868127] kthread+0x118/0x140
[811824.868129] ? __kthread_bind_mask+0x60/0x60
[811824.868131] ret_from_fork+0x1f/0x30
[811824.868166] task:kworker/121:2 state:D stack: 0 pid:242954 ppid: 2 flags:0x00004000
[811824.868172] Workqueue: events target_tmr_work [target_core_mod]
[811824.868172] Call Trace:
[811824.868174] __schedule+0x1ac/0x480
[811824.868175] schedule+0x46/0xb0
[811824.868176] schedule_timeout+0xe5/0x130
[811824.868177] ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[811824.868178] ? __prepare_to_swait+0x4f/0x70
[811824.868179] wait_for_completion+0x71/0xc0
[811824.868184] target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod]
[811824.868192] core_tmr_abort_task.cold+0x187/0x21a [target_core_mod]
[811824.868198] target_tmr_work+0xa3/0xf0 [target_core_mod]
[811824.868200] process_one_work+0x1a8/0x340
[811824.868201] worker_thread+0x49/0x2f0
[811824.868202] ? rescuer_thread+0x350/0x350
[811824.868202] kthread+0x118/0x140
[811824.868203] ? __kthread_bind_mask+0x60/0x60
[811824.868204] ret_from_fork+0x1f/0x30
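To illustrate the blocking pattern in the first trace: the abort handler hands a TMR to the target core and then sits in wait_for_completion() until the core invokes the fabric's TMR-response callout; if that response never comes, the scsi_tmf worker stays in D state. A simplified sketch (hypothetical names, not the actual tcm_loop code):

#include <linux/completion.h>

/* Hypothetical per-TMR state; tcm_loop keeps something roughly analogous. */
struct demo_tmr {
	struct completion tmr_done;
};

/* Hypothetical response callout: wakes the waiter once the TMR finishes. */
static void demo_queue_tm_rsp(struct demo_tmr *tmr)
{
	complete(&tmr->tmr_done);
}

/* Hypothetical abort path: this is where the first worker is stuck. */
static void demo_issue_tmr(struct demo_tmr *tmr)
{
	init_completion(&tmr->tmr_done);
	/* ... submit the TMR to the target core here ... */
	wait_for_completion(&tmr->tmr_done);	/* never returns if no response arrives */
}

As far as I can tell, the second trace is the mirror image: core_tmr_abort_task() is blocked in target_put_cmd_and_wait(), waiting for the aborted command's remaining references to be dropped.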
I'm not sure how to recover from this state. Is it resolved upstream?
It's not.
I think the safest thing is to just revert the patch that caused this,
commit db5b21a24e01d354 ("scsi: target/core: Use system workqueues for TMF"),
because we don't currently have the resources to fix qla.
Do you want to send the patch? If not, I can send it.
Thanks for the reply.
I have very little knowledge of this area and no time to work on it; I
just wanted to confirm its current status. But yeah, it'd be better to
resolve it anyway.
Thanks,
Gao Xiang