Hi Michael,
On 2024/7/31 05:40, michael.christie@xxxxxxxxxx wrote:
On 7/24/24 8:42 AM, Gao Xiang wrote:
Hi all,
On 2022/12/8 11:45, Mike Christie wrote:
On 12/1/22 8:15 AM, Bodo Stroesser wrote:
Are we sure qla, loop and xen are the only drivers that handle aborted
TMRs incorrectly?
I'm not sure now. When we looked at this before, I was only checking
for crashes; I didn't check whether there could be issues where a
driver needs to do some cleanup in its aborted_task callout but
hasn't been doing it.
For example, ibmvscsi's aborted_task callout won't crash, because the
fields it references are valid for both I/O and TMR se_cmds. It doesn't
touch vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the
aborted_task callout and only accesses the se_cmd and the ibmvscsis_cmd,
so we are ok there. However, I didn't look at the driver to see whether
it might need to do some cleanup in the aborted_task callout that we
just haven't been doing.
The same goes for the other drivers: I only checked whether aborted_task would crash.
We also have a new driver, efct, so we need to review that as well.
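For reference, here is a minimal sketch of the "safe" aborted_task pattern described above. The struct, field, and function names are hypothetical (this is not the actual ibmvscsis code); the point is that the callout only dereferences the se_cmd and the driver's embedding per-command structure, which are valid for both I/O and TMR se_cmds:

#include <linux/kernel.h>
#include <target/target_core_base.h>

/* Hypothetical per-command structure embedding the core's se_cmd. */
struct demo_fabric_cmd {
	struct se_cmd se_cmd;
	bool rsp_queued;		/* made-up driver state */
};

/*
 * Hypothetical .aborted_task callout.  It never decodes a
 * request-format-specific payload (the vio_iu(iue)->srp.cmd case
 * above), so it cannot crash on a TMR se_cmd; whether it should also
 * be doing extra cleanup is the open question.
 */
static void demo_fabric_aborted_task(struct se_cmd *se_cmd)
{
	struct demo_fabric_cmd *cmd =
		container_of(se_cmd, struct demo_fabric_cmd, se_cmd);

	cmd->rsp_queued = false;
}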
Sorry, I have very little knowledge of TCMU, but we currently have
tasks stuck with the call traces below (a simplified sketch of the
blocking pattern follows the traces):
[811824.868078] task:kworker/u256:1 state:D stack: 0 pid:213661 ppid: 2 flags:0x00004000
[811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler
[811824.868085] Call Trace:
[811824.868091] __schedule+0x1ac/0x480
[811824.868092] schedule+0x46/0xb0
[811824.868095] schedule_timeout+0xe5/0x130
[811824.868110] ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod]
[811824.868112] ? __prepare_to_swait+0x4f/0x70
[811824.868114] wait_for_completion+0x71/0xc0
[811824.868118] tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop]
[811824.868120] tcm_loop_abort_task+0x3d/0x50 [tcm_loop]
[811824.868121] scmd_eh_abort_handler+0x7b/0x210
[811824.868124] process_one_work+0x1a8/0x340
[811824.868125] worker_thread+0x49/0x2f0
[811824.868126] ? rescuer_thread+0x350/0x350
[811824.868127] kthread+0x118/0x140
[811824.868129] ? __kthread_bind_mask+0x60/0x60
[811824.868131] ret_from_fork+0x1f/0x30
[811824.868166] task:kworker/121:2 state:D stack: 0 pid:242954 ppid: 2 flags:0x00004000
[811824.868172] Workqueue: events target_tmr_work [target_core_mod]
[811824.868172] Call Trace:
[811824.868174] __schedule+0x1ac/0x480
[811824.868175] schedule+0x46/0xb0
[811824.868176] schedule_timeout+0xe5/0x130
[811824.868177] ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[811824.868178] ? __prepare_to_swait+0x4f/0x70
[811824.868179] wait_for_completion+0x71/0xc0
[811824.868184] target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod]
[811824.868192] core_tmr_abort_task.cold+0x187/0x21a [target_core_mod]
[811824.868198] target_tmr_work+0xa3/0xf0 [target_core_mod]
[811824.868200] process_one_work+0x1a8/0x340
[811824.868201] worker_thread+0x49/0x2f0
[811824.868202] ? rescuer_thread+0x350/0x350
[811824.868202] kthread+0x118/0x140
[811824.868203] ? __kthread_bind_mask+0x60/0x60
[811824.868204] ret_from_fork+0x1f/0x30
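To illustrate the blocking pattern in the first trace: the abort handler hands a TMR to the target core and then sits in wait_for_completion() until the core invokes the fabric's TMR-response callout; if that response never comes, the scsi_tmf worker stays in D state. A simplified sketch (hypothetical names, not the actual tcm_loop code):

#include <linux/completion.h>

/* Hypothetical per-TMR state; tcm_loop keeps something roughly analogous. */
struct demo_tmr {
	struct completion tmr_done;
};

/* Hypothetical response callout: wakes the waiter once the TMR finishes. */
static void demo_queue_tm_rsp(struct demo_tmr *tmr)
{
	complete(&tmr->tmr_done);
}

/* Hypothetical abort path: this is where the first worker is stuck. */
static void demo_issue_tmr(struct demo_tmr *tmr)
{
	init_completion(&tmr->tmr_done);
	/* ... submit the TMR to the target core here ... */
	wait_for_completion(&tmr->tmr_done);	/* never returns if no response arrives */
}

As far as I can tell, the second trace is the mirror image: core_tmr_abort_task() is blocked in target_put_cmd_and_wait(), waiting for the aborted command's remaining references to be dropped.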
I'm not sure how to recover from this state. Is it resolved upstream?
It's not.
I think the safest thing is to just revert the patch that caused this,
commit db5b21a24e01d354 ("scsi: target/core: Use system workqueues for TMF"),
because we don't currently have the resources to fix qla.
Do you want to send the patch? If not, I can send it.
Thanks for the reply.
I have very little knowledge of this area and no time to work on it; I
just wanted to confirm its current status. But yeah, it'd be better to
resolve it anyway.
Thanks,
Gao Xiang