On 7/24/24 8:42 AM, Gao Xiang wrote:
> Hi all,
>
> On 2022/12/8 11:45, Mike Christie wrote:
>> On 12/1/22 8:15 AM, Bodo Stroesser wrote:
>>> Are we sure qla, loop and xen are the only drivers that handle aborted
>>> TMRs incorrectly?
>>
>> I'm not sure now. When we looked at this before, I was only checking
>> for crashes; I didn't check whether there could be issues like a driver
>> needing to do some cleanup in its aborted_task callout that it hasn't
>> been doing.
>>
>> For example, ibmvscsi's aborted_task callout won't crash, because the
>> fields it references are valid for both I/O and TMR se_cmds. It doesn't
>> touch vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the
>> aborted_task callout and only accesses the se_cmd and ibmvscsis_cmd, so
>> we are OK there. However, I didn't review the driver to see whether it
>> needs to do some cleanup in the aborted_task callout that we just
>> haven't been doing.
>>
>> The same goes for the other drivers: I only checked whether
>> aborted_task would crash. We also have a new driver, efct, so we need
>> to review that as well.
>
> Sorry, I have very little knowledge of TCMU, but we currently have some
> tasks stuck with call traces like the ones below:
>
> [811824.868078] task:kworker/u256:1  state:D stack:    0 pid:213661 ppid:     2 flags:0x00004000
> [811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler
> [811824.868085] Call Trace:
> [811824.868091]  __schedule+0x1ac/0x480
> [811824.868092]  schedule+0x46/0xb0
> [811824.868095]  schedule_timeout+0xe5/0x130
> [811824.868110]  ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod]
> [811824.868112]  ? __prepare_to_swait+0x4f/0x70
> [811824.868114]  wait_for_completion+0x71/0xc0
> [811824.868118]  tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop]
> [811824.868120]  tcm_loop_abort_task+0x3d/0x50 [tcm_loop]
> [811824.868121]  scmd_eh_abort_handler+0x7b/0x210
> [811824.868124]  process_one_work+0x1a8/0x340
> [811824.868125]  worker_thread+0x49/0x2f0
> [811824.868126]  ? rescuer_thread+0x350/0x350
> [811824.868127]  kthread+0x118/0x140
> [811824.868129]  ? __kthread_bind_mask+0x60/0x60
> [811824.868131]  ret_from_fork+0x1f/0x30
>
> [811824.868166] task:kworker/121:2   state:D stack:    0 pid:242954 ppid:     2 flags:0x00004000
> [811824.868172] Workqueue: events target_tmr_work [target_core_mod]
> [811824.868172] Call Trace:
> [811824.868174]  __schedule+0x1ac/0x480
> [811824.868175]  schedule+0x46/0xb0
> [811824.868176]  schedule_timeout+0xe5/0x130
> [811824.868177]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
> [811824.868178]  ? __prepare_to_swait+0x4f/0x70
> [811824.868179]  wait_for_completion+0x71/0xc0
> [811824.868184]  target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod]
> [811824.868192]  core_tmr_abort_task.cold+0x187/0x21a [target_core_mod]
> [811824.868198]  target_tmr_work+0xa3/0xf0 [target_core_mod]
> [811824.868200]  process_one_work+0x1a8/0x340
> [811824.868201]  worker_thread+0x49/0x2f0
> [811824.868202]  ? rescuer_thread+0x350/0x350
> [811824.868202]  kthread+0x118/0x140
> [811824.868203]  ? __kthread_bind_mask+0x60/0x60
> [811824.868204]  ret_from_fork+0x1f/0x30
>
> I'm not sure how to recover from this state. Is it resolved upstream?

It's not. I think the safest thing is to revert the patch that caused this, commit db5b21a24e01d354 ("scsi: target/core: Use system workqueues for TMF"), because we don't currently have the resources to fix qla. Do you want to send the patch? If not, I can send it.
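For anyone preparing that patch, the mechanical step is a plain `git revert` of the commit named above. The sketch below only illustrates the workflow in a throwaway repository (the temp repo, file name, and identity settings are stand-ins, not part of the kernel tree); in an actual checkout the single equivalent command is `git revert db5b21a24e01d354`:

```shell
#!/bin/sh
# Illustration only: the git-revert workflow, demonstrated in a throwaway
# repository. In a real kernel tree the equivalent step would be:
#   git revert db5b21a24e01d354
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "base"
# Stand-in for the commit being reverted:
echo "use system workqueues for TMF" > tmf.txt
git add tmf.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "scsi: target/core: Use system workqueues for TMF"
# Revert it (HEAD here stands in for db5b21a24e01d354); --no-edit keeps
# the auto-generated 'Revert "..."' commit message:
git -c user.name=demo -c user.email=demo@example.com revert --no-edit HEAD
test ! -e tmf.txt          # the reverted change is gone again
git log --oneline -1       # shows the new Revert commit
```

The generated revert commit then gets a changelog explaining the hang (the traces above) before it is posted.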