On 4/9/21 12:18 PM, Mike Christie wrote: > On 4/7/21 9:21 AM, Dmitry Bogdanov wrote: >> Currently TMF commands are removed from de_device.dev_tmf_list at >> the very end of se_cmd lifecycle. But se_lun unlinks from se_cmd >> up on a command status (response) is queued in transport layer. >> It means that LUN and backend device can be deleted meantime and at >> the moment of repsonse completion a panic is occured: >> >> target_tmr_work() >> cmd->se_tfo->queue_tm_rsp(cmd); // send abort_rsp to a wire >> transport_lun_remove_cmd(cmd) // unlink se_cmd from se_lun >> — // — // — // — >> <<<--- lun remove >> <<<--- core backend device remove >> — // — // — // — >> qlt_handle_abts_completion() >> tfo->free_mcmd() >> transport_generic_free_cmd() >> target_put_sess_cmd() >> core_tmr_release_req() { >> if (dev) { // backend device, can not be null >> spin_lock_irqsave(&dev->se_tmr_lock, flags); //<<<--- CRASH >> >> Call Trace: >> NIP [c000000000e1683c] _raw_spin_lock_irqsave+0x2c/0xc0 >> LR [c00800000e433338] core_tmr_release_req+0x40/0xa0 [target_core_mod] >> Call Trace: >> (unreliable) >> 0x0 >> target_put_sess_cmd+0x2a0/0x370 [target_core_mod] >> transport_generic_free_cmd+0x6c/0x1b0 [target_core_mod] >> tcm_qla2xxx_complete_mcmd+0x28/0x50 [tcm_qla2xxx] >> process_one_work+0x2c4/0x5c0 >> worker_thread+0x88/0x690 >> >> For FC protocol it is a race condition, but for iSCSI protocol it is >> easyly reproduced by manual sending iSCSI commands: >> - Send some SCSI sommand >> - Send Abort of that command over iSCSI >> - Remove LUN on target >> - Send next iSCSI command to acknowledge the Abort_Response >> - target panics >> >> There is no sense to keep the command in tmr_list until response >> completion, so move the removal from tmr_list from the response >> completion to the response queueing when lun is unlinked. >> Move the removal from state list too as it is a subject to the same >> race condition. >> >> Fixes: c66ac9db8d4a ("[SCSI] target: Add LIO target core v4.0.0-rc6") >> Reviewed-by: Roman Bolshakov <r.bolshakov@xxxxxxxxx> >> Signed-off-by: Dmitry Bogdanov <d.bogdanov@xxxxxxxxx> >> >> --- >> The issue exists from the very begining. >> I uploaded a script that helps to reproduce the issue at >> https://urldefense.com/v3/__https://gist.github.com/logost/cb93df41dd2432454324449b390403c4__;!!GqivPVa7Brio!KN-N7Ult7Itn2AzY6LdP2vm0D81UIjpbCyOAH3uf2OoLGUykpGP3dJTQlLm9m71MjsxY$ >> --- >> drivers/target/target_core_tmr.c | 9 --------- >> drivers/target/target_core_transport.c | 19 +++++++++++++++++-- >> 2 files changed, 17 insertions(+), 11 deletions(-) >> >> diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c >> index 7347285471fa..323a173010c1 100644 >> --- a/drivers/target/target_core_tmr.c >> +++ b/drivers/target/target_core_tmr.c >> @@ -50,15 +50,6 @@ EXPORT_SYMBOL(core_tmr_alloc_req); >> >> void core_tmr_release_req(struct se_tmr_req *tmr) >> { >> - struct se_device *dev = tmr->tmr_dev; >> - unsigned long flags; >> - >> - if (dev) { >> - spin_lock_irqsave(&dev->se_tmr_lock, flags); >> - list_del_init(&tmr->tmr_list); >> - spin_unlock_irqrestore(&dev->se_tmr_lock, flags); >> - } >> - >> kfree(tmr); >> } >> >> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c >> index 5ecb9f18a53d..4d9968f43cf1 100644 >> --- a/drivers/target/target_core_transport.c >> +++ b/drivers/target/target_core_transport.c >> @@ -667,6 +667,20 @@ static void target_remove_from_state_list(struct se_cmd *cmd) >> spin_unlock_irqrestore(&dev->queues[cmd->cpuid].lock, flags); >> } >> >> +static void target_remove_from_tmr_list(struct se_cmd *cmd) >> +{ >> + struct se_device *dev = NULL; >> + unsigned long flags; >> + > > This is just a nit. Maybe just do: > > struct se_device *dev = NULL; > unsigned long flags; > > if (!(cmd->se_cmd_flags & SCF_SCSI_TMR_CDB)) > return; > > dev = cmd->se_tmr_req->tmr_dev; > spin_lock_irqsave(&dev->se_tmr_lock, flags); > list_del_init(&cmd->se_tmr_req->tmr_list); > spin_unlock_irqrestore(&dev->se_tmr_lock, flags); > This might be wrong. I thought when you moved the deletion to transport_cmd_check_stop_to_fabric, we would always have a dev set. But in core_tmr_abort_task there is that transport_lookup_tmr_lun check. If that is a valid check in there and it could be NULL in the path: transport_generic_handle_tmr -> target_handle_abort -> ransport_cmd_check_stop_to_fabric then we would hit a NULL pointer. I'm not sure how we would get a NULL dev there though. The driver would have to not use the standard target_submit_tmr or be iscsi and not call transport_lookup_tmr_lun. One issue with the patch though is if iscsit_tmr_abort_task fails then we don't call transport_cmd_check_stop_to_fabric, so the tmr will be stuck on the list.