On 9/15/21 9:17 AM, Dmitry Bogdanov wrote:
> Currently TMF commands are removed from se_device.dev_tmr_list at
> the very end of the se_cmd lifecycle. But se_lun is unlinked from se_cmd
> as soon as the command status (response) is queued in the transport layer.
> It means that the LUN and backend device can be deleted in the meantime,
> and at the moment of response completion a panic occurs:
>
> target_tmr_work()
>     cmd->se_tfo->queue_tm_rsp(cmd); // send abort_rsp to a wire
>     transport_lun_remove_cmd(cmd) // unlink se_cmd from se_lun
> - // - // - // -
> <<<--- lun remove
> <<<--- core backend device remove
> - // - // - // -
> qlt_handle_abts_completion()
>   tfo->free_mcmd()
>     transport_generic_free_cmd()
>       target_put_sess_cmd()
>         core_tmr_release_req() {
>           if (dev) { // backend device, can not be null
>             spin_lock_irqsave(&dev->se_tmr_lock, flags); //<<<--- CRASH
>
> Call Trace:
> NIP [c000000000e1683c] _raw_spin_lock_irqsave+0x2c/0xc0
> LR [c00800000e433338] core_tmr_release_req+0x40/0xa0 [target_core_mod]
> Call Trace:
> (unreliable)
> 0x0
> target_put_sess_cmd+0x2a0/0x370 [target_core_mod]
> transport_generic_free_cmd+0x6c/0x1b0 [target_core_mod]
> tcm_qla2xxx_complete_mcmd+0x28/0x50 [tcm_qla2xxx]
> process_one_work+0x2c4/0x5c0
> worker_thread+0x88/0x690
>
> For the FC protocol it is a race condition, but for the iSCSI protocol it
> is easily reproduced by manually sending iSCSI commands:
> - Send some SCSI command
> - Send Abort of that command over iSCSI
> - Remove LUN on target
> - Send next iSCSI command to acknowledge the Abort_Response
> - target panics
>
> There is no sense in keeping the command in tmr_list until response
> completion, so move the removal from tmr_list from the response
> completion to the response queueing, when the lun is unlinked.
> Move the removal from the state list too, as it is subject to the same
> race condition.
>
> Fixes: c66ac9db8d4a ("[SCSI] target: Add LIO target core v4.0.0-rc6")
> Reviewed-by: Roman Bolshakov <r.bolshakov@xxxxxxxxx>
> Signed-off-by: Dmitry Bogdanov <d.bogdanov@xxxxxxxxx>
>
> ---
> v3:
>  remove iscsi fix as not related to the issue
>  avoid double removal from tmr_list
> v2:
>  fix stuck in tmr list in error case
>
> The issue exists from the very beginning.
> I uploaded a scapy script that helps to reproduce the issue at
> https://gist.github.com/logost/cb93df41dd2432454324449b390403c4
> ---
>  drivers/target/target_core_tmr.c | 10 +--------
>  drivers/target/target_core_transport.c | 30 ++++++++++++++++++++------
>  2 files changed, 24 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c
> index e7fcbc09f9db..84ae2fe456ec 100644
> --- a/drivers/target/target_core_tmr.c
> +++ b/drivers/target/target_core_tmr.c
> @@ -50,15 +50,6 @@ EXPORT_SYMBOL(core_tmr_alloc_req);
>
>  void core_tmr_release_req(struct se_tmr_req *tmr)
>  {
> -	struct se_device *dev = tmr->tmr_dev;
> -	unsigned long flags;
> -
> -	if (dev) {
> -		spin_lock_irqsave(&dev->se_tmr_lock, flags);
> -		list_del_init(&tmr->tmr_list);
> -		spin_unlock_irqrestore(&dev->se_tmr_lock, flags);
> -	}
> -
>  	kfree(tmr);
>  }
>
> @@ -234,6 +225,7 @@ static void core_tmr_drain_tmr_list(
>  		}
>
>  		list_move_tail(&tmr_p->tmr_list, &drain_tmr_list);
> +		tmr_p->tmr_dev = NULL;

Is this patch now adding a way to hit:

if (!tmr->tmr_dev)
	WARN_ON_ONCE(transport_lookup_tmr_lun(tmr->task_cmd) < 0);

in core_tmr_abort_task?

You have the abort and lun reset works running on different CPUs. The
lun reset hits the above code first and clears tmr_dev. The abort then
hits the tmr->tmr_dev check and tries to do transport_lookup_tmr_lun.

For the case where the lun is not removed, it looks like
transport_lookup_tmr_lun will add the tmr to the dev_tmr_list, but it
would also still be on the drain_tmr_list above, so we would hit list
corruption.
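
To make the list corruption concern concrete, here is a minimal userspace sketch. It is not the kernel code: the list helpers are simplified stand-ins for the kernel's list_head API, locking is omitted, and double_add_corrupts_drain_list() is a hypothetical name that replays the interleaving described above (lun reset moves the tmr to the drain list, then the abort path adds it to dev_tmr_list again without removing it first).

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

static void list_move_tail(struct list_head *entry, struct list_head *head)
{
	list_del(entry);
	list_add_tail(entry, head);
}

/* A list is consistent if every forward link is mirrored by a back link. */
static bool list_is_consistent(const struct list_head *head)
{
	const struct list_head *pos = head;
	int steps = 0;

	do {
		if (pos->next->prev != pos)
			return false;
		pos = pos->next;
	} while (pos != head && ++steps < 1024);

	return pos == head;
}

/* Hypothetical replay; returns true if drain_tmr_list ends up corrupted. */
static bool double_add_corrupts_drain_list(void)
{
	struct list_head dev_tmr_list, drain_tmr_list, tmr;

	init_list_head(&dev_tmr_list);
	init_list_head(&drain_tmr_list);

	/* The abort TMR starts out linked on the device list. */
	list_add_tail(&tmr, &dev_tmr_list);

	/* LUN reset: move it to the private drain list (tmr_dev cleared here). */
	list_move_tail(&tmr, &drain_tmr_list);

	/* Abort: sees no tmr_dev, and the lookup adds the tmr again. */
	list_add_tail(&tmr, &dev_tmr_list);

	/* drain_tmr_list still points at tmr, but tmr no longer points back. */
	return !list_is_consistent(&drain_tmr_list);
}
```

Calling double_add_corrupts_drain_list() returns true in this model: the drain list head still references the tmr while the tmr's own links now belong to dev_tmr_list.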

For the case where the lun is getting removed, percpu_ref_tryget_live
would fail in transport_lookup_tmr_lun and we would hit the WARN_ON_ONCE.

I think, though, that with your patch we would be ok and don't want the
WARN_ON_ONCE, right? The lun reset would just wait for the abort. When
it completes, the abort and reset complete as expected.
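
The lun removal case can be sketched in the same spirit. This is a single-threaded model, not the kernel implementation: struct model_ref, model_ref_tryget_live(), model_ref_kill() and model_lookup_tmr_lun() are hypothetical names, and the -ENODEV return value is an assumption about what the real lookup reports. The only property it relies on is that a tryget_live operation fails once the reference has been killed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, single-threaded stand-in for a percpu_ref. */
struct model_ref {
	unsigned long count;
	bool dying;	/* set once the owner has started teardown */
};

/* Take a reference only while the ref has not been killed. */
static bool model_ref_tryget_live(struct model_ref *ref)
{
	if (ref->dying)
		return false;
	ref->count++;
	return true;
}

/* Mark the ref as dying; later tryget_live calls fail. */
static void model_ref_kill(struct model_ref *ref)
{
	ref->dying = true;
}

struct model_lun {
	struct model_ref lun_ref;
};

/*
 * Hypothetical lookup: 0 on success, -ENODEV (assumed) when the LUN is
 * already being removed. A WARN_ON_ONCE(lookup < 0) in the caller would
 * fire in the second case.
 */
static int model_lookup_tmr_lun(struct model_lun *lun)
{
	if (!model_ref_tryget_live(&lun->lun_ref))
		return -ENODEV;
	return 0;
}
```

In this model, a lookup before model_ref_kill() succeeds and takes a reference, while a lookup after it returns a negative value without touching the count, which is the condition the WARN_ON_ONCE would flag.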