Hi Mike, > > @@ -234,6 +225,7 @@ static void core_tmr_drain_tmr_list( > > } > > > > list_move_tail(&tmr_p->tmr_list, &drain_tmr_list); > > + tmr_p->tmr_dev = NULL; > > Is this patch now adding a way to hit: > > if (!tmr->tmr_dev) > WARN_ON_ONCE(transport_lookup_tmr_lun(tmr->task_cmd) < 0); > > in core_tmr_abort_task? > > You have the abort and lun reset works running on different CPUs. > The lun reset hits the above code first and clears tmr_dev. > The abort then hits the tmr->tmr_dev check and tries to do > transport_lookup_tmr_lun. > > For the case where the lun is not removed, it looks like > transport_lookup_tmr_lun will add the tmr to the dev_tmr_list > but it would also be on the drain_tmr_list above so we would > hit list corruption. Yes, there is a such race. I think, I can solve it by changing the order of draining the tmr_list and state_list at LUN Reset to make the raced lines be under the same lock. Especially SAM-5 describes(but does not require) aborting commands before tmfs: | When responding to a logical unit reset condition, the logical unit shall: | a) abort all commands as described in 5.6; | b) abort all copy operations (see SPC-4); | c) terminate all task management functions; > For the case where the lun is getting removed, percpu_ref_tryget_live > would fail in transport_lookup_tmr_lun and we hit the WARN_ON_ONCE. > I think though with your patch, we would be ok and don't want > the WARN_ON_ONCE, right? The lun reset would just wait for the > abort. When it completes the abort and reset complete as expected. I don’t understand the meaning of that transport_lookup_tmr_lun there. Every TMF Abort has already executed transport_lookup_tmr_lun at the very beginning of its handling. Eliminating the race will eliminate the impact of my patch on this case too. BR, Dmitry