Re: tcm_loop and aborted TMRs

Dmitry Bogdanov <d.bogdanov@xxxxxxxxx> · Mon, 21 Nov 2022 16:35:50 +0300

Hi Bodo,

On Sat, Nov 12, 2022 at 02:59:48PM +0100, Bodo Stroesser wrote:
> 
> Hello Mike, Maurizio,
> 
> Even if we couldn't yet find a method to fix handling of aborted
> TMRs in the core or in all fabric drivers, I still think that keeping
> the parallel handling of TMRs would be fine.
> 
> Tcmu offers a TMR notification mechanism to make userspace aware
> of ABORT or RESET_LUN. So userspace can try to break cmd handling
> and thus speed up TMR response. If we serialize TMR handling, then
> the notifications are also serialized and thus lose some of their
> power.
> 
> But maybe I have a new (?) idea of how to fix handling of aborted
> TMRs in fabric drivers:
> 1) Modify core to not call target_put_sess_cmd, no matter whether
>    SCF_ACK_REF is set.
> 2) Modify fabric drivers to handle an aborted TMR just like a
>    normal TMR response. This means, e.g. qla2xxx would send a
>    normal response for the Abort. This exactly is what happens
>    when serializing TMRs, because in that case despite of the
>    RESET_LUN the core always calls queue_tm_rsp callback instead
>    of aborted_task callback.
I am not sure for all initiators, but usually it sends LUN_RESET only
after ABORT has been timed out. There is no reason to send a response on
the ABORT that was actually aborted already internally on the initiator
side.

> So to initiators we would show the 'old' behavior, while internally
> keeping the parallel processing of TMRs.
> 
> If fabric driver maintainers don't like that approach, they can
> change their drivers to correctly kill aborted TMRs.
> 
> What do you think?
> 
I will vote to your old patch. qla2xxx was fixed long time ago. Other
fabric drivers have not that issue too. Only tcm_loop and xen still do
not adapted for parallel TMRs.

I think it's a not good idea to revert 2 years old patch. It was a
reason for some other patches (that fixes issues thanks to parallel TMR
handling).

> 
> On 11.11.22 22:18, Mike Christie wrote:
> > On 11/11/22 10:20 AM, Maurizio Lombardi wrote:
> > > Hello Bodo, Mike,
> > > 
> > > Some of our customers reported that the tcm_loop module is unable
> > > to handle aborted TMRs, resulting in kernel hangs.
> > > 
> > > I noticed that Bodo submitted a patch some time ago (our customers
> > > confirm it works),
> > > Mike instead proposed to revert
> > > commit db5b21a24e01d354  "scsi: target/core: Use system workqueues for TMF".
> > > 
> > > The discussion unfortunately died out without reaching a conclusion.
> > > 
> > > Personally, I think that if the handling of aborted TMRs was working
> > > before the "Use system workqueues" commit then this should be
> > > considered as a regression and the commit should be reverted.
> > > 
> > 
> > I'm fine with reverting it because multiple drivers are affected. I had
> > talked to Bart offlist back then and he was also ok since we couldn't
> > fix up the drivers.
> > 
> > I think Bodo and I tried to convert qla, but it was difficult without
> > marvell's help (we both pinged them but didn't hear back) because from
> > what I could tell we needed to send some hw/fw commands to perform cleanup
> > to fully handle that case.