Hi Bodo, On Sat, Nov 12, 2022 at 02:59:48PM +0100, Bodo Stroesser wrote: > > Hello Mike, Maurizio, > > Even if we couldn't yet find a method to fix handling of aborted > TMRs in the core or in all fabric drivers, I still think that keeping > the parallel handling of TMRs would be fine. > > Tcmu offers a TMR notification mechanism to make userspace aware > of ABORT or RESET_LUN. So userspace can try to break cmd handling > and thus speed up TMR response. If we serialize TMR handling, then > the notifications are also serialized and thus lose some of their > power. > > But maybe I have a new (?) idea of how to fix handling of aborted > TMRs in fabric drivers: > 1) Modify core to not call target_put_sess_cmd, no matter whether > SCF_ACK_REF is set. > 2) Modify fabric drivers to handle an aborted TMR just like a > normal TMR response. This means, e.g. qla2xxx would send a > normal response for the Abort. This exactly is what happens > when serializing TMRs, because in that case despite of the > RESET_LUN the core always calls queue_tm_rsp callback instead > of aborted_task callback. I am not sure for all initiators, but usually it sends LUN_RESET only after ABORT has been timed out. There is no reason to send a response on the ABORT that was actually aborted already internally on the initiator side. > So to initiators we would show the 'old' behavior, while internally > keeping the parallel processing of TMRs. > > If fabric driver maintainers don't like that approach, they can > change their drivers to correctly kill aborted TMRs. > > What do you think? > I will vote to your old patch. qla2xxx was fixed long time ago. Other fabric drivers have not that issue too. Only tcm_loop and xen still do not adapted for parallel TMRs. I think it's a not good idea to revert 2 years old patch. It was a reason for some other patches (that fixes issues thanks to parallel TMR handling). > > On 11.11.22 22:18, Mike Christie wrote: > > On 11/11/22 10:20 AM, Maurizio Lombardi wrote: > > > Hello Bodo, Mike, > > > > > > Some of our customers reported that the tcm_loop module is unable > > > to handle aborted TMRs, resulting in kernel hangs. > > > > > > I noticed that Bodo submitted a patch some time ago (our customers > > > confirm it works), > > > Mike instead proposed to revert > > > commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF". > > > > > > The discussion unfortunately died out without reaching a conclusion. > > > > > > Personally, I think that if the handling of aborted TMRs was working > > > before the "Use system workqueues" commit then this should be > > > considered as a regression and the commit should be reverted. > > > > > > > I'm fine with reverting it because multiple drivers are affected. I had > > talked to Bart offlist back then and he was also ok since we couldn't > > fix up the drivers. > > > > I think Bodo and I tried to convert qla, but it was difficult without > > marvell's help (we both pinged them but didn't hear back) because from > > what I could tell we needed to send some hw/fw commands to perform cleanup > > to fully handle that case.