Hello Mike, Maurizio,
Even if we couldn't yet find a method to fix handling of aborted
TMRs in the core or in all fabric drivers, I still think that keeping
the parallel handling of TMRs would be fine.
Tcmu offers a TMR notification mechanism to make userspace aware
of ABORT or RESET_LUN. So userspace can try to break cmd handling
and thus speed up TMR response. If we serialize TMR handling, then
the notifications are also serialized and thus lose some of their
power.
But maybe I have a new (?) idea of how to fix handling of aborted
TMRs in fabric drivers:
1) Modify core to not call target_put_sess_cmd, no matter whether
SCF_ACK_REF is set.
2) Modify fabric drivers to handle an aborted TMR just like a
normal TMR response. This means, e.g. qla2xxx would send a
normal response for the Abort. This exactly is what happens
when serializing TMRs, because in that case despite of the
RESET_LUN the core always calls queue_tm_rsp callback instead
of aborted_task callback.
So to initiators we would show the 'old' behavior, while internally
keeping the parallel processing of TMRs.
If fabric driver maintainers don't like that approach, they can
change their drivers to correctly kill aborted TMRs.
What do you think?
Bodo
On 11.11.22 22:18, Mike Christie wrote:
On 11/11/22 10:20 AM, Maurizio Lombardi wrote:
Hello Bodo, Mike,
Some of our customers reported that the tcm_loop module is unable
to handle aborted TMRs, resulting in kernel hangs.
I noticed that Bodo submitted a patch some time ago (our customers
confirm it works),
Mike instead proposed to revert
commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF".
The discussion unfortunately died out without reaching a conclusion.
Personally, I think that if the handling of aborted TMRs was working
before the "Use system workqueues" commit then this should be
considered as a regression and the commit should be reverted.
I'm fine with reverting it because multiple drivers are affected. I had
talked to Bart offlist back then and he was also ok since we couldn't
fix up the drivers.
I think Bodo and I tried to convert qla, but it was difficult without
marvell's help (we both pinged them but didn't hear back) because from
what I could tell we needed to send some hw/fw commands to perform cleanup
to fully handle that case.