On 12.11.22 22:46, michael.christie@xxxxxxxxxx wrote:
On 11/12/22 7:59 AM, Bodo Stroesser wrote:
Hello Mike, Maurizio,
Even though we haven't yet found a method to fix the handling of aborted
TMRs in the core or in all fabric drivers, I still think that keeping
the parallel handling of TMRs would be fine.
Tcmu offers a TMR notification mechanism to make userspace aware
of an ABORT or RESET_LUN. Userspace can then try to break off cmd
handling and thus speed up the TMR response. If we serialize TMR
handling, the notifications are also serialized and thus lose some
of their power.
But maybe I have a new (?) idea of how to fix handling of aborted
TMRs in fabric drivers:
1) Modify the core to not call target_put_sess_cmd, no matter whether
   SCF_ACK_KREF is set.
2) Modify fabric drivers to handle an aborted TMR just like a
   normal TMR response. This means that, e.g., qla2xxx would send a
   normal response for the Abort. This is exactly what happens
   when serializing TMRs, because in that case, despite the
   RESET_LUN, the core always calls the queue_tm_rsp callback
   instead of the aborted_task callback.
So initiators would see the 'old' behavior, while internally we would
keep the parallel processing of TMRs.
If fabric driver maintainers don't like that approach, they can
change their drivers to correctly kill aborted TMRs.
What do you think?
I'm fine with doing it in parallel. However, the issue is that we have
real users hitting this now, and we have to fix all the drivers because
it's a regression. So if your idea is going to take a while, then we
should revert now and do your idea whenever it's ready.
I agree.
Even though my old patch fixes the issue for tcm_loop users, it does
not make sense to apply it, since the new idea would lead to reverting
parts of it. And with the patch we would still risk users running into
trouble with fabrics other than tcm_loop.