Re: tcm_loop and aborted TMRs

Bodo Stroesser <bostroesser@xxxxxxxxx> · Thu, 1 Dec 2022 15:15:44 +0100

On 27.11.22 19:59, Mike Christie wrote:
On 11/25/22 3:34 AM, Dmitry Bogdanov wrote:
On Mon, Nov 21, 2022 at 01:25:55PM -0600, Mike Christie wrote:

On 11/21/22 7:35 AM, Dmitry Bogdanov wrote:

I will vote to your old patch. qla2xxx was fixed long time ago.

What is the qla fix? I think we still leak. In commit

commit 605e74025f953b995a3a241ead43bde71c1c99b5
Author: Mike Christie <michael.christie@xxxxxxxxxx>
Date:   Sun Nov 1 12:59:31 2020 -0600

     scsi: qla2xxx: Move sess cmd list/lock to driver

when I changed the locking I had added the check:

static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd)
{
         struct qla_tgt_cmd *cmd;
         unsigned long flags;

         if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB)
                 return;

Yes, I was thinking about that commit.
because tmrs are not on the sess_cmd_list that's accessed the
next line down. We don't crash as a result, but I think we need
to add code to send the cleanup command to the FW. Bodo and I
were working on that part, but someone with more qla experience
needed to work on it so it could be properly tested. We didn't
hear back from the qla engineers so progress had stalled.

Yes, you are right, FW expects some response on every ABORT IOCB
to clear its resources.

I can prepare the patch for qla2xxx.

That would be awesome.

But still, I do not see a sense of new Bodo's solution.

Drivers are crashing in the aborted_task callout. His idea was a
workaround/hack so we would call the queue_tm_rsp callout instead of
aborted_ask like we did before which would avoid the crashes but
allow us to keep the async behavior.

Not exactly. In case we are not able to fix a driver, I wanted to
change drivers's aborted_task callout to use its internal queue_tm_rsp
routines in case of aborted TMRs.
But that would work only if we changed the core to not call
target_put_sess_cmd, which again would lead to changes in currently
working drivers. Really a hack only.

To the initiator's it would work like before where it looks like a
race where the abort response is received right after it has sent
the lun reset. His assumption is that initiators handled that before
so it would continue to work.

If you fix qla, then we can easily fix loop and xen and we can do
the correct behavior and keep the async feature.

Are we sure qla, loop and xen are the only drivers that handle aborted
TMRs incorrectly?