RE: CmdSN greather than MaxCmdSN protocol error in LIO Iser

"Moussa Ba (moussaba)" <moussaba@xxxxxxxxxx> · Tue, 12 Nov 2013 18:18:59 +0000

> -----Original Message-----
> From: Mike Christie [mailto:michaelc@xxxxxxxxxxx]
> Sent: Tuesday, November 12, 2013 7:38 AM
> To: Nicholas A. Bellinger
> Cc: Moussa Ba (moussaba); target-devel@xxxxxxxxxxxxxxx; Nicholas
> Bellinger; Or Gerlitz; linux-scsi
> Subject: Re: CmdSN greather than MaxCmdSN protocol error in LIO Iser
> 
> On 11/11/13 10:34 PM, Nicholas A. Bellinger wrote:
> > On Mon, 2013-11-11 at 20:39 -0600, Michael Christie wrote:
> >>
> >>> On Nov 11, 2013, at 8:32 PM, Michael Christie
> <michaelc@xxxxxxxxxxx> wrote:
> >>>
> >
> > <SNIP>
> >
> >>>>>>> Mike, any idea as to how this might be happening on the
> initiator
> >>>>>>> side..?
> >>>>
> >>>> <SNIP>
> >>>>
> >>>> Mmmmm, just noticed a bit of very suspicious logic in open-iscsi.
> >>>> (CC'ing linux-scsi)
> >>>>
> >>>> Namely, how drivers/scsi/libiscsi.c:iscsi_data_xmit() attempts to
> drain
> >>>> conn->cmdqueue and hence call iscsi_prep_scsi_cmd_pdu() +
> increment
> >>>> session->cmdsn, without ever checking to see if the CmdSN window
> may
> >>>> have already closed..
> >>>>
> >>>> The only CmdSN window check that I see in the I/O dispatch
> codepath is
> >>>> in iscsi_queuecommand() -> iscsi_check_cmdsn_window_closed(), but
> once
> >>>> iscsi_conn_queue_work() is invoked and process context in
> >>>> iscsi_xmitworker() -> iscsi_data_xmit() starts to execute, there
> is
> >>>> AFAICT no further CmdSN window open checks to ensure the initiator
> does
> >>>> not overflow MaxCmdSN..
> >>>>
> >>>> This would very much line up with the type of bug that is being
> >>>> reported.  Before trying to determine what a possible fix might
> look
> >>>> like, can you try the following libiscsi patch to see if your able
> to
> >>>> hit either of the BUG's below..?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> --nab
> >>>>
> >>>> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
> >>>> index e399561..9106f63 100644
> >>>> --- a/drivers/scsi/libiscsi.c
> >>>> +++ b/drivers/scsi/libiscsi.c
> >>>> @@ -1483,6 +1483,12 @@ check_mgmt:
> >>>>                        fail_scsi_task(conn->task, DID_IMM_RETRY);
> >>>>                        continue;
> >>>>                }
> >>>> +
> >>>> +               if (iscsi_check_cmdsn_window_closed(conn)) {
> >>>> +                       printk("CmdSN window already closed in
> iscsi_data_xmit #1\n");
> >>>> +                       BUG();
> >>>> +               }
> >>>> +
> >>>>                rc = iscsi_prep_scsi_cmd_pdu(conn->task);
> >>>>                if (rc) {
> >>>>                        if (rc == -ENOMEM || rc == -EACCES) {
> >>>> @@ -1518,6 +1524,11 @@ check_mgmt:
> >>>>                if (iscsi_check_tmf_restrictions(task,
> ISCSI_OP_SCSI_DATA_OUT))
> >>>>                        break;
> >>>>
> >>>> +               if (iscsi_check_cmdsn_window_closed(conn)) {
> >>>> +                       printk("CmdSN window already closed in
> iscsi_data_xmit #2\n");
> >>>> +                       BUG();
> >>>> +               }
> >>>> +
> >>>>                conn->task = task;
> >>>>                list_del_init(&conn->task->running);
> >>>>                conn->task->state = ISCSI_TASK_RUNNING;
> >>>
> >>> If you hit a bug there then it is probably the target or if you are
> >> using the new libiscsi back/frwd lock patches it could be related to
> >> them.
> >
> > FYI, this is with v3.11 and RHEL 6.x open-iscsi code..
> >
> >> We should not need to check above because we never queue more cmds
> >> from the scsi layer than the window will allow at the time. If the
> >> window is 10, the queued_cmdsn check should make sure we have only
> >> accepted 10 commands into the iscsi layer so we should not need to
> >> recheck above.
> >>>
> >>> You should just enable libiscsi debug printk in
> >> iscsi_check_cmdsn_window so we can see the sn related values at the
> >> time of queuing. You should enable that printk whithout your patch.
> >>
> >> Actually that will not help because we always have the queued cmdsn
> >> lower than the max in that path. I would do your patch but then also
> >> print the queued cmdsn, cmdsn and also the maxcmdsn in the check
> >> function.
> >>
> >>
> >
> > Yes, notice the second test patch is doing the explicit check of
> > task->cmdsn vs. sess->max_cmdsn, instead of the one that checks
> > ->queued_cmdsn vs. ->max_cmdsn in iscsi_check_cmdsn_window_closed().
> >
> > Btw, I'm surmising the bug in question does not happen from the
> > iscsi_queuecommand() submission path, but rather from the
> > iscsi_update_cmdsn() -> __iscsi_update_cmdsn() completion path, where
> > iscsi_conn_queue_work() gets called when the CmdSN window was
> > previously closed, eg:
> >
> >          if (max_cmdsn != session->max_cmdsn &&
> >              !iscsi_sna_lt(max_cmdsn, session->max_cmdsn)) {
> >                  session->max_cmdsn = max_cmdsn;
> >                  /*
> >                   * if the window closed with IO queued, then kick
> the
> >                   * xmit thread
> >                   */
> >                  if (!list_empty(&session->leadconn->cmdqueue) ||
> >                      !list_empty(&session->leadconn->mgmtqueue))
> >                          iscsi_conn_queue_work(session->leadconn);
> >          }
> > }
> >
> > Once iscsi_conn_queue_work() is invoked here to start process context
> > execution of iscsi_xmitworker() -> iscsi_data_xmit() code, AFAICT
> there
> > is no logic in place within iscsi_data_xmit() to honor the last
> received
> > MaxCmdSN.
> >
> > Or to put it another way: what is preventing iscsi_data_xmit() from
> > completely draining both conn->cmdqueue + conn->requeue, even when
> the
> > CmdSN window has potentially been closed again..?
> >
> 
> If the window was not big enough we would have not accepted the cmd
> from
> scsi-ml in iscsi_queuecommand. We would have never put it on the list
> above then.
> 
> In the initiator we have:
> 
> session->queued_cmdsn - It will always be less than or equal to the
> max_cmdsn, but greater than or equal to session->cmdsn. Think of it
> like
> preallocating the sn we are going to give the cmd. Depending on
> ordering
> it might be different (3 instead of 4), but the final value it gets
> will
> be less than the max_cmdsn.
> 
> session->cmdsn - the sn we give the cmd when we actually put it on the
> wire. It is always less than or equal to queued_cmdsn.
> 
> So we have cmdsn <= queued_cmdsn <= max_cmdsn.
> 
> The iscsi_queuecommand window check makes sure we will only have put a
> cmd on the cmdqueue list if queued_cmdsn was less than or equal to the
> max_cmdsn. And cmdsn is less than or equal to queued_cmdsn, so we can
> run iscsi_data_xmit without checking the window values again, because
> it
> is not possible for something like max_cmdsn to decrease on us.

I am attaching error codes on the inititator that seem to coincide with the timeout errors.

[  261.855254]  connection6:0: pdu (op 0x65 itt 0x1) rejected. Reason code 0x4
[  261.855266]  connection6:0: pdu (op 0xa itt 0x1) rejected. Reason code 0x4
[  261.855268]  connection6:0: pdu (op 0x51 itt 0x1) rejected. Reason code 0x4

��.n��������+%������w��{.n�����{������ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f