Re: BUG in stress login-logout to multiple IQNs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, 2015-02-22 at 18:36 +0200, Sagi Grimberg wrote:
> On 2/21/2015 9:54 AM, Nicholas A. Bellinger wrote:
> > Hi Sagi,
> >
> > Apologies for the delayed response..  Comments below.
> >
> > On Sat, 2015-02-14 at 12:21 +0200, Sagi Grimberg wrote:
> >> On 2/12/2015 10:47 AM, Nicholas A. Bellinger wrote:
> >>> On Wed, 2015-02-11 at 10:17 +0200, Sagi Grimberg wrote:
> >>>> Hey Nic,
> >>>>
> >>>> So Our QA guys recently stepped on this bug when performing stress
> >>>> login-logout from a single initiator to 10 targets each exposed over
> >>>> 4 portals, so overall 40 sessions (needless to say we are talking on
> >>>> iser...). So there are lots of logins in parallel with lots of logouts.
> >>>>
> >>>> It seems that the connection termination causes iscsi_tx_thread to
> >>>> access the connection after it is freed or something (list corruption
> >>>> probably coming from iscsit_handle_immediate_queue or
> >>>> iscsit_handle_response_queue, and NULL deref coming from
> >>>> iscsit_take_action_for_connection_exit).
> >>>>
> >>>> Note, isert_wait_conn waits for session commands and QP flush which is
> >>>> normally pretty fast, the conn termination is done in a work that waits
> >>>> for DISCONNECTED event which might take longer (which is why we do it
> >>>> outside wait_conn context to avoid blocking it).
> >>>>
> >>>> I didn't get too far with this until now, do you have any idea on what
> >>>> might have happened?
> >>>
> >>> Mmm, it looks like iscsit_take_action_for_connection_exit() in TX thread
> >>> context is calling iscsi_close_connection() after hitting the following
> >>> check in iscsi_target_erl0.c:
> >>>
> >>>           if (conn->conn_state == TARG_CONN_STATE_IN_LOGOUT) {
> >>>                   spin_unlock_bh(&conn->state_lock);
> >>>                   iscsit_close_connection(conn);
> >>>                   return;
> >>>           }
> >>>
> >>> .. once iscsi_close_connection() has already being called earlier by
> >>> iser-target code.
> >>
> >> Not sure I understand where iscsit_close_connection is called earlier
> >> by iser target. The iser code usually only notifies any problems to the
> >> iscsi layer to do it's thing.
> >>
> >> Care to explain how iscsit_close_connection might be called twice?
> >>
> >
> > It appears iscsit_close_connection() is getting invoked first from
> > iscsi_trx context after isert_cq_comp_err() has previously called
> > iscsit_cause_connection_reinstatement() to force a connection failure to
> > occur during explicit logout + ISCSI_LOGOUT_REASON_CLOSE_SESSION
> > operation.
> 
> Yes, that is correct.
> 
> >
> > You can tell because isert_wait_conn() + isert_wait4cmds() debug output
> > appears before list_del corruption in iscsi_ttx context, which can only
> > be invoked via iscsit_close_connection() -> transport->wait_for_conn()
> > -> isert_wait_conn().
> 
> Given the fact that there are lots of connections tearing down in
> parallel, I wouldn't rely on the proximity of these debug prints to
> the stack trace.
> 
> >
> > Once iscsi_ttx context runs, it's hitting the TARG_CONN_STATE_IN_LOGOUT
> > state check in iscsit_take_action_for_connection_exit() and re-invokes
> > iscsit_close_connection(), after iscsit_logout_closesession() from
> > isert_rx_completion() context handles REASON_CLOSE_SESSION and changed
> > connection state to IN_LOGOUT, but before the logout response was posted
> > and successfully completed in isert_do_control_comp().
> 
> iscsit_take_action_for_connection_exit() is invoked both by RX/TX
> threads. But only one should get to iscsit_close_connection() since
> conn->connection_exit is set under conn->state_lock. I'd say that if
> iscsit_close_connection() was invoked twice, the bug is in
> iscsit_take_action_for_connection_exit() isn't it?
> 

Sorry, yes.

After looking at this further, I think the previous isert_cq_comp_err()
patch still makes sense for the special logout response failure case,
but as you've noted it does not address root cause of the original
OOPsen.

I'm now thinking it's related to complete(conn->conn_logout_comp)
happening the start of iscsit_close_connection() (as originally intended
for non-iser logout response failure case), that is causing
isert_wait4logout() to immediately complete instead of allowing
iscsit_logout_post_handler() to perform complete(conn->conn_logout_comp)
after completion interrupt -> isert_do_control_comp() happens.

This could result in iscsit_release_commands_from_conn() corrupting
conn_cmd_list list when attempting to release the logout response
before/during iser logout response completion interrupt handling.

Here's a quick patch to test the theory.

--nab

diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
index 50bad55..ddbd022 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -4256,11 +4256,12 @@ int iscsit_close_connection(
        pr_debug("Closing iSCSI connection CID %hu on SID:"
                " %u\n", conn->cid, sess->sid);
        /*
-        * Always up conn_logout_comp just in case the RX Thread is sleeping
-        * and the logout response never got sent because the connection
-        * failed.
+        * Always up conn_logout_comp for the traditional TCP case just in case
+        * the RX Thread in iscsi_target_rx_opcode() is sleeping and the logout
+        * response never got sent because the connection failed.
         */
-       complete(&conn->conn_logout_comp);
+       if (conn->conn_transport->transport_type == ISCSI_TCP)
+               complete(&conn->conn_logout_comp);
 
        iscsi_release_thread_set(conn);

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux