On Mon, 2015-02-16 at 20:56 +0200, Sagi Grimberg wrote: > On 2/12/2015 10:47 AM, Nicholas A. Bellinger wrote: > > On Wed, 2015-02-11 at 10:17 +0200, Sagi Grimberg wrote: > >> Hey Nic, > >> > >> So Our QA guys recently stepped on this bug when performing stress > >> login-logout from a single initiator to 10 targets each exposed over > >> 4 portals, so overall 40 sessions (needless to say we are talking on > >> iser...). So there are lots of logins in parallel with lots of logouts. > >> > >> It seems that the connection termination causes iscsi_tx_thread to > >> access the connection after it is freed or something (list corruption > >> probably coming from iscsit_handle_immediate_queue or > >> iscsit_handle_response_queue, and NULL deref coming from > >> iscsit_take_action_for_connection_exit). > >> > >> Note, isert_wait_conn waits for session commands and QP flush which is > >> normally pretty fast, the conn termination is done in a work that waits > >> for DISCONNECTED event which might take longer (which is why we do it > >> outside wait_conn context to avoid blocking it). > >> > >> I didn't get too far with this until now, do you have any idea on what > >> might have happened? > > > > Mmm, it looks like iscsit_take_action_for_connection_exit() in TX thread > > context is calling iscsi_close_connection() after hitting the following > > check in iscsi_target_erl0.c: > > > > if (conn->conn_state == TARG_CONN_STATE_IN_LOGOUT) { > > spin_unlock_bh(&conn->state_lock); > > iscsit_close_connection(conn); > > return; > > } > > > > .. once iscsi_close_connection() has already being called earlier by > > iser-target code. > > > > AFAICT, this check is specific to traditional iscsi-target during the > > iscsit_handle_logout_cmd() failure case in iscsi_target_rx_opcode(). > > > > Here's a quick patch to make this case ISCSI_TCP only. > > > > Thanks Sagi. > > > > --nab > > > > diff --git a/drivers/target/iscsi/iscsi_target_erl0.c b/drivers/target/iscsi/iscsi_target_erl0.c > > index bdd8731..1c197ba 100644 > > --- a/drivers/target/iscsi/iscsi_target_erl0.c > > +++ b/drivers/target/iscsi/iscsi_target_erl0.c > > @@ -22,6 +22,7 @@ > > #include <target/target_core_fabric.h> > > > > #include <target/iscsi/iscsi_target_core.h> > > +#include <target/iscsi/iscsi_transport.h> > > #include "iscsi_target_seq_pdu_list.h" > > #include "iscsi_target_tq.h" > > #include "iscsi_target_erl0.h" > > @@ -939,7 +940,8 @@ void iscsit_take_action_for_connection_exit(struct iscsi_conn *conn) > > > > if (conn->conn_state == TARG_CONN_STATE_IN_LOGOUT) { > > spin_unlock_bh(&conn->state_lock); > > - iscsit_close_connection(conn); > > + if (conn->conn_transport->transport_type == ISCSI_TCP) > > + iscsit_close_connection(conn); > > return; > > } > > > > > Don't seem to hit a NULL deref with this one (although I'm not sure what > is the root cause here - might pop up again). > > But I do see iscsi_trx thread hang since we can't seem to deallocate > it... kinda hard to understand all the completions/signals. > > kernel: INFO: task iscsi_trx:18456 blocked for more than 120 seconds. > kernel: Tainted: G OE 3.18.0-rc2+ #28 > kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > kernel: iscsi_trx D 0000000000000003 0 18456 2 0x00000000 > kernel: ffff880444297c48 0000000000000046 ffff880444297c18 ffff880444294010 > kernel: 0000000000011b00 0000000000011b00 ffff880441d88d50 ffff88046fa60190 > kernel: ffff880444297c28 ffff8800747f3f00 7fffffffffffffff ffff880444297d08 > kernel: Call Trace: > kernel: [<ffffffff815332b4>] schedule+0x24/0x70 > kernel: [<ffffffff815359c5>] schedule_timeout+0x1a5/0x210 > kernel: [<ffffffff810705f6>] ? ttwu_queue+0xe6/0x100 > kernel: [<ffffffff81074821>] ? try_to_wake_up+0x1a1/0x230 > kernel: [<ffffffff8107709b>] ? sched_clock_cpu+0xab/0xc0 > kernel: [<ffffffff81534396>] wait_for_completion+0xc6/0x100 > kernel: [<ffffffff810748b0>] ? try_to_wake_up+0x230/0x230 > kernel: [<ffffffffa0367460>] ? iscsi_target_rx_opcode+0x2b0/0x2b0 [iscsi_target_mod] > kernel: [<ffffffffa0367460>] ? iscsi_target_rx_opcode+0x2b0/0x2b0 [iscsi_target_mod] > kernel: [<ffffffff81069d89>] kthread_stop+0x49/0x80 > kernel: [<ffffffffa034fd56>] iscsi_deallocate_thread_one+0x76/0xc0 [iscsi_target_mod] > kernel: [<ffffffffa034fdcb>] iscsi_deallocate_extra_thread_sets+0x2b/0x80 [iscsi_target_mod] > kernel: [<ffffffffa0350845>] iscsi_rx_thread_pre_handler+0x135/0x170 [iscsi_target_mod] > kernel: [<ffffffffa03674d4>] iscsi_target_rx_thread+0x74/0x310 [iscsi_target_mod] > kernel: [<ffffffff81532e85>] ? __schedule+0x355/0x650 > kernel: [<ffffffffa0367460>] ? iscsi_target_rx_opcode+0x2b0/0x2b0 [iscsi_target_mod] > kernel: [<ffffffff8106994e>] kthread+0xce/0xf0 > kernel: [<ffffffff81069880>] ? kthread_freezable_should_stop+0x70/0x70 > kernel: [<ffffffff815369ac>] ret_from_fork+0x7c/0xb0 > kernel: [<ffffffff81069880>] ? kthread_freezable_should_stop+0x70/0x70 Ok, I think connection failure is getting triggered early via isert_cq_comp_err() -> iscsit_cause_connection_reinstatement() before the successful logout response completion interrupt can occur, which currently means isert_do_control_comp() -> iscsit_logout_post_handler() never actually gets invoked to finish the logout session operation. iscsit_logout_post_handler() and subsequent logic based upon the specific cmd->logout_reason type is responsible for calling iscsi_set_thread_clear() and iscsi_set_thread_set_signal() to clean up the remaining iscsi_ttx thread_set context. This corresponds with the hung task warning above, because isert_send_completion() -> isert_do_control_comp() -> iscsi_logout_post_handler() is never called for the last posted logout response after early connection failure -> rdma_disconnect() has already occurred. The following patch adds the special case in isert_cq_comp_err() to invoke isert_response_completion() if ISTATE_SEND_LOGOUTRSP and TARG_CONN_STATE_IN_LOGOUT are true. Please test. Thank you, --nab diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c index 075b19c..ff0d436 100644 --- a/drivers/infiniband/ulp/isert/ib_isert.c +++ b/drivers/infiniband/ulp/isert/ib_isert.c @@ -1996,10 +1996,20 @@ isert_cq_comp_err(struct isert_conn *isert_conn, struct ib_wc *wc) desc = (struct iser_tx_desc *)(uintptr_t)wc->wr_id; isert_cmd = desc->isert_cmd; - if (!isert_cmd) + if (!isert_cmd) { isert_unmap_tx_desc(desc, ib_dev); - else - isert_completion_put(desc, isert_cmd, ib_dev, true); + } else { + struct isert_device *device = isert_conn->conn_device; + struct iscsi_conn *conn = isert_conn->conn; + struct iscsi_cmd *cmd = isert_cmd->iscsi_cmd; + + if (cmd->i_state == ISTATE_SEND_LOGOUTRSP && + conn->conn_state == TARG_CONN_STATE_IN_LOGOUT) + isert_response_completion(desc, isert_cmd, isert_conn, + device->ib_device); + else + isert_completion_put(desc, isert_cmd, ib_dev, true); + } } else { isert_conn->post_recv_buf_count--; if (!isert_conn->post_recv_buf_count) -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html