On 1/23/23 3:53 AM, Sagi Grimberg wrote: > > > On 1/18/23 04:14, Mike Christie wrote: >> On 1/13/23 08:08, Dmitry Bogdanov wrote: >>> On Wed, Jan 11, 2023 at 09:08:28PM -0600, Mike Christie wrote: >>>> >>>> If we get a disconnect event while logging in we can end up in a state >>>> where will never be able to relogin. This happens when: >>>> >>>> 1. login thread has put us into TARG_CONN_STATE_IN_LOGIN >>>> 2. isert then does >>>> >>>> isert_disconnected_handler -> iscsit_cause_connection_reinstatement >>>> >>>> which sets the conn connection_reinstatement flag. Nothing else happens >>>> because we are only in IN_LOGIN. The tx/rx threads are not running yet >>>> so we can't start recovery from those contexts at this time. >>>> >>>> 3. The login thread finishes processing the login pdu and thinks login is >>>> done. It sets us into TARG_CONN_STATE_LOGGED_IN/TARG_SESS_STATE_LOGGED_IN. >>>> This starts the rx/tx threads. >>>> >>>> 4. The initiator thought it disconnected the connection at 2, and has >>>> since sent a new connect which is now handled. This leads us to eventually >>>> run: >>>> >>>> iscsi_check_for_session_reinstatement-> iscsit_stop_session -> >>>> iscsit_cause_connection_reinstatement >>>> >>>> iscsit_stop_session sees the old conn and does >>>> iscsit_cause_connection_reinstatement which sees connection_reinstatement >>>> is set so it just returns instead of trying to kill the tx/rx threads >>>> which would have caused recovery to start. >>>> >>>> 5. iscsit_stop_session then waits on session_wait_comp which will never >>>> happen since we didn't kill the tx/rx threads. >>>> >>>> This has the iscsit login code check if a fabric driver ran >>>> iscsit_cause_connection_reinstatement during the login process similar >>>> to how we check for the sk state for tcp based iscsit. This will prevent >>>> us from getting into 3 and creating a ghost session. >>>> >>>> Signed-off-by: Mike Christie <michael.christie@xxxxxxxxxx> >>>> --- >>>> drivers/target/iscsi/iscsi_target_nego.c | 10 ++++++++++ >>>> 1 file changed, 10 insertions(+) >>>> >>>> diff --git a/drivers/target/iscsi/iscsi_target_nego.c b/drivers/target/iscsi/iscsi_target_nego.c >>>> index ff49c8f3fe24..2dd81752d4c9 100644 >>>> --- a/drivers/target/iscsi/iscsi_target_nego.c >>>> +++ b/drivers/target/iscsi/iscsi_target_nego.c >>>> @@ -350,6 +350,16 @@ static int iscsi_target_do_tx_login_io(struct iscsit_conn *conn, struct iscsi_lo >>>> ISCSI_LOGIN_STATUS_NO_RESOURCES); >>>> return -1; >>>> } >>>> + >>>> + /* >>>> + * isert doesn't know the iscsit state and uses >>>> + * iscsit_cause_connection_reinstatement as a generic error >>>> + * notification system. It may call it before we are in FFP. >>>> + * Handle this now in case it signaled a failure before the >>>> + * rx/tx threads were up and could start recovery. >>>> + */ >>>> + if (atomic_read(&conn->connection_reinstatement)) >>>> + goto err; >>> >>> Why only for login->login_complete case? In other case the session will >>> not hang? Will it be droppped on login timeout or something else? >> >> It will not hang. If you hit an error with isert before we think we can go into >> FFP then the login timeout currently cleans up the conn and session. >> >>> >>> May be the root cause is point 2 itself - calling iscsit_cause_connection_reinstatement >>> in not ISER_CONN_FULL_FEATURE state where there are no TX_RX threads? >>> I mean that was a misuse of iscsit_cause_connection_reinstatement? >> >> Let me drop this patch for now. After writing the response above about normally it just >> times out, and thinking about your question here I think to really fix this we want to >> fully integrate isert login into iscsit. >> >> The root problem is that isert login is not integrated into iscsit, so there is really >> no error handling. iscsit_cause_connection_reinstatement was supposed to do it, but it >> doesn't do anything. >> >> So we need to separate the LOGIN_FLAGS_CLOSED tests and setting from the iscsi_target_sk* >> code. Add a helper to set that bit and do some state checks, and make the checks generic >> in the login code (not tied to having a socket). Convert iscsit and isert to use the new >> helper. Then handle the LOGIN_FLAGS_READ_ACTIVE/LOGIN_FLAGS_WRITE_ACTIVE and login_work >> stuff. Then review cxgb? >> >> I'll do that later after fixing the command cleanup stuff in this patchsdet and the >> other patchsets I have outstanding. > > Perhaps the fix is as simple as checking the iscsi state to be also full-featured and if not, do a simple error handling within > iser_disconnected_handler? Yeah that's what I'm thinking above. The catch was what is the simple error handler going to be. For iscsit tcp it mostly just sets LOGIN_FLAGS_CLOSED to signal the failure. So I was saying above if we could just expose that to isert, then all it has to do is call some helper iscsit helper: isert_disconnected_handler ... case ISER_CONN_BOUND: iscsit_login_failed(); break; case ISER_CONN_FULL_FEATURE iscsit_cause_connection_reinstatement(isert_conn->conn, 0); *I think isert_login_send_done would also need a change. The iscsit code then would work similarly for both isert and iscsit tcp. There's then just cxgb which I think just times out like isert. The other option is for the simple case is just let it timeout like cxgb. And for isert just do nothing: case ISER_CONN_BOUND: isert_err("Failure on unbound connection. Timing out login\n"); break; case ISER_CONN_FULL_FEATURE iscsit_cause_connection_reinstatement(isert_conn->conn, 0); break;