On 03/31/2017 01:00 AM, Nicholas A. Bellinger wrote: > Hey Mike, > > On Tue, 2017-03-21 at 09:14 -0400, Mike Christie wrote: >> On 02/26/2017 10:03 PM, Mike Christie wrote: >>> The following patches fix a oops that occurs when the initiator >>> is trying to relogin to a iscsi target. The problem occurs, when >>> the initiator has sent a command that is stuck running on some >>> backend, and the initiator has sent TMFs and eventually escalated >>> to session level recovery. >>> >>> During the relogin operation, the target will wait for the stuck >>> command to complete, and the initiator may time out the relogin >>> request and drop that tcp connection and retry. The target will >>> then free the iscsi connection structs from under the np_thread >>> and we will crash. >>> >>> Patches were made over the target-pending for-next branch. >>> >> >> Hey Nick, >> >> Were these patches ok? >> >> I can rebuild them against your current tree, but before I do that, I >> was thinking there might be a cleaner alternative you know about. >> >> I think my patches are a little ugly. The behavior for the modified >> functions is now more difficult to follow because you can sleep, not >> sleep and now interruptible sleep, and you can end up retrying them and >> going down different branches on the retry. >> >> I think the alternative is some sort ref counting based teardown in that >> login error path. > > I haven't forgot about these, and apologies on the extended delay for a > proper follow-up. He He, I was a bad initiator maintainer in the end so I can't complain :) > > What I'm confused by is the particular scenario described in the patch. > That is, it's the same scenario DATERA Q/A and automation routinely > tests on v4.1.y and v3.14.y with a few thousand active volumes. So far > we've not triggered a reproduction like the one described above. > > Namely, where a backend driver takes an extended amount of time to > complete an outstanding se_cmd, resulting in ABORT_TASK and LUN_RESET, > followed by a session reinstatement that occurs while se_cmd is still > outstanding to backend driver code. > > If a session reinstatement fails due to it's login attempt taking longer > than TA_LOGIN_TIMEOUT=15 seconds since the se_cmd in question still > didn't complete, iscsi_handle_login_thread_timeout() fires and sends > SIGINT to iscsi_np->np_thread. > > If iscsi_check_for_session_reinstatement() is already blocked on > iscsi_stop_session() -> wait_for_completion(), it will wait indefinitely > until the se_cmd in question is completed back to target-core before > allowing login to make forward progress, or fail due to the login > timeout. This is where we hit the problem. At this time while in the wait, the initiator gives up (normally hit a iscsi login timeout on the initiator side) on the login attempt and just drops the tcp/ip connection. On the target side we detect this and iscsi_target_sk_state_change runs and iscsi_target_do_cleanup which frees the iscsi_login related resources. When the command eventually completes, we wake from the wait_for_completion and try to access the freed iscsi_login struct. The problem is that iscsi_check_for_session_reinstatement -> iscsi_target_check_for_existing_instances will return 0 after the command has completed so the login thread does not know that login has failed due to the tcp/ip connection getting dropped and the iscsi_login struct has been freed. It will then try to access the freed iscsi_login struct and proceed with the login process. > > If iscsi_check_for_session_reinstatement hasn't been reached yet or > hasn't blocked on wait_for_completion(), the SIGINT should fail the > connection the next time it attempts to do socket I/O. > > From what I can gather from the original problem statement, you are > hitting something different than these two cases, right..? > > So I'd really like to reproduce what you've seen to trigger the > scenario, and jump into kgdb and see what's going on. Would you mind > giving me more details wrt you've been have to reproduce this, and even > better, some debug code to reproduce at will..? > I will send a patch for scsi_debug that can simulate the problem. -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html