Hey Mike, On Tue, 2017-03-21 at 09:14 -0400, Mike Christie wrote: > On 02/26/2017 10:03 PM, Mike Christie wrote: > > The following patches fix a oops that occurs when the initiator > > is trying to relogin to a iscsi target. The problem occurs, when > > the initiator has sent a command that is stuck running on some > > backend, and the initiator has sent TMFs and eventually escalated > > to session level recovery. > > > > During the relogin operation, the target will wait for the stuck > > command to complete, and the initiator may time out the relogin > > request and drop that tcp connection and retry. The target will > > then free the iscsi connection structs from under the np_thread > > and we will crash. > > > > Patches were made over the target-pending for-next branch. > > > > Hey Nick, > > Were these patches ok? > > I can rebuild them against your current tree, but before I do that, I > was thinking there might be a cleaner alternative you know about. > > I think my patches are a little ugly. The behavior for the modified > functions is now more difficult to follow because you can sleep, not > sleep and now interruptible sleep, and you can end up retrying them and > going down different branches on the retry. > > I think the alternative is some sort ref counting based teardown in that > login error path. I haven't forgot about these, and apologies on the extended delay for a proper follow-up. What I'm confused by is the particular scenario described in the patch. That is, it's the same scenario DATERA Q/A and automation routinely tests on v4.1.y and v3.14.y with a few thousand active volumes. So far we've not triggered a reproduction like the one described above. Namely, where a backend driver takes an extended amount of time to complete an outstanding se_cmd, resulting in ABORT_TASK and LUN_RESET, followed by a session reinstatement that occurs while se_cmd is still outstanding to backend driver code. If a session reinstatement fails due to it's login attempt taking longer than TA_LOGIN_TIMEOUT=15 seconds since the se_cmd in question still didn't complete, iscsi_handle_login_thread_timeout() fires and sends SIGINT to iscsi_np->np_thread. If iscsi_check_for_session_reinstatement() is already blocked on iscsi_stop_session() -> wait_for_completion(), it will wait indefinitely until the se_cmd in question is completed back to target-core before allowing login to make forward progress, or fail due to the login timeout. If iscsi_check_for_session_reinstatement hasn't been reached yet or hasn't blocked on wait_for_completion(), the SIGINT should fail the connection the next time it attempts to do socket I/O. >From what I can gather from the original problem statement, you are hitting something different than these two cases, right..? So I'd really like to reproduce what you've seen to trigger the scenario, and jump into kgdb and see what's going on. Would you mind giving me more details wrt you've been have to reproduce this, and even better, some debug code to reproduce at will..? -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html