Re: [PATCH 0/4] iscsi target: Fix oops during relogin

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Thu, 30 Mar 2017 23:00:40 -0700

Hey Mike,

On Tue, 2017-03-21 at 09:14 -0400, Mike Christie wrote:
> On 02/26/2017 10:03 PM, Mike Christie wrote:
> > The following patches fix an oops that occurs when the initiator
> > is trying to relogin to an iscsi target. The problem occurs when
> > the initiator has sent a command that is stuck running on some
> > backend, and the initiator has sent TMFs and eventually escalated
> > to session level recovery.
> > 
> > During the relogin operation, the target will wait for the stuck
> > command to complete, and the initiator may time out the relogin
> > request and drop that tcp connection and retry. The target will
> > then free the iscsi connection structs from under the np_thread
> > and we will crash.
> > 
> > Patches were made over the target-pending for-next branch.
> > 
> 
> Hey Nick,
> 
> Were these patches ok?
> 
> I can rebuild them against your current tree, but before I do that, I
> was thinking there might be a cleaner alternative you know about.
> 
> I think my patches are a little ugly. The behavior for the modified
> functions is now more difficult to follow because you can sleep, not
> sleep and now interruptible sleep, and you can end up retrying them and
> going down different branches on the retry.
> 
> I think the alternative is some sort of ref counting based teardown in
> that login error path.

I haven't forgotten about these, and apologies on the extended delay for
a proper follow-up.

What I'm confused by is the particular scenario described in the patch.
That is, it's the same scenario DATERA Q/A and automation routinely
test on v4.1.y and v3.14.y with a few thousand active volumes.  So far
we've not triggered a reproduction like the one described above.

Namely, where a backend driver takes an extended amount of time to
complete an outstanding se_cmd, resulting in ABORT_TASK and LUN_RESET,
followed by a session reinstatement that occurs while se_cmd is still
outstanding to backend driver code.

If a session reinstatement fails due to its login attempt taking longer
than TA_LOGIN_TIMEOUT=15 seconds since the se_cmd in question still
didn't complete, iscsi_handle_login_thread_timeout() fires and sends
SIGINT to iscsi_np->np_thread.

If iscsi_check_for_session_reinstatement() is already blocked on
iscsi_stop_session() -> wait_for_completion(), it will wait indefinitely
until the se_cmd in question is completed back to target-core before
allowing login to make forward progress, or fail due to the login
timeout.

If iscsi_check_for_session_reinstatement() hasn't been reached yet or
hasn't blocked on wait_for_completion(), the SIGINT should fail the
connection the next time it attempts to do socket I/O.
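
To make sure we're talking about the same two cases, here is a rough
userspace toy model of them.  It is not kernel code: model_relogin(),
struct login_result and the enum names are made up for illustration, and
only the 15 second TA_LOGIN_TIMEOUT value comes from the description
above.  It also assumes the login timer has already fired in the second
case.

```c
#include <stdbool.h>

/* Value quoted above; everything else here is a hypothetical toy model. */
#define TA_LOGIN_TIMEOUT_SEC 15

enum login_outcome {
	LOGIN_PROCEEDS,		/* se_cmd completed before the login timer fired */
	LOGIN_FAILS_ON_TIMEOUT	/* login timer fired and SIGINT hit np_thread */
};

struct login_result {
	enum login_outcome outcome;
	int resolved_at_sec;	/* seconds after login start when outcome is known */
};

/*
 * blocked_in_wait: np_thread is already sleeping in
 *                  iscsi_stop_session() -> wait_for_completion()
 * cmd_done_sec:    when the stuck se_cmd completes back to target-core
 */
static struct login_result model_relogin(bool blocked_in_wait, int cmd_done_sec)
{
	struct login_result r;

	if (blocked_in_wait) {
		/* Case 1: nothing is resolved until the se_cmd completes. */
		r.resolved_at_sec = cmd_done_sec;
		r.outcome = cmd_done_sec < TA_LOGIN_TIMEOUT_SEC ?
			    LOGIN_PROCEEDS : LOGIN_FAILS_ON_TIMEOUT;
	} else {
		/* Case 2: timer assumed fired; the next socket I/O fails. */
		r.resolved_at_sec = TA_LOGIN_TIMEOUT_SEC;
		r.outcome = LOGIN_FAILS_ON_TIMEOUT;
	}
	return r;
}
```

So in this model, a command that completes at 40 seconds with np_thread
blocked resolves the login (as a failure) only at 40 seconds, while the
unblocked case fails around the 15 second mark.  In neither case do I
see where the connection would get freed from under np_thread.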

From what I can gather from the original problem statement, you are
hitting something different than these two cases, right..?

So I'd really like to reproduce what you've seen to trigger the
scenario, and jump into kgdb and see what's going on.  Would you mind
giving me more details wrt how you've been able to reproduce this, and
even better, some debug code to reproduce at will..?
