Re: [PATCH 0/4] iscsi target: Fix oops during relogin

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Tue, 11 Apr 2017 22:53:01 -0700

Hi MNC,

On Mon, 2017-04-03 at 16:44 -0500, Mike Christie wrote:
> On 03/31/2017 11:54 AM, Mike Christie wrote:
> > On 03/31/2017 01:00 AM, Nicholas A. Bellinger wrote:

<SNIP>

> > 
> > This is where we hit the problem.
> > 
> > At this time while in the wait, the initiator gives up (normally it hits
> > an iscsi login timeout on the initiator side) on the login attempt and just
> > drops the tcp/ip connection. On the target side we detect this and
> > iscsi_target_sk_state_change runs and iscsi_target_do_cleanup which
> > frees the iscsi_login related resources.
> > 
> > When the command eventually completes, we wake from the
> > wait_for_completion and try to access the freed iscsi_login struct.
> > 
> > The problem is that iscsi_check_for_session_reinstatement ->
> > iscsi_target_check_for_existing_instances will return 0 after the
> > command has completed so the login thread does not know that login has
> > failed due to the tcp/ip connection getting dropped and the iscsi_login
> > struct has been freed. It will then try to access the freed iscsi_login
> > struct and proceed with the login process.
> > 
> > 
> > 
> > 
> >>
> >> If iscsi_check_for_session_reinstatement hasn't been reached yet or
> >> hasn't blocked on wait_for_completion(), the SIGINT should fail the
> >> connection the next time it attempts to do socket I/O.
> >>
> >> From what I can gather from the original problem statement, you are
> >> hitting something different than these two cases, right..?
> >>
> >> So I'd really like to reproduce what you've seen to trigger the
> >> scenario, and jump into kgdb and see what's going on.  Would you mind
> >> giving me more details wrt how you've been able to reproduce this, and even
> >> better, some debug code to reproduce at will..?
> >>
> > 
> > I will send a patch for scsi_debug that can simulate the problem.
> > 
> 
> Attached is a patch to scsi_debug, scsi-debug-hang-abort.patch, which
> will hang the abort process so you can simulate commands that get stuck.
> Just export the scsi_debug /dev/sdX as a pscsi backend device and use
> these settings for scsi_debug:
> 
> 1. every_nth = 30 (set this after the initial login through sysfs on the
> target side /sys/module/scsi_debug/parameters/every_nth, so you do not
> hit scanning related issues)
> 2. abort_sleep = 120 (you might need to increase this depending on your
> timeouts below)
> 3. opts = 0x4
> 
> On the initiator side, use these settings to speed up the failure:
> 
> 1. Set /sys/block/sdX/device/timeout to 5.
> 2. node.session.timeo.replacement_timeout = 5
> 3. node.conn[0].timeo.login_timeout = 30
> 4. node.conn[0].timeo.noop_out_timeout and
> node.conn[0].timeo.noop_out_interval = 5
> 5. node.session.err_timeo.abort_timeout = 5
> 
> On the target side, I am using the default settings.
> 
> Then just do some simple IO until you hit the every_nth limit. Do
> something like
> 
> dd if=/dev/sdX of=/dev/null iflag=direct count=1
> 
> A couple times until you hit the every_nth setting, so you do not end up
> with a lot of stuck IO on the target side.
> 
> I also attached the oops I see in the attachment iscsi-relogin-bug. It
> was made against master in target-pending.
> 
> 

Thanks a lot for the reproduction details.

Still thinking about how this patch should look, but introducing an
iscsi_conn->login_kref, so that the existing logic in
iscsi_target_do_cleanup() runs only on the final put and never while
iscsi_target_do_login_rx() is still active, is probably going to be the
most straightforward approach.
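
To make the refcounting idea concrete, here is a rough userspace sketch,
not the actual patch. It assumes the login state carries a reference
count and that cleanup is only performed by whoever drops the last
reference. The names login_ctx, login_ctx_alloc(), login_ctx_get() and
login_ctx_put() are hypothetical stand-ins, and C11 atomics are used in
place of the kernel's kref helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the login state that must not be freed early. */
struct login_ctx {
	atomic_int refcount;	/* plays the role of the proposed login_kref */
	bool *released;		/* lets the caller observe when cleanup ran */
};

/* Allocate the context holding one initial reference. */
static struct login_ctx *login_ctx_alloc(bool *released)
{
	struct login_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return NULL;
	atomic_init(&ctx->refcount, 1);
	ctx->released = released;
	*released = false;
	return ctx;
}

/* Take an extra reference, e.g. before the login thread starts working. */
static void login_ctx_get(struct login_ctx *ctx)
{
	atomic_fetch_add(&ctx->refcount, 1);
}

/*
 * Drop a reference. Cleanup runs only on the final put, so a socket
 * state change that drops its reference while the login thread still
 * holds one cannot free the context underneath that thread.
 * Returns true if this call performed the cleanup.
 */
static bool login_ctx_put(struct login_ctx *ctx)
{
	int old = atomic_fetch_sub(&ctx->refcount, 1);

	assert(old > 0);
	if (old == 1) {
		*ctx->released = true;
		free(ctx);
		return true;
	}
	return false;
}
```

In this sketch the login thread would call login_ctx_get() before it can
block, and both the cleanup path and the login thread would call
login_ctx_put() when done. Whichever side finishes last frees the state.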

In any event, this is at the top of my list and I should have something
to start testing by the end of the week.

Thank you.
