Hi MNC,

On Mon, 2017-04-03 at 16:44 -0500, Mike Christie wrote:
> On 03/31/2017 11:54 AM, Mike Christie wrote:
> > On 03/31/2017 01:00 AM, Nicholas A. Bellinger wrote:

<SNIP>

> >
> > This is where we hit the problem.
> >
> > At this time while in the wait, the initiator gives up (normally hit a
> > iscsi login timeout on the initiator side) on the login attempt and just
> > drops the tcp/ip connection. On the target side we detect this and
> > iscsi_target_sk_state_change runs and iscsi_target_do_cleanup which
> > frees the iscsi_login related resources.
> >
> > When the command eventually completes, we wake from the
> > wait_for_completion and try to access the freed iscsi_login struct.
> >
> > The problem is that iscsi_check_for_session_reinstatement ->
> > iscsi_target_check_for_existing_instances will return 0 after the
> > command has completed so the login thread does not know that login has
> > failed due to the tcp/ip connection getting dropped and the iscsi_login
> > struct has been freed. It will then try to access the freed iscsi_login
> > struct and proceed with the login process.
> >
> >
> >>
> >> If iscsi_check_for_session_reinstatement hasn't been reached yet or
> >> hasn't blocked on wait_for_completion(), the SIGINT should fail the
> >> connection the next time it attempts to do socket I/O.
> >>
> >> From what I can gather from the original problem statement, you are
> >> hitting something different than these two cases, right..?
> >>
> >> So I'd really like to reproduce what you've seen to trigger the
> >> scenario, and jump into kgdb and see what's going on. Would you mind
> >> giving me more details wrt you've been have to reproduce this, and even
> >> better, some debug code to reproduce at will..?
> >>
> >
> > I will send a patch for scsi_debug that can simulate the problem.
> >
>
> Attached is a patch to scsi_debug, scsi-debug-hang-abort.patch, which
> will hang the abort process so you can simulate commands that get stuck.
> Just export the scsi_debug /dev/sdX as a pscsi backend device and use
> these settings for scsi_debug:
>
> 1. every_nth = 30 (set this after the initial login through sysfs on the
> target side /sys/module/scsi_debug/parameters/every_nth, so you do not
> hit scanning related issues)
> 2. abort_sleep = 120 (you might need to increase this depending on your
> timeouts below)
> 3. opts = 0x4
>
> On the initiator side, use these settings to speed up the failure:
>
> 1. Set /sys/block/sdX/device/timeout to 5.
> 2. node.session.timeo.replacement_timeout = 5
> 3. node.conn[0].timeo.login_timeout = 30
> 4. node.conn[0].timeo.noop_out_timeout and
> node.conn[0].timeo.noop_out_interval = 5
> 5. node.session.err_timeo.abort_timeout = 5
>
> On the target side, I am using the default settings.
>
> Then just do some simple IO until you hit the every_nth limit. Do
> something like
>
> dd if=/dev/sdX of=/dev/null iflag=direct count=1
>
> A couple times until you hit the every_nth setting, so you do not end up
> with a lot of stuck IO on the target side.
>
> I also attached the oops I see in the attachment iscsi-relogin-bug. It
> was made against master in target-pending.
>

Thanks a lot for the reproduction details.

Still thinking how this patch should look, but introducing an
iscsi_conn->login_kref to ensure the existing logic in
iscsi_target_do_cleanup() can only be called on final put vs. active
iscsi_target_do_login_rx() is probably going to be the most
straight-forward approach.

In any event, this is at the top of my list and I should have something
to start testing by the end of the week.

Thank you.
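
To make the login_kref idea concrete, here is a rough userspace sketch
of the "cleanup only on final put" rule. It is not the patch and not
kernel code: struct login_state, login_state_get(), login_state_put()
and simulate_dropped_login() are hypothetical names invented for this
illustration, and a C11 atomic counter stands in for a kernel kref. In
the kernel this would presumably be built on kref_get()/kref_put() with
a release callback instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-connection login state.
 * This is NOT the real struct iscsi_login or struct iscsi_conn. */
struct login_state {
	atomic_int refs;	/* plays the role of the proposed login_kref */
	int *cleanup_count;	/* hook so a caller can observe cleanup runs */
};

/* Allocate the state holding one initial reference. */
static struct login_state *login_state_alloc(int *cleanup_count)
{
	struct login_state *ls = malloc(sizeof(*ls));

	if (!ls)
		return NULL;
	atomic_init(&ls->refs, 1);
	ls->cleanup_count = cleanup_count;
	return ls;
}

/* Take an extra reference, e.g. before the login thread blocks. */
static void login_state_get(struct login_state *ls)
{
	int old = atomic_fetch_add(&ls->refs, 1);

	assert(old > 0);	/* must already hold a valid reference */
	(void)old;
}

/* Drop a reference. Cleanup and free happen only on the final put.
 * Returns true if this call released the state. */
static bool login_state_put(struct login_state *ls)
{
	int old = atomic_fetch_sub(&ls->refs, 1);

	assert(old > 0);
	if (old != 1)
		return false;	/* another holder is still active */
	(*ls->cleanup_count)++;	/* stands in for the cleanup work */
	free(ls);
	return true;
}

/* Replays the ordering from the report on a single thread and returns
 * how many times cleanup ran (expected: exactly once), or -1 if the
 * allocation failed. */
static int simulate_dropped_login(void)
{
	int cleanups = 0;
	struct login_state *ls = login_state_alloc(&cleanups);
	bool released;

	if (!ls)
		return -1;

	/* Login thread pins the state before it blocks in the wait. */
	login_state_get(ls);

	/* Initiator drops the connection: the state-change path drops
	 * the initial reference instead of freeing unconditionally. */
	released = login_state_put(ls);
	assert(!released);

	/* Login thread wakes up; the state is still valid to touch. */
	assert(ls->cleanup_count == &cleanups);

	/* Login thread drops its reference; cleanup runs here. */
	released = login_state_put(ls);
	assert(released);
	(void)released;

	return cleanups;
}
```

Calling simulate_dropped_login() walks through the dropped-connection
ordering described above. The point of the design is that whichever
side puts last performs the cleanup, so the woken login thread never
touches freed memory. How the login thread then learns that the
connection is gone, so it can fail the login instead of proceeding, is
a separate question that this sketch does not address.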