On 03/31/2017 11:54 AM, Mike Christie wrote: > On 03/31/2017 01:00 AM, Nicholas A. Bellinger wrote: >> Hey Mike, >> >> On Tue, 2017-03-21 at 09:14 -0400, Mike Christie wrote: >>> On 02/26/2017 10:03 PM, Mike Christie wrote: >>>> The following patches fix a oops that occurs when the initiator >>>> is trying to relogin to a iscsi target. The problem occurs, when >>>> the initiator has sent a command that is stuck running on some >>>> backend, and the initiator has sent TMFs and eventually escalated >>>> to session level recovery. >>>> >>>> During the relogin operation, the target will wait for the stuck >>>> command to complete, and the initiator may time out the relogin >>>> request and drop that tcp connection and retry. The target will >>>> then free the iscsi connection structs from under the np_thread >>>> and we will crash. >>>> >>>> Patches were made over the target-pending for-next branch. >>>> >>> >>> Hey Nick, >>> >>> Were these patches ok? >>> >>> I can rebuild them against your current tree, but before I do that, I >>> was thinking there might be a cleaner alternative you know about. >>> >>> I think my patches are a little ugly. The behavior for the modified >>> functions is now more difficult to follow because you can sleep, not >>> sleep and now interruptible sleep, and you can end up retrying them and >>> going down different branches on the retry. >>> >>> I think the alternative is some sort ref counting based teardown in that >>> login error path. >> >> I haven't forgot about these, and apologies on the extended delay for a >> proper follow-up. > > He He, I was a bad initiator maintainer in the end so I can't complain :) > >> >> What I'm confused by is the particular scenario described in the patch. >> That is, it's the same scenario DATERA Q/A and automation routinely >> tests on v4.1.y and v3.14.y with a few thousand active volumes. So far >> we've not triggered a reproduction like the one described above. >> >> Namely, where a backend driver takes an extended amount of time to >> complete an outstanding se_cmd, resulting in ABORT_TASK and LUN_RESET, >> followed by a session reinstatement that occurs while se_cmd is still >> outstanding to backend driver code. >> >> If a session reinstatement fails due to it's login attempt taking longer >> than TA_LOGIN_TIMEOUT=15 seconds since the se_cmd in question still >> didn't complete, iscsi_handle_login_thread_timeout() fires and sends >> SIGINT to iscsi_np->np_thread. >> >> If iscsi_check_for_session_reinstatement() is already blocked on >> iscsi_stop_session() -> wait_for_completion(), it will wait indefinitely >> until the se_cmd in question is completed back to target-core before >> allowing login to make forward progress, or fail due to the login >> timeout. > > This is where we hit the problem. > > At this time while in the wait, the initiator gives up (normally hit a > iscsi login timeout on the initiator side) on the login attempt and just > drops the tcp/ip connection. On the target side we detect this and > iscsi_target_sk_state_change runs and iscsi_target_do_cleanup which > frees the iscsi_login related resources. > > When the command eventually completes, we wake from the > wait_for_completion and try to access the freed iscsi_login struct. > > The problem is that iscsi_check_for_session_reinstatement -> > iscsi_target_check_for_existing_instances will return 0 after the > command has completed so the login thread does not know that login has > failed due to the tcp/ip connection getting dropped and the iscsi_login > struct has been freed. It will then try to access the freed iscsi_login > struct and proceed with the login process. > > > > >> >> If iscsi_check_for_session_reinstatement hasn't been reached yet or >> hasn't blocked on wait_for_completion(), the SIGINT should fail the >> connection the next time it attempts to do socket I/O. >> >> From what I can gather from the original problem statement, you are >> hitting something different than these two cases, right..? >> >> So I'd really like to reproduce what you've seen to trigger the >> scenario, and jump into kgdb and see what's going on. Would you mind >> giving me more details wrt you've been have to reproduce this, and even >> better, some debug code to reproduce at will..? >> > > I will send a patch for scsi_debug that can simulate the problem. > Attached is a patch to scsi_debug, scsi-debug-hang-abort.patch, which will hang the abort process so you can simulate commands that get stuck. Just export the scsi_debug /dev/sdX as a pscsi backend device and use these settings for scsi_debug: 1. every_nth = 30 (set this after the initial login through sysfs on the target side /sys/module/scsi_debug/parameters/every_nth, so you do not hit scanning related issues) 2. abort_sleep = 120 (you might need to increase this depending on your timeouts below) 3. opts = 0x4 On the initiator side, use these settings to speed up the failure: 1. Set /sys/block/sdX/device/timeout to 5. 2. node.session.timeo.replacement_timeout = 5 3. node.conn[0].timeo.login_timeout = 30 4. node.conn[0].timeo.noop_out_timeout and node.conn[0].timeo.noop_out_interval = 5 5. node.session.err_timeo.abort_timeout = 5 On the target side, I am using the default settings. Then just do some simple IO until you hit the every_nth limit. Do something like dd if=/dev/sdX of=/dev/null iflag=direct count=1 A couple times until you hit the every_nth setting, so you do not end up with a lot of stuck IO on the target side. I also attached the oops I see in the attachment iscsi-relogin-bug. It was made against master in target-pending.
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c index 17249c3..498a7f2 100644 --- a/drivers/scsi/scsi_debug.c +++ b/drivers/scsi/scsi_debug.c @@ -43,6 +43,7 @@ #include <linux/hrtimer.h> #include <linux/uuid.h> #include <linux/t10-pi.h> +#include <linux/delay.h> #include <net/checksum.h> @@ -110,6 +111,7 @@ #define DEF_DIF 0 #define DEF_DIX 0 #define DEF_D_SENSE 0 +#define DEF_ABORT_SLEEP 0 #define DEF_EVERY_NTH 0 #define DEF_FAKE_RW 0 #define DEF_GUARD 0 @@ -576,6 +578,7 @@ enum sdeb_opcode_index { static int sdebug_dif = DEF_DIF; static int sdebug_dix = DEF_DIX; static int sdebug_dsense = DEF_D_SENSE; +static int sdebug_abort_sleep = DEF_ABORT_SLEEP; static int sdebug_every_nth = DEF_EVERY_NTH; static int sdebug_fake_rw = DEF_FAKE_RW; static unsigned int sdebug_guard = DEF_GUARD; @@ -3782,6 +3785,16 @@ static int scsi_debug_abort(struct scsi_cmnd *SCpnt) { bool ok; + if (sdebug_abort_sleep > 0) { + sdev_printk(KERN_INFO, SCpnt->device, + "scsi_debug_abort sleeping for %d seconds\n", + sdebug_abort_sleep); + ssleep(sdebug_abort_sleep); + sdev_printk(KERN_INFO, SCpnt->device, + "scsi_debug_abort done sleeping for %d seconds\n", + sdebug_abort_sleep); + } + ++num_aborts; if (SCpnt) { ok = stop_queued_cmnd(SCpnt); @@ -4148,6 +4161,7 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip, module_param_named(dif, sdebug_dif, int, S_IRUGO); module_param_named(dix, sdebug_dix, int, S_IRUGO); module_param_named(dsense, sdebug_dsense, int, S_IRUGO | S_IWUSR); +module_param_named(abort_sleep, sdebug_abort_sleep, int, S_IRUGO | S_IWUSR); module_param_named(every_nth, sdebug_every_nth, int, S_IRUGO | S_IWUSR); module_param_named(fake_rw, sdebug_fake_rw, int, S_IRUGO | S_IWUSR); module_param_named(guard, sdebug_guard, uint, S_IRUGO);
[ 387.250623] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 387.250637] IP: iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod] [ 387.250638] PGD 0 [ 387.250641] Oops: 0000 [#1] SMP [ 387.250642] Modules linked in: iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_user uio target_core_mod configfs fuse nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dm_mirror dm_region_hash dm_log coretemp ext4 x86_pkg_temp_thermal kvm_intel jbd2 mbcache kvm snd_hda_codec_hdmi irqbypass crc32_pclmul ghash_clmulni_intel pcbc snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 crypto_simd snd_hda_intel cryptd snd_hda_codec glue_helper snd_hda_core iTCO_wdt snd_hwdep iTCO_vendor_support snd_seq dm_mod dell_smm_hwmon snd_seq_device dcdbas snd_pcm snd_timer i2c_i801 snd lpc_ich pcspkr [ 387.250682] soundcore sg i2c_core shpchp mfd_core wmi hed nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables autofs4 xfs libcrc32c sr_mod sd_mod cdrom crc32c_intel ixgbe mdio xhci_pci hwmon qla2xxx e1000e ahci dca xhci_hcd libahci ptp scsi_transport_fc pps_core ipv6 crc_ccitt [ 387.250706] CPU: 18 PID: 10413 Comm: iscsi_np Tainted: G O 4.11.0-rc1+ #1 [ 387.250707] Hardware name: Dell Inc. Precision Tower 7810/0GWHMW, BIOS A02 09/05/2014 [ 387.250708] task: ffff88044aa65a80 task.stack: ffffc90004330000 [ 387.250713] RIP: 0010:iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod] [ 387.250714] RSP: 0018:ffffc90004333d80 EFLAGS: 00010286 [ 387.250715] RAX: ffff8804492ce000 RBX: ffff880442213800 RCX: ffff88044d590cdc [ 387.250716] RDX: ffff8804492ce000 RSI: 0000000000000002 RDI: 0000000000000006 [ 387.250717] RBP: ffffc90004333df8 R08: 0000000000000000 R09: 0000000000000000 [ 387.250718] R10: 0000000000000000 R11: ffff880439bae029 R12: 0000000000000008 [ 387.250718] R13: ffff88044d590cdc R14: ffff88044d590cc0 R15: ffff880449360820 [ 387.250720] FS: 0000000000000000(0000) GS:ffff88045fc80000(0000) knlGS:0000000000000000 [ 387.250721] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 387.250721] CR2: 0000000000000008 CR3: 0000000001c09000 CR4: 00000000001406e0 [ 387.250722] Call Trace: [ 387.250730] iscsi_target_do_login+0x143/0x610 [iscsi_target_mod] [ 387.250736] iscsi_target_start_negotiation+0x4e/0xa0 [iscsi_target_mod] [ 387.250741] iscsi_target_login_thread+0xa73/0xfc0 [iscsi_target_mod] [ 387.250744] ? __schedule+0x2bb/0x950 [ 387.250746] kthread+0x101/0x140 [ 387.250751] ? iscsi_target_login_sess_out+0x250/0x250 [iscsi_target_mod] [ 387.250752] ? queued_spin_unlock+0x20/0x20 [ 387.250754] ret_from_fork+0x2c/0x40 [ 387.250755] Code: 89 e5 41 57 41 56 41 55 49 89 cd 41 54 4d 8d 60 08 53 48 83 ec 50 8b 01 48 89 55 b8 4c 89 45 a0 40 88 75 97 48 01 d0 48 89 45 98 <49> 8b 40 08 49 39 c4 48 8d 58 e0 0f 84 59 02 00 00 45 31 ff 31 [ 387.250781] RIP: iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod] RSP: ffffc90004333d80 [ 387.250781] CR2: 0000000000000008 [ 387.250783] ---[ end trace d5c6a015e9661bc0 ]---