Re: [PATCH 0/4] iscsi target: Fix oops during relogin

Mike Christie <mchristi@xxxxxxxxxx> · Mon, 3 Apr 2017 16:44:23 -0500

On 03/31/2017 11:54 AM, Mike Christie wrote:
> On 03/31/2017 01:00 AM, Nicholas A. Bellinger wrote:
>> Hey Mike,
>>
>> On Tue, 2017-03-21 at 09:14 -0400, Mike Christie wrote:
>>> On 02/26/2017 10:03 PM, Mike Christie wrote:
>>>> The following patches fix a oops that occurs when the initiator
>>>> is trying to relogin to a iscsi target. The problem occurs, when
>>>> the initiator has sent a command that is stuck running on some
>>>> backend, and the initiator has sent TMFs and eventually escalated
>>>> to session level recovery.
>>>>
>>>> During the relogin operation, the target will wait for the stuck
>>>> command to complete, and the initiator may time out the relogin
>>>> request and drop that tcp connection and retry. The target will
>>>> then free the iscsi connection structs from under the np_thread
>>>> and we will crash.
>>>>
>>>> Patches were made over the target-pending for-next branch.
>>>>
>>>
>>> Hey Nick,
>>>
>>> Were these patches ok?
>>>
>>> I can rebuild them against your current tree, but before I do that, I
>>> was thinking there might be a cleaner alternative you know about.
>>>
>>> I think my patches are a little ugly. The behavior for the modified
>>> functions is now more difficult to follow because you can sleep, not
>>> sleep and now interruptible sleep, and you can end up retrying them and
>>> going down different branches on the retry.
>>>
>>> I think the alternative is some sort ref counting based teardown in that
>>> login error path.
>>
>> I haven't forgot about these, and apologies on the extended delay for a
>> proper follow-up.
> 
> He He, I was a bad initiator maintainer in the end so I can't complain :)
> 
>>
>> What I'm confused by is the particular scenario described in the patch.
>> That is, it's the same scenario DATERA Q/A and automation routinely
>> tests on v4.1.y and v3.14.y with a few thousand active volumes.  So far
>> we've not triggered a reproduction like the one described above.
>>
>> Namely, where a backend driver takes an extended amount of time to
>> complete an outstanding se_cmd, resulting in ABORT_TASK and LUN_RESET,
>> followed by a session reinstatement that occurs while se_cmd is still
>> outstanding to backend driver code.
>>
>> If a session reinstatement fails due to it's login attempt taking longer
>> than TA_LOGIN_TIMEOUT=15 seconds since the se_cmd in question still
>> didn't complete, iscsi_handle_login_thread_timeout() fires and sends
>> SIGINT to iscsi_np->np_thread.
>>
>> If iscsi_check_for_session_reinstatement() is already blocked on
>> iscsi_stop_session() -> wait_for_completion(), it will wait indefinitely
>> until the se_cmd in question is completed back to target-core before
>> allowing login to make forward progress, or fail due to the login
>> timeout.
> 
> This is where we hit the problem.
> 
> At this time while in the wait, the initiator gives up (normally hit a
> iscsi login timeout on the initiator side) on the login attempt and just
> drops the tcp/ip connection. On the target side we detect this and
> iscsi_target_sk_state_change runs and iscsi_target_do_cleanup which
> frees the iscsi_login related resources.
> 
> When the command eventually completes, we wake from the
> wait_for_completion and try to access the freed iscsi_login struct.
> 
> The problem is that iscsi_check_for_session_reinstatement ->
> iscsi_target_check_for_existing_instances will return 0 after the
> command has completed so the login thread does not know that login has
> failed due to the tcp/ip connection getting dropped and the iscsi_login
> struct has been freed. It will then try to access the freed iscsi_login
> struct and proceed with the login process.
> 
> 
> 
> 
>>
>> If iscsi_check_for_session_reinstatement hasn't been reached yet or
>> hasn't blocked on wait_for_completion(), the SIGINT should fail the
>> connection the next time it attempts to do socket I/O.
>>
>> From what I can gather from the original problem statement, you are
>> hitting something different than these two cases, right..?
>>
>> So I'd really like to reproduce what you've seen to trigger the
>> scenario, and jump into kgdb and see what's going on.  Would you mind
>> giving me more details wrt you've been have to reproduce this, and even
>> better, some debug code to reproduce at will..?
>>
> 
> I will send a patch for scsi_debug that can simulate the problem.
> 

Attached is a patch to scsi_debug, scsi-debug-hang-abort.patch, which
will hang the abort process so you can simulate commands that get stuck.
Just export the scsi_debug /dev/sdX as a pscsi backend device and use
these settings for scsi_debug:

1. every_nth = 30 (set this after the initial login through sysfs on the
target side /sys/module/scsi_debug/parameters/every_nth, so you do not
hit scanning related issues)
2. abort_sleep = 120 (you might need to increase this depending on your
timeouts below)
3. opts = 0x4

On the initiator side, use these settings to speed up the failure:

1. Set /sys/block/sdX/device/timeout to 5.
2. node.session.timeo.replacement_timeout = 5
3. node.conn[0].timeo.login_timeout = 30
4. node.conn[0].timeo.noop_out_timeout and
node.conn[0].timeo.noop_out_interval = 5
5. node.session.err_timeo.abort_timeout = 5

On the target side, I am using the default settings.

Then just do some simple IO until you hit the every_nth limit. Do
something like

dd if=/dev/sdX of=/dev/null iflag=direct count=1

A couple times until you hit the every_nth setting, so you do not end up
with a lot of stuck IO on the target side.

I also attached the oops I see in the attachment iscsi-relogin-bug. It
was made against master in target-pending.

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3..498a7f2 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -43,6 +43,7 @@
 #include <linux/hrtimer.h>
 #include <linux/uuid.h>
 #include <linux/t10-pi.h>
+#include <linux/delay.h>
 
 #include <net/checksum.h>
 
@@ -110,6 +111,7 @@
 #define DEF_DIF 0
 #define DEF_DIX 0
 #define DEF_D_SENSE   0
+#define DEF_ABORT_SLEEP 0
 #define DEF_EVERY_NTH   0
 #define DEF_FAKE_RW	0
 #define DEF_GUARD 0
@@ -576,6 +578,7 @@ enum sdeb_opcode_index {
 static int sdebug_dif = DEF_DIF;
 static int sdebug_dix = DEF_DIX;
 static int sdebug_dsense = DEF_D_SENSE;
+static int sdebug_abort_sleep = DEF_ABORT_SLEEP;
 static int sdebug_every_nth = DEF_EVERY_NTH;
 static int sdebug_fake_rw = DEF_FAKE_RW;
 static unsigned int sdebug_guard = DEF_GUARD;
@@ -3782,6 +3785,16 @@ static int scsi_debug_abort(struct scsi_cmnd *SCpnt)
 {
 	bool ok;
 
+	if (sdebug_abort_sleep > 0) {
+		sdev_printk(KERN_INFO, SCpnt->device,
+			    "scsi_debug_abort sleeping for %d seconds\n",
+			    sdebug_abort_sleep);
+		ssleep(sdebug_abort_sleep);
+		sdev_printk(KERN_INFO, SCpnt->device,
+			    "scsi_debug_abort done sleeping for %d seconds\n",
+			    sdebug_abort_sleep);
+	}
+
 	++num_aborts;
 	if (SCpnt) {
 		ok = stop_queued_cmnd(SCpnt);
@@ -4148,6 +4161,7 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip,
 module_param_named(dif, sdebug_dif, int, S_IRUGO);
 module_param_named(dix, sdebug_dix, int, S_IRUGO);
 module_param_named(dsense, sdebug_dsense, int, S_IRUGO | S_IWUSR);
+module_param_named(abort_sleep, sdebug_abort_sleep, int, S_IRUGO | S_IWUSR);
 module_param_named(every_nth, sdebug_every_nth, int, S_IRUGO | S_IWUSR);
 module_param_named(fake_rw, sdebug_fake_rw, int, S_IRUGO | S_IWUSR);
 module_param_named(guard, sdebug_guard, uint, S_IRUGO);
[  387.250623] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  387.250637] IP: iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod]
[  387.250638] PGD 0 

[  387.250641] Oops: 0000 [#1] SMP
[  387.250642] Modules linked in: iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_user uio target_core_mod configfs fuse nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dm_mirror dm_region_hash dm_log coretemp ext4 x86_pkg_temp_thermal kvm_intel jbd2 mbcache kvm snd_hda_codec_hdmi irqbypass crc32_pclmul ghash_clmulni_intel pcbc snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 crypto_simd snd_hda_intel cryptd snd_hda_codec glue_helper snd_hda_core iTCO_wdt snd_hwdep iTCO_vendor_support snd_seq dm_mod dell_smm_hwmon snd_seq_device dcdbas snd_pcm snd_timer i2c_i801 snd lpc_ich pcspkr
[  387.250682]  soundcore sg i2c_core shpchp mfd_core wmi hed nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables autofs4 xfs libcrc32c sr_mod sd_mod cdrom crc32c_intel ixgbe mdio xhci_pci hwmon qla2xxx e1000e ahci dca xhci_hcd libahci ptp scsi_transport_fc pps_core ipv6 crc_ccitt
[  387.250706] CPU: 18 PID: 10413 Comm: iscsi_np Tainted: G           O    4.11.0-rc1+ #1
[  387.250707] Hardware name: Dell Inc. Precision Tower 7810/0GWHMW, BIOS A02 09/05/2014
[  387.250708] task: ffff88044aa65a80 task.stack: ffffc90004330000
[  387.250713] RIP: 0010:iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod]
[  387.250714] RSP: 0018:ffffc90004333d80 EFLAGS: 00010286
[  387.250715] RAX: ffff8804492ce000 RBX: ffff880442213800 RCX: ffff88044d590cdc
[  387.250716] RDX: ffff8804492ce000 RSI: 0000000000000002 RDI: 0000000000000006
[  387.250717] RBP: ffffc90004333df8 R08: 0000000000000000 R09: 0000000000000000
[  387.250718] R10: 0000000000000000 R11: ffff880439bae029 R12: 0000000000000008
[  387.250718] R13: ffff88044d590cdc R14: ffff88044d590cc0 R15: ffff880449360820
[  387.250720] FS:  0000000000000000(0000) GS:ffff88045fc80000(0000) knlGS:0000000000000000
[  387.250721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  387.250721] CR2: 0000000000000008 CR3: 0000000001c09000 CR4: 00000000001406e0
[  387.250722] Call Trace:
[  387.250730]  iscsi_target_do_login+0x143/0x610 [iscsi_target_mod]
[  387.250736]  iscsi_target_start_negotiation+0x4e/0xa0 [iscsi_target_mod]
[  387.250741]  iscsi_target_login_thread+0xa73/0xfc0 [iscsi_target_mod]
[  387.250744]  ? __schedule+0x2bb/0x950
[  387.250746]  kthread+0x101/0x140
[  387.250751]  ? iscsi_target_login_sess_out+0x250/0x250 [iscsi_target_mod]
[  387.250752]  ? queued_spin_unlock+0x20/0x20
[  387.250754]  ret_from_fork+0x2c/0x40
[  387.250755] Code: 89 e5 41 57 41 56 41 55 49 89 cd 41 54 4d 8d 60 08 53 48 83 ec 50 8b 01 48 89 55 b8 4c 89 45 a0 40 88 75 97 48 01 d0 48 89 45 98 <49> 8b 40 08 49 39 c4 48 8d 58 e0 0f 84 59 02 00 00 45 31 ff 31 
[  387.250781] RIP: iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod] RSP: ffffc90004333d80
[  387.250781] CR2: 0000000000000008
[  387.250783] ---[ end trace d5c6a015e9661bc0 ]---