On 11/04/2016 02:52 PM, Nicholas A. Bellinger wrote:
Hi Zhu & Co,
Thanks for the detailed logs. Comments below.
On Mon, 2016-10-31 at 16:51 +0800, Zhu Lingshan wrote:
Hi Nicholas,
(sorry it would be a long mail)
Sorry for the delay; I spent some time on testing and debugging. I found that the patch
http://www.spinics.net/lists/target-devel/msg13530.html can solve two
issues:
(a) iscsit_stop_session() at the top of the iscsi_np stack.
(b) iscsi_check_for_session_reinstatement() at the top of the iscsi_np
stack, where a reboot was required to recover.
Great, thanks for confirming the patch above.
The key point for this scenario, and the other scenarios below, is that once
target-core ABORT_TASK logic locates a se_cmd descriptor with IBLOCK backend
I/O still outstanding, both iscsi session reinstatement in iscsi_np context
and iscsi session shutdown in iscsi_t[t,r]x context are blocked until the
outstanding se_cmd I/Os are completed back to target-core by the backend
driver.
Note this is expected behavior during target-core ABORT_TASK and
iscsi-target session reinstatement + session shutdown.
But I also found a few more issues.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Please let me explain my setup:
I have a target server running kernel 4.4.21. It exports a single SATA HDD as
one LUN (only one LUN).
I have two initiators, and both log in to the target.
I created two partitions on the LUN, and each initiator mounts its own
partition, e.g. initiator1 mounts /dev/sdc1 and initiator2 mounts /dev/sdc2.
It looks like this:
lszhu_DEV:~ # lsscsi
[0:0:0:0] cd/dvd HL-DT-ST DVD+-RW GHB0N A1C0 /dev/sr0
[4:0:0:0] disk ATA TOSHIBA DT01ACA2 MX4O /dev/sda
[9:0:0:0] disk LIO-ORG IBLOCK 4.0 /dev/sdc
lszhu_DEV:~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk
├─sda1 8:1 0 16G 0 part [SWAP]
├─sda2 8:2 0 200G 0 part /
└─sda3 8:3 0 1.6T 0 part /home
sdc 8:32 0 465.8G 0 disk
├─sdc1 8:33 0 200G 0 part
└─sdc2 8:34 0 265.8G 0 part /mnt
sr0 11:0 1 1024M 0 rom
bj-ha-5:~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 8G 0 part [SWAP]
├─sda2 8:2 0 100G 0 part /
└─sda3 8:3 0 357.8G 0 part /home
sdb 8:16 0 465.8G 0 disk
sdc 8:32 0 465.8G 0 disk
├─sdc1 8:33 0 200G 0 part /mnt
└─sdc2 8:34 0 265.8G 0 part
sr0 11:0 1 1024M 0 rom
So each initiator reads and writes only its own partition on the LUN.
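For reference, the LUN export on the target side was set up roughly like the
following (the backstore name, backing device path, and target IQN below are
placeholders, not the exact values from my configuration):

# create an IBLOCK backstore on the SATA HDD and export it as a LUN
targetcli /backstores/block create name=iblock0 dev=/dev/sdX
targetcli /iscsi create iqn.2016-10.com.example:target0
targetcli /iscsi/iqn.2016-10.com.example:target0/tpg1/luns create /backstores/block/iblock0
# a default portal may already exist on 0.0.0.0:3260
targetcli /iscsi/iqn.2016-10.com.example:target0/tpg1/portals create 192.168.100.233 3260
# allow both initiators (IQNs as seen in the dmesg output below)
targetcli /iscsi/iqn.2016-10.com.example:target0/tpg1/acls create iqn.1996-04.de.suse:01:faad5846cde9
targetcli /iscsi/iqn.2016-10.com.example:target0/tpg1/acls create iqn.1996-04.de.suse:01:4039926d2313
targetcli saveconfig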
Each partition is mounted like this:
mount -o dioread_nolock /dev/sdc1 /mnt
The dioread_nolock option helps us reproduce this bug a little quicker.
Then run fio on both initiators like this:
fio -filename=/mnt/file1 -direct=1 -rw=randread -iodepth=64
-ioengine=libaio -bs=64K -size=5G -numjobs=24 -runtime=60000 -time_based
-group_reporting -name=init1
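The second initiator runs an equivalent job against its own mount point; only
the job name differs (the exact file and job names here are from memory, not
copied from the second node):

fio -filename=/mnt/file1 -direct=1 -rw=randread -iodepth=64 \
    -ioengine=libaio -bs=64K -size=5G -numjobs=24 -runtime=60000 -time_based \
    -group_reporting -name=init2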
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Here are the three additional issues I found (kernel version 4.4.21).
(1) The first one looks like this. It is quite rare (seen only once), I failed
to get stack information for iscsi_trx, and it is hard to reproduce. The call
stack of iscsi_np is here:
bj-ha-3:~ # ps -aux | grep iscsi
root 10501 0.0 0.0 0 0 ? D 16:45 0:00 [iscsi_np]
root 10519 0.4 0.0 0 0 ? S 16:49 0:02 [iscsi_ttx]
root 10520 2.2 0.0 0 0 ? S 16:49 0:14 [iscsi_trx]
root 10533 0.9 0.0 0 0 ? D 16:54 0:03 [iscsi_trx]
root 10547 0.0 0.0 10508 1552 pts/0 S+ 16:59 0:00 grep
--color=auto iscsi
bj-ha-3:~ # cat /proc/10501/stack
[<ffffffff815334c9>] inet_csk_accept+0x269/0x2e0
[<ffffffff8155f4aa>] inet_accept+0x2a/0x100
[<ffffffff814d2a88>] kernel_accept+0x48/0xa0
[<ffffffffa06fe871>] iscsit_accept_np+0x31/0x230 [iscsi_target_mod]
[<ffffffffa06fefab>] iscsi_target_login_thread+0xeb/0xfd0 [iscsi_target_mod]
[<ffffffff810996bd>] kthread+0xbd/0xe0
[<ffffffff815e15bf>] ret_from_fork+0x3f/0x70
[<ffffffff81099600>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
The iscsi_np here looks like it's doing a normal Linux/NET accept.
Is the uninterruptible sleep for PID=10533 waiting for outstanding I/O
to complete during iscsit_close_connection() shutdown after
ABORT_TASK..?
Sorry, I have been trying to reproduce this case for a week, but I cannot
reproduce it anymore; it is quite rare.
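Since the hang is hard to catch by hand, next time I will leave a small
watcher running on the target that dumps the kernel stack of any iscsi thread
stuck in D state (just a sketch; the log path is arbitrary and I have not run
it on the failing box yet):

#!/bin/bash
# poll for iscsi_np/iscsi_trx/iscsi_ttx threads in uninterruptible sleep
while true; do
    for pid in $(ps -eo pid,stat,comm | awk '$2 ~ /^D/ && $3 ~ /^iscsi/ {print $1}'); do
        echo "=== $(date) pid $pid $(cat /proc/$pid/comm 2>/dev/null) ==="
        cat /proc/$pid/stack 2>/dev/null
    done >> /var/log/iscsi-dstate-stacks.log
    sleep 5
done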
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2) The second issue I found has iscsi_check_for_session_reinstatement() at
the top of the iscsi_np call stack, but it differs from case (b) mentioned
above. In case (b) we had to reboot the target to recover, but in this case it
recovers automatically. Here are the details:
dmesg:
[ 100.487421] iSCSI/iqn.1996-04.de.suse:01:faad5846cde9: Unsupported
SCSI Opcode 0xa3, sending CHECK_CONDITION.
[ 182.582323] ABORT_TASK: Found referenced iSCSI task_tag: 15
[ 197.616800] Unexpected ret: -32 send data 48
[ 197.712278] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 15
[ 263.603305] ABORT_TASK: Found referenced iSCSI task_tag: 268435537
[ 278.640172] Unexpected ret: -32 send data 48
[ 278.671299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
268435537
[ 341.584329] ABORT_TASK: Found referenced iSCSI task_tag: 536871038
[ 357.423710] Unexpected ret: -32 send data 48
[ 373.660476] iSCSI Login timeout on Network Portal 192.168.100.233:3260
Great, recovering from uninterruptible sleep once the backend (finally)
completes outstanding I/O here confirms the earlier SCF_ACK_KREF patch.
So the backend device in question looks like it's repeatedly taking an
extended time to complete I/O; at least 15 seconds plus the host
initiator side ABORT_TASK timeout.
Is the device failing..?
I tried many times (more than 50), and the I_T nexus always stays alive. After
I killed the surviving node (we have two initiator nodes: one fails, one keeps
running), I saw the former D-state threads on the target side go back to S
state, and I can read and write the LUN; the mount point is still active. The
target is fully functional! Thanks for the patch!
But I still see messages like "[ 887.448142] sd 8:0:0:0: rejecting I/O
to offline device" in dmesg on the failed initiator. I guess the target cannot
respond to the failed initiator during that time because it is fully occupied
by the other initiator, and that HDD performance is the bottleneck.
But I think there should not be an "offline device". Yes, I/O traffic on the
failed initiator is stuck, but we still have plenty of free network bandwidth
(fio shows I/O traffic < 20 MB/s) as well as spare CPU and RAM, so it should
still be possible to respond at the protocol layer instead of ending up with
an "offline device".
Is it possible to investigate further (I will do that)? Maybe on the
initiator side; any suggestions?
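One knob I plan to look at on the initiator side is the open-iscsi session
recovery timeout, which, as far as I understand, controls how long the SCSI
layer waits during session recovery before marking the device offline (the
300-second value below is only an example):

# current value for the running session(s)
cat /sys/class/iscsi_session/session*/recovery_tmo
# default for new sessions
grep replacement_timeout /etc/iscsi/iscsid.conf
# raise it for an existing node record (value in seconds)
iscsiadm -m node -T <target-iqn> -o update \
    -n node.session.timeo.replacement_timeout -v 300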
call stacks:
bj-ha-3:~ # ps -aux | grep iscsi
root 3063 0.0 0.0 0 0 ? D 15:36 0:00 [iscsi_np]
root 3073 0.3 0.0 0 0 ? S 15:36 0:01 [iscsi_ttx]
root 3074 0.4 0.0 0 0 ? S 15:36 0:01 [iscsi_trx]
root 3094 0.3 0.0 0 0 ? D 15:39 0:00 [iscsi_trx]
root 3116 0.0 0.0 10508 1616 pts/0 S+ 15:42 0:00 grep
--color=auto iscsi
bj-ha-3:~ # cat /proc/3063/stack
[<ffffffffa03fcd97>] iscsi_check_for_session_reinstatement+0x1d7/0x270
[iscsi_target_mod]
[<ffffffffa03ff971>] iscsi_target_do_login+0x111/0x5f0 [iscsi_target_mod]
[<ffffffffa0400b17>] iscsi_target_start_negotiation+0x17/0xa0
[iscsi_target_mod]
[<ffffffffa03fe932>] iscsi_target_login_thread+0xa72/0xfd0
[iscsi_target_mod]
[<ffffffff810997cd>] kthread+0xbd/0xe0
[<ffffffff815e177f>] ret_from_fork+0x3f/0x70
[<ffffffff81099710>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
bj-ha-3:~ # cat /proc/3094/stack
[<ffffffffa03adcc6>] transport_generic_free_cmd+0x76/0x110 [target_core_mod]
[<ffffffffa0404917>] iscsit_free_cmd+0x67/0x130 [iscsi_target_mod]
[<ffffffffa040b929>] iscsit_close_connection+0x4a9/0x860 [iscsi_target_mod]
[<ffffffffa040b0b3>] iscsi_target_rx_thread+0x93/0xa0 [iscsi_target_mod]
[<ffffffff810997cd>] kthread+0xbd/0xe0
[<ffffffff815e177f>] ret_from_fork+0x3f/0x70
[<ffffffff81099710>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
At this moment we see I/O errors on one initiator side while the other
survives. Once we kill the surviving side's fio processes, or simply wait a
while, the target recovers automatically, like this:
bj-ha-3:~ # ps -aux | grep iscsi
root 3063 0.0 0.0 0 0 ? S 15:36 0:00 [iscsi_np]
root 3142 0.0 0.0 0 0 ? S 15:43 0:00 [iscsi_ttx]
root 3143 0.0 0.0 0 0 ? S 15:43 0:00 [iscsi_trx]
root 3146 0.6 0.0 0 0 ? S 15:43 0:00 [iscsi_ttx]
root 3147 0.8 0.0 0 0 ? S 15:43 0:00 [iscsi_trx]
root 3149 0.0 0.0 10508 1636 pts/0 S+ 15:43 0:00 grep
--color=auto iscsi
dmesg at this moment:
[ 90.507364] iSCSI/iqn.1996-04.de.suse:01:4039926d2313: Unsupported
SCSI Opcode 0xa3, sending CHECK_CONDITION.
[ 100.487421] iSCSI/iqn.1996-04.de.suse:01:faad5846cde9: Unsupported
SCSI Opcode 0xa3, sending CHECK_CONDITION.
[ 182.582323] ABORT_TASK: Found referenced iSCSI task_tag: 15
[ 197.616800] Unexpected ret: -32 send data 48
[ 197.712278] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 15
[ 263.603305] ABORT_TASK: Found referenced iSCSI task_tag: 268435537
[ 278.640172] Unexpected ret: -32 send data 48
[ 278.671299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
268435537
[ 341.584329] ABORT_TASK: Found referenced iSCSI task_tag: 536871038
[ 357.423710] Unexpected ret: -32 send data 48
[ 373.660476] iSCSI Login timeout on Network Portal 192.168.100.233:3260
[ 535.852903] Unexpected ret: -32 send data 48
[ 535.923803] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
536871038
[ 535.923811] ABORT_TASK: Found referenced iSCSI task_tag: 76
[ 535.923815] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76
[ 535.924325] Unexpected ret: -104 send data 360
[ 535.924329] tx_data returned -32, expecting 360.
[ 535.924489] iSCSI Login negotiation failed.
[ 535.924916] Unexpected ret: -104 send data 360
[ 535.924918] tx_data returned -32, expecting 360.
[ 535.924943] iSCSI Login negotiation failed.
[ 535.925424] Unexpected ret: -104 send data 360
[ 535.925428] tx_data returned -32, expecting 360.
[ 535.925650] iSCSI Login negotiation failed.
[ 535.926095] Unexpected ret: -104 send data 360
[ 535.926097] tx_data returned -32, expecting 360.
[ 535.926132] iSCSI Login negotiation failed.
[ 535.926349] Unexpected ret: -104 send data 360
[ 535.926351] tx_data returned -32, expecting 360.
[ 535.926369] iSCSI Login negotiation failed.
[ 535.926576] Unexpected ret: -104 send data 360
[ 535.926577] tx_data returned -32, expecting 360.
[ 535.926593] iSCSI Login negotiation failed.
[ 535.926750] Unexpected ret: -104 send data 360
[ 535.926752] tx_data returned -32, expecting 360.
[ 535.926768] iSCSI Login negotiation failed.
[ 535.926905] Unexpected ret: -104 send data 360
[ 535.926906] tx_data returned -32, expecting 360.
[ 535.926925] iSCSI Login negotiation failed.
[ 535.927064] Unexpected ret: -104 send data 360
[ 535.927065] tx_data returned -32, expecting 360.
[ 535.927082] iSCSI Login negotiation failed.
[ 535.927219] Unexpected ret: -104 send data 360
[ 535.927220] tx_data returned -32, expecting 360.
[ 535.927237] iSCSI Login negotiation failed.
I am still debugging this case. I have core dumps, but I think the dump
files are too big for a mailing list; I can send them off-list if needed.
AFAICT, this is all expected behavior from target-core + iscsi-target
when the backend device is not completing I/O for extended periods of
time.
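One way to confirm that would be to watch the backing HDD's latency on the
target while fio is running, and to check the drive's SMART state (replace
/dev/sdX with the SATA disk backing the IBLOCK LUN):

# extended per-device statistics every second; look at await/%util
iostat -x /dev/sdX 1
# drive health, error log and reallocated sector counts
smartctl -a /dev/sdX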
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(3) In this case, dmesg and the call stacks look like this:
bj-ha-3:~ # dmesg
[ 803.570868] ABORT_TASK: Found referenced iSCSI task_tag: 96
[ 818.584038] Unexpected ret: -32 send data 48
[ 818.620210] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 96
[ 882.555923] ABORT_TASK: Found referenced iSCSI task_tag: 268435496
[ 897.587925] Unexpected ret: -32 send data 48
[ 898.845646] Unexpected ret: -32 send data 48
[ 898.916090] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
268435496
[ 898.916095] ABORT_TASK: Found referenced iSCSI task_tag: 45
[ 898.916377] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 45
[ 962.785419] ABORT_TASK: Found referenced iSCSI task_tag: 268435489
bj-ha-3:~ # ps -aux | grep iscsi
root 3063 0.0 0.0 0 0 ? D 15:36 0:00 [iscsi_np]
root 3296 1.9 0.0 0 0 ? S 15:49 0:01 [iscsi_ttx]
root 3297 1.9 0.0 0 0 ? R 15:49 0:01 [iscsi_trx]
root 3299 0.0 0.0 0 0 ? D 15:49 0:00 [iscsi_trx]
root 3506 0.0 0.0 10508 1516 pts/0 S+ 15:51 0:00 grep
--color=auto iscsi
bj-ha-3:~ # cat /proc/3063/stack
[<ffffffffa03fcd97>] iscsi_check_for_session_reinstatement+0x1d7/0x270
[iscsi_target_mod]
[<ffffffffa03ff971>] iscsi_target_do_login+0x111/0x5f0 [iscsi_target_mod]
[<ffffffffa0400b17>] iscsi_target_start_negotiation+0x17/0xa0
[iscsi_target_mod]
[<ffffffffa03fe932>] iscsi_target_login_thread+0xa72/0xfd0
[iscsi_target_mod]
[<ffffffff810997cd>] kthread+0xbd/0xe0
[<ffffffff815e177f>] ret_from_fork+0x3f/0x70
[<ffffffff81099710>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
bj-ha-3:~ # cat /proc/3099/stack
[<ffffffffa03ac784>] __transport_wait_for_tasks+0xb4/0x190 [target_core_mod]
[<ffffffffa03ac905>] transport_wait_for_tasks+0x45/0x60 [target_core_mod]
[<ffffffffa03aa034>] core_tmr_abort_task+0xf4/0x160 [target_core_mod]
[<ffffffffa03aca43>] target_tmr_work+0x123/0x140 [target_core_mod]
[<ffffffff81093ace>] process_one_work+0x14e/0x410
[<ffffffff81094326>] worker_thread+0x116/0x490
[<ffffffff810997cd>] kthread+0xbd/0xe0
[<ffffffff815e177f>] ret_from_fork+0x3f/0x70
[<ffffffff81099710>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
This case is also rare (seen only once), but I got the core dump files.
Looks the same as #2.
-------------------------------------------------------------------------------------------------------------------------------------------------
Sorry again for the delay and for sending such a long mail. I am still working
on these cases and will keep the list updated once I find anything new.
Thanks again for confirming the earlier patch.
Thanks a lot for your help!
BR
Zhu Lingshan