Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

Himanshu Madhani <himanshu.madhani@xxxxxxxxxx> · Fri, 12 Feb 2016 05:30:39 +0000

Hi Nic, 

On 2/11/16, 3:47 PM, "Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> wrote:

>On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
>> On Tue, 2016-02-09 at 18:03 +0000, Himanshu Madhani wrote:
>> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> wrote:
>> > >On Mon, 2016-02-08 at 23:27 +0000, Himanshu Madhani wrote:
>> > >> 
>> > >> I am testing this series with a 4.5.0-rc2+ kernel and I am seeing an
>> > >> issue where triggering sg_reset with the host/device/bus options in a
>> > >> loop at a 120-second interval causes a call stack. At this point,
>> > >> removing the configuration hangs indefinitely. See the attached dmesg
>> > >> output from my setup.
>> > >> 
>> > >
>> > >Thanks a lot for testing this.
>> > >
>> > >So it looks like we're still hitting an indefinite schedule() on
>> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
>> > >occurs, after repeated explicit active I/O remote-port sg_resets.
>> > >
>> > >Does this trigger on the first tcm_qla2xxx session reconnect after
>> > >explicit remote-port sg_reset..?  Are session reconnects actively being
>> > >triggered during the test..?
>> > >
>> > >To verify the latter for iscsi-target, I've been using a small patch to
>> > >trigger session reset from TMR kthread context in order to simulate the
>> > >I_T disconnects.  Something like that would be useful for verifying with
>> > >tcm_qla2xxx too.
>> > >
>> > >That said, I'll be reproducing with tcm_qla2xxx ports this week, and
>> > >will enable various debug in a WIP branch for testing.
>> 
>> Following up here..
>> 
>> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP and
>> v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d) has
>> been functioning as expected with a blocksize_range=4k-256k + iodepth=32
>> fio write-verify style workload.
>> 
>> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs from
>> outstanding target TAS responses, nor fio write-verify failures to
>> report after 800x remote-port active I/O LUN_RESETS.
>> 
>> Next step will be to verify explicit tcm_qla2xxx port + module shutdown
>> after 1K test iterations, and then IBLOCK async completions <-> NVMe
>> backends with the same case.
>> 
>
>After letting this test run over-night up to 7k active I/O remote-port
>LUN_RESETs, things are still functioning as expected.
>
>Also, /etc/init.d/target stop was able to successfully shutdown all
>active sessions and unload tcm_qla2xxx after the test run.
>
>So AFAICT, the active I/O remote-port LUN_RESET changes are stable with
>tcm_qla2xxx ports, separate from concurrent session disconnect hung task
>you reported earlier.
>
>That said, I'll likely push this series as-is for -rc4, given that Dan
>has also been able to verify the non-concurrent session disconnect case
>on his setup generating constant ABORT_TASKs, and it's still surviving
>both cases for iscsi-target ports.
>
>Please give the debug patch from last night a shot, and see if we can
>determine the se_cmd states when you hit the hung task.

I'll give your latest debug patch a try in a little while.

From the testing that I have done, what is seen is that active I/O has
already completed, and the qla2xxx driver is waiting indefinitely on
cmd_wait_comp for commands to be completed. So it looks like there is a
missing complete call from target_core. I've attached our analysis from
crash debug on a live system after the issue happens.

I can recreate this issue at will within 5 minutes of triggering sg_reset
with the following steps:

1. Export 4 RAM disk LUNs on each port of a 2-port adapter. The initiator
   will see 8 RAM disk targets.
2. Start I/O with a 4K block size and 8 threads, with 80% write / 20% read
   and 100% random. (I am using vdbench for generating I/O. I can provide
   the setup/config script if needed.)
3. Start sg_reset for each LUN, first device, then bus, then host, with a
   120s delay. (I've attached the script that I am using for triggering
   sg_reset.)
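
For anyone reading this without the attachment, the reset loop in step 3
can be sketched roughly as below. This is not the attached script, only an
illustration. The function names, the per-LUN ordering (device, then bus,
then host, sleeping after each reset), and the iteration count are
assumptions. The long option names are the ones from sg3_utils' sg_reset;
older builds may only accept short options.

```python
import subprocess
import time
from typing import Callable, Iterator, List, Sequence

# Reset levels in the order given in step 3: device, then bus, then host.
# Assumption: a recent sg3_utils sg_reset that accepts these long options.
RESET_LEVELS = ("--device", "--bus", "--host")


def reset_commands(devices: Sequence[str],
                   levels: Sequence[str] = RESET_LEVELS) -> Iterator[List[str]]:
    """Yield one sg_reset command line per (device, level) pair."""
    for dev in devices:
        for level in levels:
            yield ["sg_reset", level, dev]


def run_reset_loop(devices: Sequence[str],
                   delay_s: int = 120,
                   iterations: int = 1,
                   run: Callable[[List[str]], object] = subprocess.call,
                   sleep: Callable[[float], object] = time.sleep) -> int:
    """Issue every reset, sleeping delay_s seconds after each one.

    `run` and `sleep` are injectable so the loop can be dry-run without
    touching real devices. Returns the number of resets issued.
    """
    devices = list(devices)
    issued = 0
    for _ in range(iterations):
        for cmd in reset_commands(devices):
            run(cmd)        # the exit status is ignored in this sketch
            sleep(delay_s)
            issued += 1
    return issued
```

On the initiator this would be called with the sg nodes of the 8 exported
LUNs while the I/O from step 2 is running, for example
run_reset_loop(["/dev/sg1", "/dev/sg2"], iterations=10), where the device
paths are placeholders for whatever the initiator enumerates.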

>
>Thank you,
>
>-nab
>

<<attachment: winmail.dat>>