Re: Trying to understand a weird timeout when using fio and multipath IO

Hi Todd,

What path policy have you set in Windows, and what does the Windows
event log have to say when things stall? Can you reproduce the problem
if you restrict readwrite to only writes or only reads? Does halving
the number of disks still show the problem? Does reducing the
blocksize make it less likely to happen? If you run two fio binaries
with the same job, do they both pause at the same time?
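
For the read-only / write-only and cut-down runs I'm thinking of a
variant of your job along these lines (untested sketch -- the drive
names are just the first two from your list and the reduced blocksize
is only an example value):

[global]
ioengine=windowsaio
numjobs=1
iodepth=1
direct=1
thread

[fio-readonly]
; drop to e.g. blocksize=64k for the smaller-block run
blocksize=256k
; switch to readwrite=write for the write-only run
readwrite=read
filename=\\.\PHYSICALDRIVE19
filename=\\.\PHYSICALDRIVE20
; keep only half of the original 48 drives listed here
size=100%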

On 3 November 2015 at 21:22, Todd Lawall <tlawall@xxxxxxxxxxxxxxxxx> wrote:
> Hello,
>
> I'm trying to understand a behavior and hoping to further my
> understanding of what fio is doing.  In the specific case in question,
> I'm seeing a seven minute wait before IO resumes after a failover.  In
> other variations on this job file, the seven minute wait disappears and
> it drops back down to the 40 second wait that I see with the usual IO
> loads.
>
> The setup:
> - I have one Windows 2012 R2 host, with two NICs.
> - I have one storage array with two controllers, A and B, each with two
>    10GbE ports (four ports in total) and failover capability between the
>    two sides.
> - I have iSCSI and MPIO set up so that there is one login from each NIC
>    to each side, four sessions in total for each volume.  The map looks
>    something like this:
>
>            nic1                 nic2
>            /  \                 /  \
>           /    \               /    \
>      side A   side B       side A  side B
>      port 0   port 0       port 1  port 1
>
> - I have the fio job below.  It is basically 256k blocks, an iodepth of 1,
>    and one worker with 48 drives.
>
> [global]
> do_verify=0
> ioengine=windowsaio
> numjobs=1
> iodepth=1
> offset=0
> direct=1
> thread
>
> [fio-0]
> blocksize=256k
> readwrite=rw
> filename=\\.\PHYSICALDRIVE19
> filename=\\.\PHYSICALDRIVE20
> <snipped out the other 44 drives>
> filename=\\.\PHYSICALDRIVE13
> filename=\\.\PHYSICALDRIVE14
> size=100%
>
> If I alter the job in any of the following ways, IO keeps going after the
> failover period, which is about 40 seconds.  To summarize:
>
> Doesn't work:
>  - Multiple disks, single job, iodepth of 1
>
> Works:
>  - Single disk, one job, iodepth of 1
>  - Multiple disks, one job with all disks, iodepth equal to the number of
>    disks (e.g. if there are 48 disks, iodepth is set to 48)
>  - Multiple disks, one job per disk, iodepth of 1
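
For reference, the "one job per disk" variant would presumably look
something like the sketch below (shared settings staying in [global] as
in the original job; the drive names are just the first two from the
list and the section names are only placeholders):

[fio-drive19]
blocksize=256k
readwrite=rw
filename=\\.\PHYSICALDRIVE19
size=100%

[fio-drive20]
blocksize=256k
readwrite=rw
filename=\\.\PHYSICALDRIVE20
size=100%
; ...one section like these for each remaining drive...

The "iodepth equal to the number of disks" variant would instead keep the
single [fio-0] section with all 48 filename lines and just set iodepth=48.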
>
> Would anyone have any idea why that one arrangement causes a
> significant delay before IO is resumed?
>
> Thanks in advance,
> Todd



-- 
Sitsofe | http://sucs.org/~sits/


