Hi All,
When using LIO iSER over RoCE, we see a significant drop in 8K read IOPS
compared to running the same workload locally, and the size of the drop
depends on the backend storage.
If using a ramdisk backend (a loop device created on top of a 20G tmpfs
RAM filesystem, or ramdisk_mcp, which yields more or less the same
performance), we get 2.5x fewer IOPS than when running "fio" locally.
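For reference, the loop device is set up more or less like this (a rough
sketch; the mount point and file name below are placeholders, only the
20G size is taken from our setup):

  mount -t tmpfs -o size=20g tmpfs /mnt/ramdisk        # 20G RAM filesystem
  dd if=/dev/zero of=/mnt/ramdisk/disk.img bs=1M count=20480
  losetup /dev/loop0 /mnt/ramdisk/disk.img             # /dev/loop0 is the backend device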
If using a "real" block backend (MD or LV interleaved (RAID 0 or
interleaved volume) built ontop of six Crucial M50 1TB SSDs) we get 3.4x
less IOPS than when running "fio" locally.
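The MD device is created along these lines (again a sketch; the member
device names and chunk size here are illustrative, not copied from our
actual config):

  mdadm --create /dev/md_d1 --level=0 --raid-devices=6 --chunk=64 \
        /dev/sd[b-g]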
While we expected a small performance degradation between "local" IOs
and iSER ones, we did not expect a gap of 2.5x to 3.4x fewer IOPS.
Is this expected? It's hard to find proper, unbiased benchmarks that
compare local IOPS with iSER IOPS. We don't see this issue when running
large sequential IOs, where our local bandwidth is equivalent to our
remote one. We were wondering if there is anything obvious we might have
overlooked in our configuration. Any idea would be greatly appreciated.
The system configuration is as follows:
Target node (Running LIO):
* "Homemade" buildroot based distribution, Linux 3.10.35 x86_64 (SMP),
stock Infiniband drivers (*NO* OFED drivers).
* Running on a Xeon E5-2695v2 (2.40Ghz, 12 physical cores, 24 logical
cores). HT is enabled (we therefore have 24 logical cores showing up in
"top"), with 64GiB of RAM and a ConnectX-3 Pro 40Gb converged card
configured as RoCE.
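The LUN is exported roughly as follows (reconstructed from memory using
targetcli; the IQN, backstore name and portal IP are placeholders, and
depending on the targetcli version the iSER toggle may be spelled
slightly differently):

  targetcli /backstores/iblock create name=ram0 dev=/dev/loop0
  targetcli /iscsi create iqn.2014-11.com.example:target0
  targetcli /iscsi/iqn.2014-11.com.example:target0/tpg1/luns create /backstores/iblock/ram0
  targetcli /iscsi/iqn.2014-11.com.example:target0/tpg1/portals create 10.0.0.1 3260
  # switch the portal from plain iSCSI/TCP to iSER
  targetcli /iscsi/iqn.2014-11.com.example:target0/tpg1/portals/10.0.0.1:3260 enable_iser true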
Initiator node:
* CentOS 6.5, running a "stock" upstream 3.10.59 x86_64 (SMP) kernel
with the default config from "make menuconfig", again with stock
in-kernel Infiniband drivers (*NO* OFED drivers).
* Running on a Xeon E3-1241 v3 (3.5GHz, 4 physical cores, 8 logical
cores; HT is enabled, so 8 cores show up in "top"), with 16GiB of RAM
and a ConnectX-3 Pro 40Gb converged card configured as RoCE.
Both cards are directly connected.
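On the initiator side the session is brought up over iSER roughly like
this (a sketch; the IP and IQN are the same placeholders as above):

  iscsiadm -m discovery -t sendtargets -p 10.0.0.1
  iscsiadm -m node -T iqn.2014-11.com.example:target0 -p 10.0.0.1 \
           --op update -n iface.transport_name -v iser
  iscsiadm -m node -T iqn.2014-11.com.example:target0 -p 10.0.0.1 --login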
Here are the "fio" tests and their respective results.
NOTE: The same "fio" command is used on either the target (locally) or
the initiator (over iSER).
fio --filename=/dev/<device> --direct=1 --rw=randrw --ioengine=libaio \
    --bs=8k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 \
    --group_reporting --name=test1
/dev/loop0 (tmpfs ramdisk), local: 341k IOPS
/dev/loop0 (tmpfs ramdisk), remote (iSER): 186k IOPS
/dev/md_d1 (6x 1TB Crucial M50, RAID 0), local: 210k IOPS
/dev/md_d1 (6x 1TB Crucial M50, RAID 0), remote (iSER): 71.2k IOPS
When running "fio" over iSER, CPU usage is about 65% of one core in
"kworker" plus about 15% of that core in hardware interrupts, with
roughly 15-20% of it idle.
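For reference, the per-core breakdown can be seen with standard tools
along these lines (a sketch; the "mlx4" match assumes the stock
ConnectX-3 driver naming):

  mpstat -P ALL 1               # per-core %usr/%sys/%irq/%soft/%idle while fio runs
  grep mlx4 /proc/interrupts    # how the HCA completion vectors are spread over cores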
So we know we can reach high IOPS on the backend storage directly, but
somehow we're unable to get close to that when running over iSER,
whether the backend storage is real disks or a ramdisk. Also, the
bottleneck is clearly not the iSER link itself, at least for the RAID
test, since we get over twice as many IOPS when running on the ramdisk
backstore. The issue here is the difference between local IOPS and iSER
IOPS.
Thanks a lot in advance for your help!
Regards,
Ben - MPSTOR.