Hi Sagi,
Thank you very much for your prompt reply.
On 06/11/14 17:54, Sagi Grimberg wrote:
On 11/6/2014 6:30 PM, Benjamin ESTRABAUD wrote:
Hi All,
When using LIO iSER over RoCE, we see variations in 8K read IOPS
performance depending on the backend storage.
If using a ramdisk backend storage (loop device created atop a 20G tmpfs
RAM filesystem, or ramdisk_mcp which yields more or less the same
performance) we get 2.5x less IOPS than when running "fio" locally.
That doesn't sound right...
If using a "real" block backend (an MD RAID 0 array or an interleaved LV,
built on top of six Crucial M50 1TB SSDs) we get 3.4x
less IOPS than when running "fio" locally.
While we expected a small performance degradation between "local" IOs
and iSER ones, we did not expect to see a gap of 2.5x or 3.5x less IOPS.
That's also strange...
Is this expected? It's hard to find proper unbiased benchmarks that
compare local IOPS vs iSER IOPS. We don't get that issue when running
large nice sequential IOs, where our local bandwidth is equivalent to
our remote one. We were wondering if there were anything obvious we
might have overlooked in our configuration. Any idea would be greatly
appreciated.
I would like to know your initiator block layer settings such as:
- scheduler
- nomerges
- rq_affinity
- add_random
Here are the settings on the iSER initiator:
cat /sys/block/sdh/queue/scheduler
noop deadline [cfq]
cat /sys/block/sdh/queue/nomerges
0
cat /sys/block/sdh/queue/rq_affinity
1
cat /sys/block/sdh/queue/add_random
1
(We left them all as "default")
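
For anyone wanting to collect the same four block-layer settings in one go, here is a minimal Python sketch. It assumes the standard sysfs layout (/sys/block/<device>/queue/...); the helper names active_scheduler and read_queue_settings are made up for illustration, and "sdh" is simply the device used above.

```python
from pathlib import Path

# The four queue attributes asked about in this thread.
QUEUE_SETTINGS = ("scheduler", "nomerges", "rq_affinity", "add_random")


def active_scheduler(raw: str) -> str:
    """Return the scheduler shown in [brackets], e.g. 'noop deadline [cfq]' -> 'cfq'."""
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Fallback: no bracketed entry, return the raw text unchanged.
    return raw.strip()


def read_queue_settings(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Read the queue attributes of one block device; missing files are skipped."""
    settings = {}
    queue_dir = Path(sysfs_root) / device / "queue"
    for name in QUEUE_SETTINGS:
        path = queue_dir / name
        if not path.exists():
            continue
        raw = path.read_text().strip()
        settings[name] = active_scheduler(raw) if name == "scheduler" else raw
    return settings


if __name__ == "__main__":
    # Hypothetical usage on the initiator; prints {} if the device does not exist.
    print(read_queue_settings("sdh"))
```

Running it on both the initiator and the target makes it easier to compare the two sides before and after changing a setting.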
Also, I would like to understand your IRQ affinity placement on both
stations.
Is this what you are looking for?
On initiator side:
show_irq_affinity.sh rename5
63: ff
64: ff
65: ff
66: ff
67: ff
68: ff
69: ff
70: ff
On target side:
cat /proc/irq/168/smp_affinity
ffffff
cat /proc/irq/169/smp_affinity
ffffff
cat /proc/irq/170/smp_affinity
ffffff
cat /proc/irq/171/smp_affinity
ffffff
cat /proc/irq/172/smp_affinity
ffffff
cat /proc/irq/173/smp_affinity
ffffff
cat /proc/irq/174/smp_affinity
ffffff
cat /proc/irq/175/smp_affinity
ffffff
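
The affinity masks above are hexadecimal CPU bitmaps. A small Python sketch for decoding them follows; mask_to_cpus is a hypothetical helper name, and the handling of comma-separated groups assumes the usual kernel format for masks wider than 32 bits.

```python
def mask_to_cpus(mask: str) -> list:
    """Decode an smp_affinity hex mask into the list of CPU indices it allows."""
    # Wide masks are printed as comma-separated 32-bit groups; drop the commas.
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


if __name__ == "__main__":
    print(mask_to_cpus("ff"))       # initiator mask from this thread
    print(mask_to_cpus("ffffff"))   # target mask from this thread
```

Decoded this way, "ff" covers CPUs 0-7 and "ffffff" covers CPUs 0-23, which matches the 8 and 24 logical cores described below, so none of the listed IRQs is pinned to a specific core.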
Is it a single device? single session?
We're running IOs over a single device connected on a single session,
established over a single port from a single ConnectX-3 Dual port card.
Both ports are directly connected using a QSFP cable and the link is
established over RoCE at 40Gb/s.
The system configuration is as follows:
Target node (Running LIO):
* "Homemade" buildroot based distribution, Linux 3.10.35 x86_64 (SMP),
stock Infiniband drivers (*NO* OFED drivers).
* Running on a Xeon E5-2695v2 (2.40GHz, 12 physical cores, 24 logical
cores). HT is enabled (we therefore have 24 logical cores showing up in
"top"), with 64GiB of RAM and a ConnectX-3 Pro 40Gb converged card
configured as RoCE.
Initiator node:
* CentOS 6.5, running a "stock" upstream 3.10.59 x86_64 (SMP) kernel
with default config from "make menuconfig". Again using stock Infiniband
drivers (*NO* OFED drivers).
* Running on a Xeon E3-1241v3 (3.5GHz, 4 physical cores, 8 logical
cores). HT is enabled (8 cores show up in top), with 16GiB of RAM and a
ConnectX-3 Pro 40Gb converged card configured as RoCE.
Both cards are directly connected.
Here are the "fio" tests and their respective results.
NOTE: The same "fio" command is used on either the target (locally) or
the initiator (over iSER).
fio --filename=/dev/<device> --direct=1 --rw=randrw --ioengine=libaio \
    --bs=8k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 \
    --group_reporting --name=test1
/dev/loop0 (tmpfs ramdisk), local: 341k io/s
/dev/loop0 (tmpfs ramdisk), remote (iSER): 186k io/s
I get on ramdisk(_mcp) over iser the following results:
numjobs=16,iodepth=16: 254K IOPs
numjobs=16,iodepth=128: 297K IOPs
We're not necessarily concerned about the actual IOPS we get using a
ramdisk over iSER, but more the gap between the "local" ramdisk IOPS
perf and the iSER one. In your benchmarks, is the "local" IOPS
performance usually close to the iSER one (+/- 5-10%)?
And my system is not significantly different from yours:
Systems connected b2b - CX3 (VPI) single 40GE link
Both systems: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (8 cores but only
1 is active on the target and 4 are active at the initiator)
We were also a bit concerned that having lots of "slower" cores would hinder
performance, as the CPU usage, while not reaching 100% of one core,
is quite high. Other than the slightly higher individual core speed on
the target, your system is pretty similar.
Target OS: RHEL7.0
Initiator OS: RH6.4 (iser-1.5 package - same as upstream)
Rest of fio settings:
direct=1
rw=randread
bs=8k
runtime=60
group_reporting
name=test1
ioengine=libaio
time_based
loops=1
fsync_on_close=1
randrepeat=1
norandommap
exitall
/dev/md_d1 (6*1TB Crucial M50 RAID0), local: 210k io/s
/dev/md_d1 (6*1TB Crucial M50 RAID0), remote (iSER): 71.2k io/s
CPU usage when running "fio" over iSER is about 65% of one core
running "kworker" and 15% of that core in "hardware interrupt" with
about 15-20% idle.
Haven't tried that - but I don't think you should see this gap...
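
As a quick cross-check of the gap, the local-to-iSER ratio can be computed directly from the four fio results listed in this message. This is only arithmetic on the posted numbers (in thousands of IOPS); the dictionary keys and the slowdown helper are illustrative names.

```python
# (local kIOPS, iSER kIOPS) as posted in this thread.
RESULTS_KIOPS = {
    "/dev/loop0 (tmpfs ramdisk)": (341.0, 186.0),
    "/dev/md_d1 (6x M50 RAID0)": (210.0, 71.2),
}


def slowdown(local_kiops: float, remote_kiops: float) -> float:
    """How many times more IOPS the local run achieves than the iSER run."""
    return local_kiops / remote_kiops


if __name__ == "__main__":
    for name, (local, remote) in RESULTS_KIOPS.items():
        print(f"{name}: local/iSER = {slowdown(local, remote):.2f}x")
```

With these particular figures the ratios come out at roughly 1.8x for the ramdisk and 2.9x for the RAID, somewhat lower than the 2.5x and 3.4x mentioned earlier, so the earlier figures presumably come from different runs.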
So here we know we can reach high IOPS on the backend storage directly,
but somehow we're unable to get close when running over iSER, whether
the backend storage is real disks or a memdisk. Also, the bottleneck is
clearly not the iSER link,
Definitely not... the Link can carry way more than that...
Good to know, we also wanted to make sure we weren't going to hit a
dead end performance-wise.
at least for the test on the RAID since we
get over twice as many IOPS when running on a ramdisk backstore. The
issue here is the difference between local IOPS and iSER IOPS.
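
The point that the link is not the bottleneck can be checked with a rough payload calculation. The sketch below assumes that bs=8k means 8192 bytes (fio's default 1024-byte "k"), counts data payload only (no iSCSI/RDMA/Ethernet overhead), and uses the nominal 40Gb/s link rate; payload_gbps is an illustrative name.

```python
BLOCK_SIZE_BYTES = 8 * 1024   # fio bs=8k, assuming the default 1024-byte "k"
LINK_GBPS = 40.0              # nominal link rate quoted in this thread


def payload_gbps(iops: float, block_size: int = BLOCK_SIZE_BYTES) -> float:
    """Data payload rate in Gb/s for a given IOPS figure (protocol overhead ignored)."""
    return iops * block_size * 8 / 1e9


if __name__ == "__main__":
    for label, iops in (("ramdisk over iSER", 186_000), ("RAID0 over iSER", 71_200)):
        gbps = payload_gbps(iops)
        print(f"{label}: {gbps:.2f} Gb/s payload, {100 * gbps / LINK_GBPS:.0f}% of the link")
```

Under those assumptions the ramdisk run moves about 12.2 Gb/s and the RAID run about 4.7 Gb/s, both well under the nominal link rate.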
I strongly recommend that you check out the Mellanox community's
iSER/LIO/RDMA/PERF related posts at:
http://community.mellanox.com/content?filterID=all~objecttype~objecttype%5Bdocument%5D&query=Iser
Will do that now. Thanks a lot.
Cheers,
Sagi.
Regards,
Ben - MPSTOR.