Thanks a lot, Robert, for the explanation. I understand what you are saying, and I'm also excited to see more work on IB with Ceph to get those performance numbers up, and hopefully soon to see accelio working in production.

Regarding the HP IB switch, we have 4 ports (uplinks) connected to our IB switch, and internally the blades are connected through the backplane to two ports, so they use all of the ports inside the enclosure switch (16 ports). The bonding that I've configured is active/backup; I didn't know that active/active was possible with IPoIB.

Also, the adapters in the Ceph nodes (Supermicro servers) are Mellanox Technologies MT27500 Family [ConnectX-3]. I also double-checked the port type configuration on the IB switch: its speed rate is 14.0 Gbps, the supported MTU is 4096, and the current line rate is 56.0 Gbps.

I've tried almost all possible combinations and I can't get above 1.8 GB/s, so I was wondering whether this is the top speed for this kind of setup.
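
As a rough sanity check on the figures above, here is a minimal sketch of the arithmetic. It assumes 4 lanes per port, 64b/66b line encoding, and that all enclosure switch ports run at the same speed; none of that is confirmed in this thread, and the helper name is only for illustration.

```python
# Back-of-the-envelope check of the port figures quoted above.
# Assumptions (not confirmed in this thread): 4 lanes per port,
# 64b/66b line encoding, all enclosure switch ports at the same speed.

def payload_gbps(lane_gbps, lanes=4, encoding=64 / 66):
    """Usable data rate in Gb/s after line-encoding overhead."""
    return lane_gbps * lanes * encoding

LANE_GBPS = 14.0        # per-lane speed reported by the switch
MEASURED_GBYTES = 1.8   # best throughput seen so far, in GB/s
INTERNAL_PORTS = 16     # blade-facing ports in the enclosure switch
UPLINK_PORTS = 4        # ports cabled to the upstream IB switch

line_rate = LANE_GBPS * 4                      # raw signalling rate, Gb/s
ceiling_gbytes = payload_gbps(LANE_GBPS) / 8   # Gb/s -> GB/s
utilization = MEASURED_GBYTES / ceiling_gbytes
oversubscription = INTERNAL_PORTS / UPLINK_PORTS

print(f"line rate:        {line_rate:.1f} Gb/s")
print(f"payload ceiling:  {ceiling_gbytes:.2f} GB/s per port")
print(f"measured share:   {utilization:.0%} of that ceiling")
print(f"oversubscription: {oversubscription:.0f}:1 leaving the enclosure")
```

This only bounds the raw link; IPoIB, TCP and Ceph overhead all sit below that ceiling, so it says nothing about what is realistically achievable.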
Best,
German
2015-11-24 14:11 GMT-03:00 Robert LeBlanc <robert@xxxxxxxxxxxxx>:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
I've had wildly different iperf results based on the version of the
kernel, OFED and whether you are using datagram or connected mode as
well as the MTU. You really have to just try all the different options
to figure out what works the best.
Please also remember that you will not get iSER performance out of
Ceph at the moment (probably never), but the work being done will
help. Even if you get the network transport optimally tuned, unless
you have a massive Ceph cluster, you won't get the performance out
of the SSDs. I'm just as excited about Ceph on Infiniband, but I've
had to just chill out and let the devs do their work.
I've never had good experiences with active/active bonding on IPoIB.
For two blades in the same chassis, you should get non-blocking line
rate. For going out of the chassis, you will be limited by the number
of ports you connect to the upstream switch. That is why there is
usually the same number of uplink ports as there are blades, so that
you can do non-blocking. However, HP has been selling switches with
only half the uplinks, making your oversubscription 2:1; it really
depends on what you actually need. Between QDR and FDR, you should
get QDR speed. Also be sure it is full FDR and not FDR-10, which is the
same signal rate as QDR but with the new 64/66 encoding. It won't give
you as much speed improvement as FDR, and it can be difficult to tell
which your adapter has if you don't research it. We thought we bought
FDR cards only to find out later they were FDR-10.
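
To make the QDR / FDR-10 / FDR difference concrete, here is a small sketch of the per-port data rates. The per-lane signalling rates and encodings are the nominal published figures as I understand them, not something measured in this thread, and treating a mixed link as simply "the slower of the two" is a simplification.

```python
# Nominal 4x InfiniBand data rates after line encoding.
# Per-lane rates and encodings are nominal figures (assumption, not
# taken from this thread): (lane Gb/s, payload bits, total bits).
LINKS = {
    "QDR":    (10.0,    8, 10),   # 8b/10b encoding
    "FDR-10": (10.3125, 64, 66),  # 64b/66b encoding
    "FDR":    (14.0625, 64, 66),  # 64b/66b encoding
}

def data_rate_gbps(name, lanes=4):
    """Usable data rate in Gb/s for one port of the given link type."""
    lane_gbps, payload_bits, total_bits = LINKS[name]
    return lane_gbps * lanes * payload_bits / total_bits

def negotiated(a, b):
    """Simplified model: a link between two port types runs at the slower one."""
    return min(a, b, key=data_rate_gbps)

for name in LINKS:
    gbps = data_rate_gbps(name)
    print(f"{name:7s} {gbps:6.2f} Gb/s  {gbps / 8:5.2f} GB/s")

print("QDR <-> FDR link runs as:", negotiated("QDR", "FDR"))
```

On these numbers FDR-10 gains over QDR mostly from the cheaper encoding rather than from a faster signal, which is why it lands well short of full FDR.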
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com
wsFcBAEBCAAQBQJWVJpCCRDmVDuy+mK58QAAEX4P/jFvdBzNob2xdftEkD2K
rSB5i/Idmi7BAe1/JUzMF/t7l7zFXEpq96oLbt5NMbreOhCe6MitEApfhpWq
dmt3IZYyUYVvXCxNGE/U7L58wi9DGPKJTWsigKScFtqjcQkIOlCh2VAHCmnE
/WZBtlMnBsoibqq+zZsM4GEBwvPCwUwpGDKU13DhpuvmiN09jICEHH05wZzq
ig/Ia309ioAZJ8PEKZ61kHUxAzTIMhwe1LV2jtlGQcJB4jMq7TQzOyizq0mQ
7DJTNNkMVpB9IEBCuOzzs/ByjKz+Tu31Jw2Y8R9MjtoDpOo+WQzzn6W4+NS0
jG0cFiumIBKVwoMJyXpQeS6UC0w7balHaXy+8F4SUa+J/9X5w4bH9MmlJBfh
p81YDtNs7mQYKsuDOkjNe0BkthhHbdQThHn4A75j8Hqaltwr28UqL83ywCUJ
SqTGkhRLyU9O74snPfG+T7hM4fIVpH7DS4ebmK7yvSVzwwuExPgwWhjvAsmt
DRnXv0qd8UAIgza0VYTyZuElUC4V39wMe503tXo5By+NGKWzVNOWR1X0+46i
Xq2zvZQzc9MPtGHMmnm1dkJ+d6imfLzTf099njZ+Wl1xbagnQiKbiwKL8T/k
d3OClf514rV4i7FtwOoB8NQcUMUjaeZGmPVDhmVt7fRYz/+rARkN/jwXH4qG
x/Dk
=/88f
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Tue, Nov 24, 2015 at 8:24 AM, German Anders <ganders@xxxxxxxxxxxx> wrote:
> Another test, made between two HP blades with QDR (with bonding):
>
> e60-host01# iperf -s
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 85.3 KByte (default)
> ------------------------------------------------------------
> [ 5] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41807
> [ 4] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41806
> [ 6] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41808
> [ 7] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41809
> [ ID] Interval Transfer Bandwidth
> [ 5] 0.0-10.0 sec 2.64 GBytes 2.27 Gbits/sec
> [ 4] 0.0-10.0 sec 2.64 GBytes 2.27 Gbits/sec
> [ 6] 0.0-10.0 sec 3.58 GBytes 3.08 Gbits/sec
> [ 7] 0.0-10.0 sec 3.57 GBytes 3.07 Gbits/sec
> [SUM] 0.0-10.0 sec 12.4 GBytes 10.7 Gbits/sec
>
> e60-host02# iperf -c 172.23.18.2 -P 4
>
> ------------------------------------------------------------
> Client connecting to 172.23.18.2, TCP port 5001
> TCP window size: 2.50 MByte (default)
> ------------------------------------------------------------
> [ 3] local 172.23.18.1 port 41806 connected with 172.23.18.2 port 5001
> [ 5] local 172.23.18.1 port 41808 connected with 172.23.18.2 port 5001
> [ 4] local 172.23.18.1 port 41807 connected with 172.23.18.2 port 5001
> [ 6] local 172.23.18.1 port 41809 connected with 172.23.18.2 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 2.64 GBytes 2.27 Gbits/sec
> [ 5] 0.0-10.0 sec 3.58 GBytes 3.08 Gbits/sec
> [ 4] 0.0-10.0 sec 2.64 GBytes 2.27 Gbits/sec
> [ 6] 0.0-10.0 sec 3.57 GBytes 3.07 Gbits/sec
> [SUM] 0.0-10.0 sec 12.4 GBytes 10.7 Gbits/sec
>
> Note also that the blades are in the same enclosure.
>
> bonding configuration:
>
> alias bond-ib bonding options bonding mode=1 miimon=100 downdelay=100
> updelay=100 max_bonds=2
>
> ## INFINIBAND CONF
>
> auto ib0
> iface ib0 inet manual
> bond-master bond-ib
>
> auto ib1
> iface ib1 inet manual
> bond-master bond-ib
>
> auto bond-ib
> iface bond-ib inet static
> address 172.23.xx.xx
> netmask 255.255.xx.xx
> slaves ib0 ib1
> bond_miimon 100
> bond_mode active-backup
> pre-up echo connected > /sys/class/net/ib0/mode
> pre-up echo connected > /sys/class/net/ib1/mode
> pre-up /sbin/ifconfig ib0 mtu 65520
> pre-up /sbin/ifconfig ib1 mtu 65520
> pre-up modprobe bond-ib
> pre-up /sbin/ifconfig bond-ib mtu 65520
>
>
> German
>
> 2015-11-24 11:51 GMT-03:00 Mark Nelson <mnelson@xxxxxxxxxx>:
>>
>> Each port should be able to do 40Gb/s or 56Gb/s minus overhead and any
>> PCIe or card-related bottlenecks. IPoIB will further limit that, especially
>> if you haven't done any kind of interrupt affinity tuning.
>>
>> Assuming these are mellanox cards you'll want to read this guide:
>>
>>
>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>
>> For QDR I think the maximum throughput with IPoIB I've ever seen was about
>> 2.7GB/s for a single port. Typically 2-2.5GB/s is probably about what you
>> should expect for a well tuned setup.
>>
>> I'd still suggest doing iperf tests. It's really easy:
>>
>> "iperf -s" on one node to act as a server.
>>
>> "iperf -c <server ip> -P <num connections, ie: 4>" on the client
>>
>> This will give you an idea of how your network is doing. All-To-All
>> network tests are also useful, in that sometimes network issues can crop up
>> only when there's lots of traffic across many ports. We've seen this in lab
>> environments, especially with bonded ethernet.
>>
>> Mark
>>
>> On 11/24/2015 07:22 AM, German Anders wrote:
>>>
>>> After doing some more in-depth research and tuning some parameters, I've
>>> gained a little more performance:
>>>
>>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>>> --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
>>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>>> --group_reporting --exitall --name
>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>>> --filename=/mnt/e60host01vol1/test1
>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>>> ...
>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>>> fio-2.1.3
>>> Starting 4 processes
>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO file(s)
>>> (1 file(s) / 16384MB)
>>> Jobs: 4 (f=4): [rrrr] [60.5% done] [1714MB/0KB/0KB /s] [1713/0/0 iops]
>>> [eta 00m:15s]
>>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
>>> err= 0: pid=54857: Tue Nov 24 07:56:30 2015
>>> read : io=38699MB, bw=1754.2MB/s, iops=1754, runt= 22062msec
>>> slat (usec): min=131, max=63426, avg=2249.87, stdev=4320.91
>>> clat (msec): min=2, max=321, avg=70.56, stdev=35.80
>>> lat (msec): min=2, max=321, avg=72.81, stdev=36.13
>>> clat percentiles (msec):
>>> | 1.00th=[ 13], 5.00th=[ 24], 10.00th=[ 30], 20.00th=[
>>> 40],
>>> | 30.00th=[ 50], 40.00th=[ 57], 50.00th=[ 65], 60.00th=[
>>> 75],
>>> | 70.00th=[ 85], 80.00th=[ 98], 90.00th=[ 120], 95.00th=[
>>> 139],
>>> | 99.00th=[ 178], 99.50th=[ 194], 99.90th=[ 229], 99.95th=[
>>> 247],
>>> | 99.99th=[ 273]
>>> bw (KB /s): min=301056, max=612352, per=25.01%, avg=449291.87,
>>> stdev=54288.85
>>> lat (msec) : 4=0.11%, 10=0.61%, 20=2.11%, 50=27.87%, 100=50.92%
>>> lat (msec) : 250=18.34%, 500=0.03%
>>> cpu : usr=0.19%, sys=33.60%, ctx=66708, majf=0, minf=636
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%,
>>> >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>> >=64=0.0%
>>> issued : total=r=38699/w=0/d=0, short=r=0/w=0/d=0
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=38699MB, aggrb=1754.2MB/s, minb=1754.2MB/s,
>>> maxb=1754.2MB/s, mint=22062msec, maxt=22062msec
>>>
>>> Disk stats (read/write):
>>> rbd1: ios=77386/17, merge=0/122, ticks=3168312/500, in_queue=3170168,
>>> util=99.76%
>>>
>>> The thing is that this test was run from an 'HP Blade enclosure with
>>> QDR', so I think that if the max throughput in QDR is around 3.2 GB/s (I
>>> guess that this number must be divided by the total number of ports, in
>>> this case 2), then 1.6 GB/s is the maximum throughput that I'll
>>> get on a single port. Is that correct? I also ran another test on
>>> another host that has FDR (so max throughput would be around 6.8
>>> GB/s), and if the same theory is valid, that would lead me to 3.4 GB/s
>>> per port, but I'm not getting more than 1.4 - 1.6 GB/s. Any ideas? The
>>> tuning is the same on both servers.
>>>
>>> Basically I changed the scaling_governor of the cpufreq of all cpus to
>>> 'performance' and then set the following values:
>>>
>>> sysctl -w net.ipv4.tcp_timestamps=0
>>> sysctl -w net.core.netdev_max_backlog=250000
>>> sysctl -w net.core.rmem_max=4194304
>>> sysctl -w net.core.wmem_max=4194304
>>> sysctl -w net.core.rmem_default=4194304
>>> sysctl -w net.core.wmem_default=4194304
>>> sysctl -w net.core.optmem_max=4194304
>>> sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
>>> sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
>>> sysctl -w net.ipv4.tcp_low_latency=1
>>>
>>>
>>> However, on the HP blade there are no Intel CPUs like in the other server,
>>> so this kind of 'tuning' can't be done; I left it at the default and
>>> only changed the TCP networking part.
>>>
>>> Any comments or hints would be really appreciated.
>>>
>>> Thanks in advance,
>>>
>>> Best,
>>>
>>>
>>> German
>>>
>>> 2015-11-23 15:06 GMT-03:00 Robert LeBlanc <robert@xxxxxxxxxxxxx
>>> <mailto:robert@xxxxxxxxxxxxx>>:
>>>
>>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> Are you using unconnected mode or connected mode? With connected mode
>>> you can up your MTU to 64K which may help on the network side.
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Mon, Nov 23, 2015 at 10:40 AM, German Anders wrote:
>>> > Hi Mark,
>>> >
>>> > Thanks a lot for the quick response. Regarding the numbers that you sent me,
>>> > they look REALLY nice. I have the following setup:
>>> >
>>> > 4 OSD nodes:
>>> >
>>> > 2 x Intel Xeon E5-2650v2 @2.60Ghz
>>> > 1 x Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
>>> > Dual-Port (1 for PUB and 1 for CLUS)
>>> > 1 x SAS2308 PCI-Express Fusion-MPT SAS-2
>>> > 8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on the same
>>> > drive, so 1:1 relationship)
>>> > 3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
>>> > 128GB RAM
>>> >
>>> > [0:0:0:0] disk ATA INTEL SSDSC2BA20 0110 /dev/sdc
>>> > [0:0:1:0] disk ATA INTEL SSDSC2BA20 0110 /dev/sdd
>>> > [0:0:2:0] disk ATA INTEL SSDSC2BA20 0110 /dev/sde
>>> > [0:0:3:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdf
>>> > [0:0:4:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdg
>>> > [0:0:5:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdh
>>> > [0:0:6:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdi
>>> > [0:0:7:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdj
>>> > [0:0:8:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdk
>>> > [0:0:9:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdl
>>> > [0:0:10:0] disk ATA INTEL SSDSC2BB80 0130 /dev/sdm
>>> >
>>> > sdf 8:80 0 745.2G 0 disk
>>> > |-sdf1 8:81 0 740.2G 0 part
>>> > /var/lib/ceph/osd/ceph-16
>>> > `-sdf2 8:82 0 5G 0 part
>>> > sdg 8:96 0 745.2G 0 disk
>>> > |-sdg1 8:97 0 740.2G 0 part
>>> > /var/lib/ceph/osd/ceph-17
>>> > `-sdg2 8:98 0 5G 0 part
>>> > sdh 8:112 0 745.2G 0 disk
>>> > |-sdh1 8:113 0 740.2G 0 part
>>> > /var/lib/ceph/osd/ceph-18
>>> > `-sdh2 8:114 0 5G 0 part
>>> > sdi 8:128 0 745.2G 0 disk
>>> > |-sdi1 8:129 0 740.2G 0 part
>>> > /var/lib/ceph/osd/ceph-19
>>> > `-sdi2 8:130 0 5G 0 part
>>> > sdj 8:144 0 745.2G 0 disk
>>> > |-sdj1 8:145 0 740.2G 0 part
>>> > /var/lib/ceph/osd/ceph-20
>>> > `-sdj2 8:146 0 5G 0 part
>>> > sdk 8:160 0 745.2G 0 disk
>>> > |-sdk1 8:161 0 740.2G 0 part
>>> > /var/lib/ceph/osd/ceph-21
>>> > `-sdk2 8:162 0 5G 0 part
>>> > sdl 8:176 0 745.2G 0 disk
>>> > |-sdl1 8:177 0 740.2G 0 part
>>> > /var/lib/ceph/osd/ceph-22
>>> > `-sdl2 8:178 0 5G 0 part
>>> > sdm 8:192 0 745.2G 0 disk
>>> > |-sdm1 8:193 0 740.2G 0 part
>>> > /var/lib/ceph/osd/ceph-23
>>> > `-sdm2 8:194 0 5G 0 part
>>> >
>>> >
>>> > $ rados bench -p rbd 20 write --no-cleanup -t 4
>>> > Maintaining 4 concurrent writes of 4194304 bytes for up to 20 seconds or 0 objects
>>> > Object prefix: benchmark_data_cibm01_1409
>>> >   sec Cur ops   started  finished  avg MB/s  cur MB/s    last lat    avg lat
>>> >     0       0         0         0         0         0           -          0
>>> >     1       4       121       117   467.894       468   0.0337203  0.0336809
>>> >     2       4       244       240   479.895       492   0.0304306  0.0330524
>>> >     3       4       372       368   490.559       512   0.0361914  0.0323822
>>> >     4       4       491       487   486.899       476   0.0346544  0.0327169
>>> >     5       4       587       583   466.302       384    0.110718  0.0342427
>>> >     6       4       701       697   464.575       456   0.0324953  0.0343136
>>> >     7       4       811       807   461.053       440   0.0400344  0.0345994
>>> >     8       4       923       919   459.412       448   0.0255677  0.0345767
>>> >     9       4      1032      1028   456.803       436   0.0309743  0.0349256
>>> >    10       4      1119      1115   445.917       348    0.229508  0.0357856
>>> >    11       4      1222      1218   442.826       412   0.0277902  0.0360635
>>> >    12       4      1315      1311   436.919       372   0.0303377  0.0365673
>>> >    13       4      1424      1420   436.842       436   0.0288001    0.03659
>>> >    14       4      1524      1520   434.206       400   0.0360993  0.0367697
>>> >    15       4      1632      1628   434.054       432   0.0296406  0.0366877
>>> >    16       4      1740      1736   433.921       432   0.0310995  0.0367746
>>> >    17       4      1836      1832    430.98       384   0.0250518  0.0370169
>>> >    18       4      1941      1937   430.366       420    0.027502  0.0371341
>>> >    19       4      2049      2045   430.448       432   0.0260257  0.0370807
>>> > 2015-11-23 12:10:58.587087 min lat: 0.0229266 max lat: 0.27063 avg lat: 0.0373936
>>> >   sec Cur ops   started  finished  avg MB/s  cur MB/s    last lat    avg lat
>>> >    20       4      2141      2137   427.322       368   0.0351276  0.0373936
>>> > Total time run: 20.186437
>>> > Total writes made: 2141
>>> > Write size: 4194304
>>> > Bandwidth (MB/sec): 424.245
>>> >
>>> > Stddev Bandwidth: 102.136
>>> > Max bandwidth (MB/sec): 512
>>> > Min bandwidth (MB/sec): 0
>>> > Average Latency: 0.0376536
>>> > Stddev Latency: 0.032886
>>> > Max latency: 0.27063
>>> > Min latency: 0.0229266
>>> >
>>> >
>>> > $ rados bench -p rbd 20 seq --no-cleanup -t 4
>>> >   sec Cur ops   started  finished  avg MB/s  cur MB/s    last lat    avg lat
>>> >     0       0         0         0         0         0           -          0
>>> >     1       4       394       390   1559.52      1560   0.0148888  0.0102236
>>> >     2       4       753       749   1496.68      1436   0.0129162  0.0106595
>>> >     3       4      1137      1133   1509.65      1536   0.0101854  0.0105731
>>> >     4       4      1526      1522   1521.17      1556   0.0122154  0.0103827
>>> >     5       4      1890      1886   1508.07      1456  0.00825445  0.0105908
>>> > Total time run: 5.675418
>>> > Total reads made: 2141
>>> > Read size: 4194304
>>> > Bandwidth (MB/sec): 1508.964
>>> >
>>> > Average Latency: 0.0105951
>>> > Max latency: 0.211469
>>> > Min latency: 0.00603694
>>> >
>>> >
>>> > I'm not even close to the numbers that you are getting... :( Any ideas or
>>> > hints? Also, I've configured NOOP as the scheduler for all the SSD disks. I
>>> > don't really know what else to look for in order to improve performance and
>>> > get numbers similar to yours.
>>> >
>>> >
>>> > Thanks in advance,
>>> >
>>> > Cheers,
>>> >
>>> >
>>> > German
>>> >
>>> > 2015-11-23 13:32 GMT-03:00 Mark Nelson:
>>> >>
>>> >> Hi German,
>>> >>
>>> >> I don't have exactly the same setup, but on the ceph community cluster I
>>> >> have tests with:
>>> >>
>>> >> 4 nodes, each of which is configured in some tests with:
>>> >>
>>> >> 2 x Intel Xeon E5-2650
>>> >> 1 x Intel XL710 40GbE (currently limited to about 2.5GB/s each)
>>> >> 1 x Intel P3700 800GB (4 OSDs per card using 4 data and 4 journal
>>> >> partitions)
>>> >> 64GB RAM
>>> >>
>>> >> With filestore, I can get an aggregate throughput of:
>>> >>
>>> >> 1MB randread: 8715.3MB/s
>>> >> 4MB randread: 8046.2MB/s
>>> >>
>>> >> This is with 4 fio instances on the same nodes as the OSDs using
>>> the fio
>>> >> librbd engine.
>>> >>
>>> >> A couple of things I would suggest trying:
>>> >>
>>> >> 1) See how rados bench does. This is an easy test and you can
>>> see how
>>> >> different the numbers look.
>>> >>
>>> >> 2) try fio with librbd to see if it might be a qemu limitation.
>>> >>
>>> >> 3) Assuming you are using IPoIB, try some iperf tests to see how
>>> your
>>> >> network is doing.
>>> >>
>>> >> Mark
>>> >>
>>> >>
>>> >> On 11/23/2015 10:17 AM, German Anders wrote:
>>> >>>
>>> >>> Thanks a lot for the quick update, Greg. This leads me to ask if there's
>>> >>> anything out there to improve performance in an Infiniband environment
>>> >>> with Ceph. In the cluster that I mentioned earlier, I've set up 4 OSD
>>> >>> server nodes, each with 8 OSD daemons running with 800x Intel SSD
>>> >>> DC S3710 disks (740.2G for OSD and 5G for Journal) and also using IB FDR
>>> >>> 56Gb/s for the PUB and CLUS network, and I'm getting the following fio
>>> >>> numbers:
>>> >>>
>>> >>>
>>> >>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32
>>> --runtime=22
>>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio
>>> --direct=1
>>> >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>>> >>> --group_reporting --exitall --name
>>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>>> >>> --filename=/mnt/rbd/test1
>>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0):
>>> rw=randread,
>>> >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>>> >>> ...
>>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0):
>>> rw=randread,
>>> >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>>> >>> fio-2.1.3
>>> >>> Starting 4 processes
>>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO
>>> file(s)
>>> >>> (1 file(s) / 16384MB)
>>> >>> Jobs: 4 (f=4): [rrrr] [33.8% done] [1082MB/0KB/0KB /s]
>>> [1081/0/0 iops]
>>> >>> [eta 00m:45s]
>>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0,
>>> jobs=4):
>>> >>> err= 0: pid=63852: Mon Nov 23 10:48:07 2015
>>> >>> read : io=21899MB, bw=988.23MB/s, iops=988, runt= 22160msec
>>> >>> slat (usec): min=192, max=186274, avg=3990.48,
>>> stdev=7533.77
>>> >>> clat (usec): min=10, max=808610, avg=125099.41,
>>> stdev=90717.56
>>> >>> lat (msec): min=6, max=809, avg=129.09, stdev=91.14
>>> >>> clat percentiles (msec):
>>> >>> | 1.00th=[ 27], 5.00th=[ 38], 10.00th=[ 45],
>>> 20.00th=[
>>> >>> 61],
>>> >>> | 30.00th=[ 74], 40.00th=[ 85], 50.00th=[ 100],
>>> 60.00th=[
>>> >>> 117],
>>> >>> | 70.00th=[ 141], 80.00th=[ 174], 90.00th=[ 235],
>>> 95.00th=[
>>> >>> 297],
>>> >>> | 99.00th=[ 482], 99.50th=[ 578], 99.90th=[ 717],
>>> 99.95th=[
>>> >>> 750],
>>> >>> | 99.99th=[ 775]
>>> >>> bw (KB /s): min=134691, max=335872, per=25.08%,
>>> avg=253748.08,
>>> >>> stdev=40454.88
>>> >>> lat (usec) : 20=0.01%
>>> >>> lat (msec) : 10=0.02%, 20=0.27%, 50=12.90%, 100=36.93%,
>>> 250=41.39%
>>> >>> lat (msec) : 500=7.59%, 750=0.84%, 1000=0.05%
>>> >>> cpu : usr=0.11%, sys=26.76%, ctx=39695, majf=0,
>>> minf=405
>>> >>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%,
>>> 32=99.4%,
>>> >>> >=64=0.0%
>>> >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> 64=0.0%,
>>> >>> >=64=0.0%
>>> >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%,
>>> 64=0.0%,
>>> >>> >=64=0.0%
>>> >>> issued : total=r=21899/w=0/d=0, short=r=0/w=0/d=0
>>> >>>
>>> >>> Run status group 0 (all jobs):
>>> >>> READ: io=21899MB, aggrb=988.23MB/s, minb=988.23MB/s,
>>> >>> maxb=988.23MB/s, mint=22160msec, maxt=22160msec
>>> >>>
>>> >>> Disk stats (read/write):
>>> >>> rbd1: ios=43736/163, merge=0/5, ticks=3189484/15276,
>>> >>> in_queue=3214988, util=99.78%
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>>
>>> ############################################################################################################################################################
>>> >>>
>>> >>>
>>> >>> # fio --rw=randread --bs=4m --numjobs=4 --iodepth=32
>>> --runtime=22
>>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio
>>> --direct=1
>>> >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>>> >>> --group_reporting --exitall --name
>>> >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec
>>> >>> --filename=/mnt/rbd/test2
>>> >>>
>>> >>> fio-2.1.3
>>> >>> Starting 4 processes
>>> >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: Laying out IO
>>> file(s)
>>> >>> (1 file(s) / 16384MB)
>>> >>> Jobs: 4 (f=4): [rrrr] [28.7% done] [894.3MB/0KB/0KB /s]
>>> [223/0/0 iops]
>>> >>> [eta 00m:57s]
>>> >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: (groupid=0,
>>> jobs=4):
>>> >>> err= 0: pid=64654: Mon Nov 23 10:51:58 2015
>>> >>> read : io=18952MB, bw=876868KB/s, iops=214, runt= 22132msec
>>> >>> slat (usec): min=518, max=81398, avg=18576.88,
>>> stdev=14840.55
>>> >>> clat (msec): min=90, max=1915, avg=570.37, stdev=166.51
>>> >>> lat (msec): min=123, max=1936, avg=588.95, stdev=169.19
>>> >>> clat percentiles (msec):
>>> >>> | 1.00th=[ 258], 5.00th=[ 343], 10.00th=[ 383],
>>> 20.00th=[
>>> >>> 437],
>>> >>> | 30.00th=[ 482], 40.00th=[ 519], 50.00th=[ 553],
>>> 60.00th=[
>>> >>> 594],
>>> >>> | 70.00th=[ 627], 80.00th=[ 685], 90.00th=[ 775],
>>> 95.00th=[
>>> >>> 865],
>>> >>> | 99.00th=[ 1057], 99.50th=[ 1156], 99.90th=[ 1680],
>>> 99.95th=[
>>> >>> 1860],
>>> >>> | 99.99th=[ 1909]
>>> >>> bw (KB /s): min= 5665, max=383251, per=24.61%,
>>> avg=215755.74,
>>> >>> stdev=61735.70
>>> >>> lat (msec) : 100=0.02%, 250=0.80%, 500=33.88%, 750=53.31%,
>>> >>> 1000=10.26%
>>> >>> lat (msec) : 2000=1.73%
>>> >>> cpu : usr=0.07%, sys=12.52%, ctx=32466, majf=0,
>>> minf=372
>>> >>> IO depths : 1=0.1%, 2=0.2%, 4=0.3%, 8=0.7%, 16=1.4%,
>>> 32=97.4%,
>>> >>> >=64=0.0%
>>> >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> 64=0.0%,
>>> >>> >=64=0.0%
>>> >>> complete : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%,
>>> 64=0.0%,
>>> >>> >=64=0.0%
>>> >>> issued : total=r=4738/w=0/d=0, short=r=0/w=0/d=0
>>> >>>
>>> >>> Run status group 0 (all jobs):
>>> >>> READ: io=18952MB, aggrb=876868KB/s, minb=876868KB/s,
>>> >>> maxb=876868KB/s, mint=22132msec, maxt=22132msec
>>> >>>
>>> >>> Disk stats (read/write):
>>> >>> rbd1: ios=37721/177, merge=0/5, ticks=3075924/11408,
>>> >>> in_queue=3097448, util=99.77%
>>> >>>
>>> >>>
>>> >>> Can anyone share some results from a similar environment?
>>> >>>
>>> >>> Thanks in advance,
>>> >>>
>>> >>> Best,
>>> >>>
>>> >>> German
>>> >>>
>>> >>> 2015-11-23 13:08 GMT-03:00 Gregory Farnum:
>>> >>>
>>> >>> On Mon, Nov 23, 2015 at 10:05 AM, German Anders wrote:
>>> >>> > Hi all,
>>> >>> >
>>> >>> > I want to know if there's any improvement or update regarding ceph 0.94.5
>>> >>> > with accelio. I have an already configured cluster (with no data on it) and I
>>> >>> > would like to know if there's a way to 'modify' the cluster in order to use
>>> >>> > accelio. Any info would be really appreciated.
>>> >>>
>>> >>> The XioMessenger is still experimental. As far as I know it's not
>>> >>> expected to be stable any time soon and I can't imagine it will be
>>> >>> backported to Hammer even when done.
>>> >>> -Greg
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> _______________________________________________
>>> >>> ceph-users mailing list
>>> >>> ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>>
>>> >
>>> >
>>> >
>>> >
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.3
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWU1WqCRDmVDuy+mK58QAAo5cQALjuZB+dyjbcRDyScvj/
>>> qjurMqCHlScgG9U8CE4L6/E/QUfCNmdvE4KaeQC82oj/SplXYOuglTHJkUMg
>>> KPyjb9jJs+ZyS560IoUB/l/XQZpO9WL+DNnSAg96Hpb3eG+G5jukW9/E/QHQ
>>> aDjn/c1njEqUhxMAosUFZR58CxejyyI5Vr/SXX+oE6y2tCF31Z3KPiOVTOtj
>>> BPIx74xpigXMSP+zaK4UelhjPzrRnefkN2sLpQS5uwJlOY1f35KoM3dX+LHO
>>> 2BWpyrLUtL6ZzpalKr/QbaWko1VM109vjAoPZ3X82ig9DZp2DW8ZVX4abVcy
>>> +Zyre4SCncKFJZcL9VkQHPJxRFhqXHC43mpSHIKmhuhmGVwr9ngiKGUY1Q7t
>>> O0aks06KHfqSRxjWmuhtP0eMLwsH7gLAEqqtAjnIhRTCDDkhRdp/MdZJ7ftO
>>> LHF9+Eqdp/KiVrGK7BX9zwVshr608bR4g7JCfK4/ukSHXOWFVR6GZ8jue85q
>>> e6dWhHsdwrPt1QnSrfhnKjoMdhTpvPVzlxqo2jHDXEyE57RxW/zXr776HxcQ
>>> cISj4zDZ0nGZ1F8w4DdB0ql8CpsCDAEoaNG0ZQPXcItyrHIB0lFOJYDi5m+4
>>> YqOCG8TWh7b28IbEEwwUSpx3pi2iyH0ObJZM5dgf62AOCKCEsixf+UguFVwd
>>> /jdL
>>> =6LtO
>>> -----END PGP SIGNATURE-----
>>>
>>>
>