I've gotten about 3.2 GB/s with IPoIB on QDR, but it took a couple of weeks
of tuning to get that rate. If your switch is at 2048 MTU, it is really hard
to get it increased without an outage, if I remember correctly. Connected
mode makes it much easier to get higher MTUs, but it was a bit flaky with
IPoIB (sometimes I had to send several pings to get the connection
established). This was all a couple of years ago now, so my memory is a bit
fuzzy. My current IB Ceph cluster is so small that tuning is not going to
help, because the bottleneck is my disks and CPU.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, Nov 24, 2015 at 10:26 AM, German Anders wrote:
> Thanks a lot Robert for the explanation. I understand what you are saying,
> and I'm also excited to see more about IB with Ceph to get those
> performance numbers up, and hopefully (hopefully soon) to see Accelio
> working in production. Regarding the HP IB switch, we have 4 ports
> (uplinks) connected to our IB switch, and internally the blades are
> connected through the backplane to two ports each, so they use the total
> number of ports on the enclosure switch (16 ports). The bonding I've
> configured is active/backup; I didn't know that active/active was possible
> with IPoIB. The adapters on the Ceph nodes (Supermicro servers) are
> Mellanox Technologies MT27500 Family [ConnectX-3]. I also double-checked
> the port type configuration on the IB switch: the speed rate is 14.0 Gbps
> (per lane), the supported MTU is 4096, and the current line rate is
> 56.0 Gbps.
>
> I've tried almost all possible combinations and I'm not getting anything
> better than 1.8 GB/s, so I was wondering if this is the top speed for this
> kind of setup.
>
> Best,
>
> German
>
> 2015-11-24 14:11 GMT-03:00 Robert LeBlanc:
>>
>> I've had wildly different iperf results based on the version of the
>> kernel, OFED, and whether you are using datagram or connected mode, as
>> well as the MTU. You really have to just try all the different options
>> to figure out what works best.
>>
>> Please also remember that you will not get iSER performance out of
>> Ceph at the moment (probably never), but the work being done will
>> help. Even if you get the network transport optimally tuned, unless
>> you have a massive Ceph cluster, you won't get the full performance
>> out of the SSDs. I'm just as excited about Ceph on InfiniBand, but
>> I've had to just chill out and let the devs do their work.
>>
>> I've never had good experiences with active/active bonding on IPoIB.
>> For two blades in the same chassis, you should get non-blocking line
>> rate. For going out of the chassis, you will be limited by the number
>> of ports you connect to the upstream switch (that is why there is
>> usually the same number of uplink ports as there are blades, so that
>> you can be non-blocking; however, HP has been selling switches with
>> only half the uplinks, making your oversubscription 2:1 -- it really
>> depends on what you actually need). Between QDR and FDR, you should
>> get QDR speed. Also be sure it is full FDR and not FDR-10, which is
>> the same signal rate as QDR but with the new 64/66 encoding; it won't
>> give you as much speed improvement as FDR, and it can be difficult to
>> tell which your adapter has if you don't research it. We thought we
>> bought FDR cards only to find out later they were FDR-10.
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
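As a reference for the datagram/connected discussion above: the IPoIB mode
can be checked and switched per interface at runtime. A minimal sketch,
assuming the interface is ib0 (this is a runtime change only; it also has to
go into your interface configuration to survive a reboot):

    # Show whether ib0 is in datagram or connected mode
    cat /sys/class/net/ib0/mode

    # Switch to connected mode and raise the MTU; connected mode allows
    # up to 65520, while datagram mode is capped by the fabric's IB MTU
    # (e.g. 2044 at a 2048 IB MTU, 4092 at 4096)
    echo connected > /sys/class/net/ib0/mode
    ip link set ib0 mtu 65520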
>>
>> On Tue, Nov 24, 2015 at 8:24 AM, German Anders wrote:
>> > Another test made between two HP blades with QDR (with bonding):
>> >
>> > e60-host01# iperf -s
>> > ------------------------------------------------------------
>> > Server listening on TCP port 5001
>> > TCP window size: 85.3 KByte (default)
>> > ------------------------------------------------------------
>> > [  5] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41807
>> > [  4] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41806
>> > [  6] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41808
>> > [  7] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41809
>> > [ ID] Interval       Transfer     Bandwidth
>> > [  5]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  6]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
>> > [  7]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec
>> > [SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec
>> >
>> > e60-host02# iperf -c 172.23.18.2 -P 4
>> > ------------------------------------------------------------
>> > Client connecting to 172.23.18.2, TCP port 5001
>> > TCP window size: 2.50 MByte (default)
>> > ------------------------------------------------------------
>> > [  3] local 172.23.18.1 port 41806 connected with 172.23.18.2 port 5001
>> > [  5] local 172.23.18.1 port 41808 connected with 172.23.18.2 port 5001
>> > [  4] local 172.23.18.1 port 41807 connected with 172.23.18.2 port 5001
>> > [  6] local 172.23.18.1 port 41809 connected with 172.23.18.2 port 5001
>> > [ ID] Interval       Transfer     Bandwidth
>> > [  3]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  5]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
>> > [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
>> > [  6]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec
>> > [SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec
>> >
>> > Note that the blades are also in the same enclosure.
>> >
>> > Bonding configuration:
>> >
>> > alias bond-ib bonding
>> > options bonding mode=1 miimon=100 downdelay=100 updelay=100 max_bonds=2
>> >
>> > ## INFINIBAND CONF
>> >
>> > auto ib0
>> > iface ib0 inet manual
>> >     bond-master bond-ib
>> >
>> > auto ib1
>> > iface ib1 inet manual
>> >     bond-master bond-ib
>> >
>> > auto bond-ib
>> > iface bond-ib inet static
>> >     address 172.23.xx.xx
>> >     netmask 255.255.xx.xx
>> >     slaves ib0 ib1
>> >     bond_miimon 100
>> >     bond_mode active-backup
>> >     pre-up echo connected > /sys/class/net/ib0/mode
>> >     pre-up echo connected > /sys/class/net/ib1/mode
>> >     pre-up /sbin/ifconfig ib0 mtu 65520
>> >     pre-up /sbin/ifconfig ib1 mtu 65520
>> >     pre-up modprobe bond-ib
>> >     pre-up /sbin/ifconfig bond-ib mtu 65520
>> >
>> > German
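A side note on the configuration above: with mode=1 (active-backup), only
one slave passes traffic at a time, so the bond is capped at a single port's
rate. A quick sketch for confirming the mode and the currently active slave,
assuming the bond-ib name from the configuration:

    grep -E "Bonding Mode|Currently Active Slave" /proc/net/bonding/bond-ib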
>> >
>> > 2015-11-24 11:51 GMT-03:00 Mark Nelson:
>> >>
>> >> Each port should be able to do 40Gb/s or 56Gb/s minus overhead and any
>> >> PCIe or card-related bottlenecks. IPoIB will further limit that,
>> >> especially if you haven't done any kind of interrupt affinity tuning.
>> >>
>> >> Assuming these are Mellanox cards, you'll want to read this guide:
>> >>
>> >> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>> >>
>> >> For QDR, I think the maximum throughput with IPoIB I've ever seen was
>> >> about 2.7GB/s for a single port. Typically 2-2.5GB/s is about what you
>> >> should expect for a well-tuned setup.
>> >>
>> >> I'd still suggest doing iperf tests. It's really easy:
>> >>
>> >> "iperf -s" on one node to act as a server.
>> >>
>> >> "iperf -c <server address> -P <parallel streams>" on the client.
>> >>
>> >> This will give you an idea of how your network is doing. All-to-all
>> >> network tests are also useful, in that sometimes network issues crop
>> >> up only when there's lots of traffic across many ports. We've seen
>> >> this in lab environments, especially with bonded ethernet.
>> >>
>> >> Mark
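The all-to-all test Mark mentions is just the pairwise test run from every
node against every other node. A sketch, where the host list and stream
count are assumptions for this cluster:

    # On every node, start a daemonized iperf server first:
    #   iperf -s -D
    # Then, from each node in turn, sweep the other nodes:
    for host in 172.23.18.1 172.23.18.2; do
        iperf -c "$host" -P 4 -t 10
    done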
>> >>
>> >> On 11/24/2015 07:22 AM, German Anders wrote:
>> >>>
>> >>> After doing some more in-depth research and tuning some parameters,
>> >>> I've gained a little more performance:
>> >>>
>> >>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
>> >>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>> >>> --group_reporting --exitall
>> >>> --name dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>> >>> --filename=/mnt/e60host01vol1/test1
>> >>>
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> >>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> >>> ...
>> >>> fio-2.1.3
>> >>> Starting 4 processes
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO
>> >>> file(s) (1 file(s) / 16384MB)
>> >>> Jobs: 4 (f=4): [rrrr] [60.5% done] [1714MB/0KB/0KB /s] [1713/0/0 iops]
>> >>> [eta 00m:15s]
>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
>> >>> err= 0: pid=54857: Tue Nov 24 07:56:30 2015
>> >>>   read : io=38699MB, bw=1754.2MB/s, iops=1754, runt= 22062msec
>> >>>     slat (usec): min=131, max=63426, avg=2249.87, stdev=4320.91
>> >>>     clat (msec): min=2, max=321, avg=70.56, stdev=35.80
>> >>>      lat (msec): min=2, max=321, avg=72.81, stdev=36.13
>> >>>     clat percentiles (msec):
>> >>>      |  1.00th=[  13],  5.00th=[  24], 10.00th=[  30], 20.00th=[  40],
>> >>>      | 30.00th=[  50], 40.00th=[  57], 50.00th=[  65], 60.00th=[  75],
>> >>>      | 70.00th=[  85], 80.00th=[  98], 90.00th=[ 120], 95.00th=[ 139],
>> >>>      | 99.00th=[ 178], 99.50th=[ 194], 99.90th=[ 229], 99.95th=[ 247],
>> >>>      | 99.99th=[ 273]
>> >>>     bw (KB /s): min=301056, max=612352, per=25.01%, avg=449291.87,
>> >>>     stdev=54288.85
>> >>>     lat (msec) : 4=0.11%, 10=0.61%, 20=2.11%, 50=27.87%, 100=50.92%
>> >>>     lat (msec) : 250=18.34%, 500=0.03%
>> >>>   cpu          : usr=0.19%, sys=33.60%, ctx=66708, majf=0, minf=636
>> >>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%, >=64=0.0%
>> >>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> >>>      issued    : total=r=38699/w=0/d=0, short=r=0/w=0/d=0
>> >>>
>> >>> Run status group 0 (all jobs):
>> >>>    READ: io=38699MB, aggrb=1754.2MB/s, minb=1754.2MB/s,
>> >>>    maxb=1754.2MB/s, mint=22062msec, maxt=22062msec
>> >>>
>> >>> Disk stats (read/write):
>> >>>   rbd1: ios=77386/17, merge=0/122, ticks=3168312/500,
>> >>>   in_queue=3170168, util=99.76%
>> >>>
>> >>> The thing is that this test was run from an HP blade enclosure with
>> >>> QDR. If the max throughput on QDR is around 3.2 GB/s, I guess that
>> >>> number must be divided by the total number of ports (in this case 2),
>> >>> so a maximum of 1.6 GB/s is the most throughput I'll get on a single
>> >>> port -- is that correct? I also made another test on a host with FDR
>> >>> (so max throughput would be around 6.8 GB/s); if the same theory
>> >>> holds, that would lead me to 3.4 GB/s per port, but I'm not getting
>> >>> more than 1.4 - 1.6 GB/s. Any ideas? Same tuning on both servers.
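For reference, the ceiling figures being discussed fall out of the link
encoding. A rough sketch of the arithmetic (theoretical payload rates;
IPoIB lands well below these in practice):

    # QDR:    4 lanes x 10 Gbit/s signalling, 8b/10b encoding
    #         40 Gbit/s x 8/10     = 32 Gbit/s    = ~4.0 GB/s
    # FDR-10: 4 lanes x 10.3125 Gbit/s, 64b/66b encoding
    #         41.25 Gbit/s x 64/66 = 40 Gbit/s    = ~5.0 GB/s
    # FDR:    4 lanes x 14.0625 Gbit/s, 64b/66b encoding
    #         56.25 Gbit/s x 64/66 = ~54.5 Gbit/s = ~6.8 GB/s
    # Note: with an active-backup bond only one port carries traffic at
    # a time, so the per-port ceiling is not divided across the ports.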
>> >>>
>> >>> Basically, I changed the cpufreq scaling_governor of all CPUs to
>> >>> 'performance' and then set the following values:
>> >>>
>> >>> sysctl -w net.ipv4.tcp_timestamps=0
>> >>> sysctl -w net.core.netdev_max_backlog=250000
>> >>> sysctl -w net.core.rmem_max=4194304
>> >>> sysctl -w net.core.wmem_max=4194304
>> >>> sysctl -w net.core.rmem_default=4194304
>> >>> sysctl -w net.core.wmem_default=4194304
>> >>> sysctl -w net.core.optmem_max=4194304
>> >>> sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
>> >>> sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
>> >>> sysctl -w net.ipv4.tcp_low_latency=1
>> >>>
>> >>> However, the HP blade doesn't have the same Intel CPUs as the other
>> >>> server, so the governor part of this tuning can't be done there; I
>> >>> left it at the defaults and only changed the TCP networking part.
>> >>>
>> >>> Any comments or hints would be really appreciated.
>> >>>
>> >>> Thanks in advance,
>> >>>
>> >>> Best,
>> >>>
>> >>> German
>> >>>
>> >>> 2015-11-23 15:06 GMT-03:00 Robert LeBlanc:
>> >>>
>> >>> Are you using unconnected mode or connected mode? With connected
>> >>> mode you can up your MTU to 64K, which may help on the network side.
>> >>> ----------------
>> >>> Robert LeBlanc
>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>> >>>
>> >>> On Mon, Nov 23, 2015 at 10:40 AM, German Anders wrote:
>> >>> > Hi Mark,
>> >>> >
>> >>> > Thanks a lot for the quick response. Regarding the numbers you sent
>> >>> > me, they look REALLY nice. I have the following setup:
>> >>> >
>> >>> > 4 OSD nodes:
>> >>> >
>> >>> > 2 x Intel Xeon E5-2650v2 @2.60GHz
>> >>> > 1 x Network controller: Mellanox Technologies MT27500 Family
>> >>> >     [ConnectX-3] Dual-Port (1 for PUB and 1 for CLUS)
>> >>> > 1 x SAS2308 PCI-Express Fusion-MPT SAS-2
>> >>> > 8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on the
>> >>> >     same drive, so a 1:1 relationship)
>> >>> > 3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
>> >>> > 128GB RAM
>> >>> >
>> >>> > [0:0:0:0]  disk  ATA  INTEL SSDSC2BA20  0110  /dev/sdc
>> >>> > [0:0:1:0]  disk  ATA  INTEL SSDSC2BA20  0110  /dev/sdd
>> >>> > [0:0:2:0]  disk  ATA  INTEL SSDSC2BA20  0110  /dev/sde
>> >>> > [0:0:3:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdf
>> >>> > [0:0:4:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdg
>> >>> > [0:0:5:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdh
>> >>> > [0:0:6:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdi
>> >>> > [0:0:7:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdj
>> >>> > [0:0:8:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdk
>> >>> > [0:0:9:0]  disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdl
>> >>> > [0:0:10:0] disk  ATA  INTEL SSDSC2BB80  0130  /dev/sdm
>> >>> >
>> >>> > sdf      8:80   0 745.2G 0 disk
>> >>> > |-sdf1   8:81   0 740.2G 0 part /var/lib/ceph/osd/ceph-16
>> >>> > `-sdf2   8:82   0     5G 0 part
>> >>> > sdg      8:96   0 745.2G 0 disk
>> >>> > |-sdg1   8:97   0 740.2G 0 part /var/lib/ceph/osd/ceph-17
>> >>> > `-sdg2   8:98   0     5G 0 part
>> >>> > sdh      8:112  0 745.2G 0 disk
>> >>> > |-sdh1   8:113  0 740.2G 0 part /var/lib/ceph/osd/ceph-18
>> >>> > `-sdh2   8:114  0     5G 0 part
>> >>> > sdi      8:128  0 745.2G 0 disk
>> >>> > |-sdi1   8:129  0 740.2G 0 part /var/lib/ceph/osd/ceph-19
>> >>> > `-sdi2   8:130  0     5G 0 part
>> >>> > sdj      8:144  0 745.2G 0 disk
>> >>> > |-sdj1   8:145  0 740.2G 0 part /var/lib/ceph/osd/ceph-20
>> >>> > `-sdj2   8:146  0     5G 0 part
>> >>> > sdk      8:160  0 745.2G 0 disk
>> >>> > |-sdk1   8:161  0 740.2G 0 part /var/lib/ceph/osd/ceph-21
>> >>> > `-sdk2   8:162  0     5G 0 part
>> >>> > sdl      8:176  0 745.2G 0 disk
>> >>> > |-sdl1   8:177  0 740.2G 0 part /var/lib/ceph/osd/ceph-22
>> >>> > `-sdl2   8:178  0     5G 0 part
>> >>> > sdm      8:192  0 745.2G 0 disk
>> >>> > |-sdm1   8:193  0 740.2G 0 part /var/lib/ceph/osd/ceph-23
>> >>> > `-sdm2   8:194  0     5G 0 part
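A note on the sysctl and governor changes German describes earlier in the
thread: sysctl -w and writes to scaling_governor are runtime-only. One way
to persist and reapply them, a sketch assuming a Debian-style /etc/sysctl.d
layout (the file name below is hypothetical):

    # Put the net.* settings (without the "sysctl -w" prefix) into
    # /etc/sysctl.d/90-ipoib-tuning.conf, then load them:
    sysctl -p /etc/sysctl.d/90-ipoib-tuning.conf

    # Pin every CPU to the performance governor
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done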
>> >>> >
>> >>> > $ rados bench -p rbd 20 write --no-cleanup -t 4
>> >>> > Maintaining 4 concurrent writes of 4194304 bytes for up to 20
>> >>> > seconds or 0 objects
>> >>> > Object prefix: benchmark_data_cibm01_1409
>> >>> >  sec Cur ops started finished avg MB/s cur MB/s  last lat   avg lat
>> >>> >    0       0       0        0        0        0         -         0
>> >>> >    1       4     121      117  467.894      468 0.0337203 0.0336809
>> >>> >    2       4     244      240  479.895      492 0.0304306 0.0330524
>> >>> >    3       4     372      368  490.559      512 0.0361914 0.0323822
>> >>> >    4       4     491      487  486.899      476 0.0346544 0.0327169
>> >>> >    5       4     587      583  466.302      384  0.110718 0.0342427
>> >>> >    6       4     701      697  464.575      456 0.0324953 0.0343136
>> >>> >    7       4     811      807  461.053      440 0.0400344 0.0345994
>> >>> >    8       4     923      919  459.412      448 0.0255677 0.0345767
>> >>> >    9       4    1032     1028  456.803      436 0.0309743 0.0349256
>> >>> >   10       4    1119     1115  445.917      348  0.229508 0.0357856
>> >>> >   11       4    1222     1218  442.826      412 0.0277902 0.0360635
>> >>> >   12       4    1315     1311  436.919      372 0.0303377 0.0365673
>> >>> >   13       4    1424     1420  436.842      436 0.0288001   0.03659
>> >>> >   14       4    1524     1520  434.206      400 0.0360993 0.0367697
>> >>> >   15       4    1632     1628  434.054      432 0.0296406 0.0366877
>> >>> >   16       4    1740     1736  433.921      432 0.0310995 0.0367746
>> >>> >   17       4    1836     1832   430.98      384 0.0250518 0.0370169
>> >>> >   18       4    1941     1937  430.366      420  0.027502 0.0371341
>> >>> >   19       4    2049     2045  430.448      432 0.0260257 0.0370807
>> >>> > 2015-11-23 12:10:58.587087 min lat: 0.0229266 max lat: 0.27063
>> >>> > avg lat: 0.0373936
>> >>> >   20       4    2141     2137  427.322      368 0.0351276 0.0373936
>> >>> > Total time run:         20.186437
>> >>> > Total writes made:      2141
>> >>> > Write size:             4194304
>> >>> > Bandwidth (MB/sec):     424.245
>> >>> > Stddev Bandwidth:       102.136
>> >>> > Max bandwidth (MB/sec): 512
>> >>> > Min bandwidth (MB/sec): 0
>> >>> > Average Latency:        0.0376536
>> >>> > Stddev Latency:         0.032886
>> >>> > Max latency:            0.27063
>> >>> > Min latency:            0.0229266
>> >>> >
>> >>> > $ rados bench -p rbd 20 seq --no-cleanup -t 4
>> >>> >  sec Cur ops started finished avg MB/s cur MB/s   last lat   avg lat
>> >>> >    0       0       0        0        0        0          -         0
>> >>> >    1       4     394      390  1559.52     1560  0.0148888 0.0102236
>> >>> >    2       4     753      749  1496.68     1436  0.0129162 0.0106595
>> >>> >    3       4    1137     1133  1509.65     1536  0.0101854 0.0105731
>> >>> >    4       4    1526     1522  1521.17     1556  0.0122154 0.0103827
>> >>> >    5       4    1890     1886  1508.07     1456 0.00825445 0.0105908
>> >>> > Total time run:        5.675418
>> >>> > Total reads made:      2141
>> >>> > Read size:             4194304
>> >>> > Bandwidth (MB/sec):    1508.964
>> >>> > Average Latency:       0.0105951
>> >>> > Max latency:           0.211469
>> >>> > Min latency:           0.00603694
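Worth noting on the runs above: -t 4 limits rados bench to four in-flight
4MB objects, so the result is as much a latency measurement as a bandwidth
one. A sketch of the same test with more concurrency, to see whether the
ceiling moves:

    # Higher queue depth; --no-cleanup leaves the objects in place so
    # the seq pass has data to read back
    rados bench -p rbd 20 write -t 32 --no-cleanup
    rados bench -p rbd 20 seq -t 32

    # Remove the benchmark objects when finished
    rados -p rbd cleanup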
>> >>> >
>> >>> > I'm not even close to the numbers that you are getting... :( Any
>> >>> > ideas or hints? I've also configured NOOP as the I/O scheduler for
>> >>> > all the SSD disks. I really don't know what else to look at in
>> >>> > order to improve performance and get numbers similar to yours.
>> >>> >
>> >>> > Thanks in advance,
>> >>> >
>> >>> > Cheers,
>> >>> >
>> >>> > German
>> >>> >
>> >>> > 2015-11-23 13:32 GMT-03:00 Mark Nelson:
>> >>> >>
>> >>> >> Hi German,
>> >>> >>
>> >>> >> I don't have exactly the same setup, but on the Ceph community
>> >>> >> cluster I have tests with:
>> >>> >>
>> >>> >> 4 nodes, each of which is configured in some tests with:
>> >>> >>
>> >>> >> 2 x Intel Xeon E5-2650
>> >>> >> 1 x Intel XL710 40GbE (currently limited to about 2.5GB/s each)
>> >>> >> 1 x Intel P3700 800GB (4 OSDs per card using 4 data and 4 journal
>> >>> >>     partitions)
>> >>> >> 64GB RAM
>> >>> >>
>> >>> >> With filestore, I can get an aggregate throughput of:
>> >>> >>
>> >>> >> 1MB randread: 8715.3MB/s
>> >>> >> 4MB randread: 8046.2MB/s
>> >>> >>
>> >>> >> This is with 4 fio instances on the same nodes as the OSDs, using
>> >>> >> the fio librbd engine.
>> >>> >>
>> >>> >> A couple of things I would suggest trying:
>> >>> >>
>> >>> >> 1) See how rados bench does. This is an easy test and you can see
>> >>> >> how different the numbers look.
>> >>> >>
>> >>> >> 2) Try fio with librbd to see if it might be a qemu limitation.
>> >>> >>
>> >>> >> 3) Assuming you are using IPoIB, try some iperf tests to see how
>> >>> >> your network is doing.
>> >>> >>
>> >>> >> Mark
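Mark's second suggestion, running fio against librbd directly (taking qemu
and the kernel rbd device out of the path), looks roughly like this. A
sketch only: it assumes fio was built with rbd support, and the pool,
image, and client names are placeholders:

    # Requires an existing RBD image (e.g. "rbd create test-img --size 16384")
    fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test-img \
        --rw=randread --bs=1m --iodepth=32 --runtime=22 --time_based \
        --group_reporting --name=librbd-randread-1m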
>> >>> >>
>> >>> >> On 11/23/2015 10:17 AM, German Anders wrote:
>> >>> >>>
>> >>> >>> Thanks a lot for the quick update, Greg. This leads me to ask
>> >>> >>> whether there's anything out there to improve performance in an
>> >>> >>> InfiniBand environment with Ceph. In the cluster I mentioned
>> >>> >>> earlier, I've set up 4 OSD server nodes, each with 8 OSD daemons
>> >>> >>> running on 800GB Intel SSD DC S3510 disks (740.2G for the OSD and
>> >>> >>> 5G for the journal), also using IB FDR 56Gb/s for the PUB and
>> >>> >>> CLUS networks, and I'm getting the following fio numbers:
>> >>> >>>
>> >>> >>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>> >>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio
>> >>> >>> --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
>> >>> >>> --norandommap --group_reporting --exitall
>> >>> >>> --name dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>> >>> >>> --filename=/mnt/rbd/test1
>> >>> >>>
>> >>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0):
>> >>> >>> rw=randread, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> >>> >>> ...
>> >>> >>> fio-2.1.3
>> >>> >>> Starting 4 processes
>> >>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO
>> >>> >>> file(s) (1 file(s) / 16384MB)
>> >>> >>> Jobs: 4 (f=4): [rrrr] [33.8% done] [1082MB/0KB/0KB /s]
>> >>> >>> [1081/0/0 iops] [eta 00m:45s]
>> >>> >>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0,
>> >>> >>> jobs=4): err= 0: pid=63852: Mon Nov 23 10:48:07 2015
>> >>> >>>   read : io=21899MB, bw=988.23MB/s, iops=988, runt= 22160msec
>> >>> >>>     slat (usec): min=192, max=186274, avg=3990.48, stdev=7533.77
>> >>> >>>     clat (usec): min=10, max=808610, avg=125099.41, stdev=90717.56
>> >>> >>>      lat (msec): min=6, max=809, avg=129.09, stdev=91.14
>> >>> >>>     clat percentiles (msec):
>> >>> >>>      |  1.00th=[  27],  5.00th=[  38], 10.00th=[  45], 20.00th=[  61],
>> >>> >>>      | 30.00th=[  74], 40.00th=[  85], 50.00th=[ 100], 60.00th=[ 117],
>> >>> >>>      | 70.00th=[ 141], 80.00th=[ 174], 90.00th=[ 235], 95.00th=[ 297],
>> >>> >>>      | 99.00th=[ 482], 99.50th=[ 578], 99.90th=[ 717], 99.95th=[ 750],
>> >>> >>>      | 99.99th=[ 775]
>> >>> >>>     bw (KB /s): min=134691, max=335872, per=25.08%, avg=253748.08,
>> >>> >>>     stdev=40454.88
>> >>> >>>     lat (usec) : 20=0.01%
>> >>> >>>     lat (msec) : 10=0.02%, 20=0.27%, 50=12.90%, 100=36.93%, 250=41.39%
>> >>> >>>     lat (msec) : 500=7.59%, 750=0.84%, 1000=0.05%
>> >>> >>>   cpu          : usr=0.11%, sys=26.76%, ctx=39695, majf=0, minf=405
>> >>> >>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>> >>> >>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>> >>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> >>> >>>      issued    : total=r=21899/w=0/d=0, short=r=0/w=0/d=0
>> >>> >>>
>> >>> >>> Run status group 0 (all jobs):
>> >>> >>>    READ: io=21899MB, aggrb=988.23MB/s, minb=988.23MB/s,
>> >>> >>>    maxb=988.23MB/s, mint=22160msec, maxt=22160msec
>> >>> >>>
>> >>> >>> Disk stats (read/write):
>> >>> >>>   rbd1: ios=43736/163, merge=0/5, ticks=3189484/15276,
>> >>> >>>   in_queue=3214988, util=99.78%
>> >>> >>>
>> >>> >>> ##################################################################
>> >>> >>>
>> >>> >>> # fio --rw=randread --bs=4m --numjobs=4 --iodepth=32 --runtime=22
>> >>> >>> --time_based --size=16777216k --loops=1 --ioengine=libaio
>> >>> >>> --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
>> >>> >>> --norandommap --group_reporting --exitall
>> >>> >>> --name dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec
>> >>> >>> --filename=/mnt/rbd/test2
>> >>> >>>
>> >>> >>> fio-2.1.3
>> >>> >>> Starting 4 processes
>> >>> >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: Laying out IO
>> >>> >>> file(s) (1 file(s) / 16384MB)
>> >>> >>> Jobs: 4 (f=4): [rrrr] [28.7% done] [894.3MB/0KB/0KB /s]
>> >>> >>> [223/0/0 iops] [eta 00m:57s]
>> >>> >>> dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec: (groupid=0,
>> >>> >>> jobs=4): err= 0: pid=64654: Mon Nov 23 10:51:58 2015
>> >>> >>>   read : io=18952MB, bw=876868KB/s, iops=214, runt= 22132msec
>> >>> >>>     slat (usec): min=518, max=81398, avg=18576.88, stdev=14840.55
>> >>> >>>     clat (msec): min=90, max=1915, avg=570.37, stdev=166.51
>> >>> >>>      lat (msec): min=123, max=1936, avg=588.95, stdev=169.19
>> >>> >>>     clat percentiles (msec):
>> >>> >>>      |  1.00th=[  258],  5.00th=[  343], 10.00th=[  383], 20.00th=[  437],
>> >>> >>>      | 30.00th=[  482], 40.00th=[  519], 50.00th=[  553], 60.00th=[  594],
>> >>> >>>      | 70.00th=[  627], 80.00th=[  685], 90.00th=[  775], 95.00th=[  865],
>> >>> >>>      | 99.00th=[ 1057], 99.50th=[ 1156], 99.90th=[ 1680], 99.95th=[ 1860],
>> >>> >>>      | 99.99th=[ 1909]
>> >>> >>>     bw (KB /s): min= 5665, max=383251, per=24.61%, avg=215755.74,
>> >>> >>>     stdev=61735.70
>> >>> >>>     lat (msec) : 100=0.02%, 250=0.80%, 500=33.88%, 750=53.31%,
>> >>> >>>     1000=10.26%
>> >>> >>>     lat (msec) : 2000=1.73%
>> >>> >>>   cpu          : usr=0.07%, sys=12.52%, ctx=32466, majf=0, minf=372
>> >>> >>>   IO depths    : 1=0.1%, 2=0.2%, 4=0.3%, 8=0.7%, 16=1.4%, 32=97.4%, >=64=0.0%
>> >>> >>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>> >>>      complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>> >>> >>>      issued    : total=r=4738/w=0/d=0, short=r=0/w=0/d=0
>> >>> >>>
>> >>> >>> Run status group 0 (all jobs):
>> >>> >>>    READ: io=18952MB, aggrb=876868KB/s, minb=876868KB/s,
>> >>> >>>    maxb=876868KB/s, mint=22132msec, maxt=22132msec
>> >>> >>>
>> >>> >>> Disk stats (read/write):
>> >>> >>>   rbd1: ios=37721/177, merge=0/5, ticks=3075924/11408,
>> >>> >>>   in_queue=3097448, util=99.77%
>> >>> >>>
>> >>> >>> Can anyone share some results from a similar environment?
>> >>> >>>
>> >>> >>> Thanks in advance,
>> >>> >>>
>> >>> >>> Best,
>> >>> >>>
>> >>> >>> German
>> >>> >>>
>> >>> >>> 2015-11-23 13:08 GMT-03:00 Gregory Farnum:
>> >>> >>>
>> >>> >>> On Mon, Nov 23, 2015 at 10:05 AM, German Anders wrote:
>> >>> >>> > Hi all,
>> >>> >>> >
>> >>> >>> > I want to know if there's any improvement or update regarding
>> >>> >>> > Ceph 0.94.5 with Accelio. I have an already configured cluster
>> >>> >>> > (with no data on it), and I would like to know if there's a way
>> >>> >>> > to 'modify' the cluster in order to use Accelio. Any info would
>> >>> >>> > be really appreciated.
>> >>> >>>
>> >>> >>> The XioMessenger is still experimental. As far as I know it's not
>> >>> >>> expected to be stable any time soon, and I can't imagine it will
>> >>> >>> be backported to Hammer even when done.
>> >>> >>> -Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com