Re: Ceph InfiniBand Cluster - Jewel - Performance


Ceph cannot use native InfiniBand protocols yet, so it is only
leveraging IPoIB at the moment. The most likely reason you are only
getting ~10 Gb/s performance is that IPoIB relies heavily on multicast
in InfiniBand (if you do some research in this area you will
understand why unicast IP still uses multicast on an InfiniBand
network). To stay compatible with all adapters, the subnet manager
sets the multicast rate to 10 Gb/s so that SDR adapters can
participate without dropping packets. If you know that you will never
have adapters below a certain speed, you can configure the subnet
manager to use a higher rate. This does not change IPoIB networks that
are already configured (I had to down all the IPoIB adapters at the
same time and bring them back up for the new rate to take effect).

Even after that, performance still wasn't close to native InfiniBand,
but I got at least a 2x improvement (along with setting the MTU to
64K) on the FDR adapters. There is still a lot of overhead in IPoIB,
so it is not an ideal transport for getting performance out of
InfiniBand; I think of it as a compatibility feature. Hopefully that
gives you enough information to do the research. If you search the
OFED mailing list, you will find some posts from me 2-3 years ago on
this very topic.
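
For reference, the knob I'm describing lives in the subnet manager's
partition configuration. A minimal sketch, assuming opensm with its
partition file at /etc/opensm/partitions.conf (the path and the exact
rate codes vary by distribution and opensm version, so check the
partition-config documentation that ships with yours):

# rate=3 (the default) caps the IPoIB broadcast/multicast group at
# 10 Gb/s; rate=7 raises it to 40 Gb/s (QDR). Newer opensm releases
# accept rate=12 for 56 Gb/s (FDR), but only use that if every port on
# the fabric can run at that speed. mtu=5 sets the group MTU to 4096.
Default=0x7fff, ipoib, rate=7, mtu=5 : ALL=full;

Restarting opensm alone is not enough: the existing broadcast group
keeps its old rate, which is why I had to take all the IPoIB
interfaces down at the same time and bring them back up so the group
was re-created with the new parameters.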

Good luck and keep holding out for Ceph with XIO.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Apr 7, 2016 at 1:43 PM, German Anders <ganders@xxxxxxxxxxxx> wrote:
> Hi Cephers,
>
> I've setup a production environment Ceph cluster with the Jewel release
> (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)) consisting of 3 MON
> Servers and 6 OSD Servers:
>
> 3x MON Servers:
> 2x Intel Xeon E5-2630v3@2.40Ghz
> 384GB RAM
> 2x 200G Intel DC3700 in RAID-1 for OS
> 1x InfiniBand ConnectX-3 ADPT DP
>
> 6x OSD Servers:
> 2x Intel Xeon E5-2650v2@2.60Ghz
> 128GB RAM
> 2x 200G Intel DC3700 in RAID-1 for OS
> 12x 800G Intel DC3510 (osd & journal) on same device
> 1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other on
> the CLUS network)
>
> ceph.conf file is:
>
> [global]
> fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxx
> mon_initial_members = cibm01, cibm02, cibm03
> mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> public_network = xx.xx.16.0/20
> cluster_network = xx.xx.32.0/20
>
> [mon]
>
> [mon.cibm01]
> host = cibm01
> mon_addr = xx.xx.xx.1:6789
>
> [mon.cibm02]
> host = cibm02
> mon_addr = xx.xx.xx.2:6789
>
> [mon.cibm03]
> host = cibm03
> mon_addr = xx.xx.xx.3:6789
>
> [osd]
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
>
> ## OSD Configuration ##
> [osd.0]
> host = cibn01
> public_addr = xx.xx.17.1
> cluster_addr = xx.xx.32.1
>
> [osd.1]
> host = cibn01
> public_addr = xx.xx.17.1
> cluster_addr = xx.xx.32.1
>
> ...
>
>
>
> They are all running Ubuntu 14.04.4 LTS. Journals are 5GB partitions on each
> disk, since all the OSDs are on SSDs (Intel DC3510 800G). For example:
>
> sdc                              8:32   0 745.2G  0 disk
> |-sdc1                           8:33   0 740.2G  0 part
> /var/lib/ceph/osd/ceph-0
> `-sdc2                           8:34   0     5G  0 part
>
> The purpose of this cluster is to serve as backend storage for Cinder
> volumes (RBD) and Glance images in an OpenStack cloud; most of the clusters
> on OpenStack will be non-relational databases like Cassandra, with many
> instances each.
>
> All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with
> Mellanox Technologies MT27500 Family [ConnectX-3] adapters.
>
>
> So I assumed that performance would be really nice, right?... but I'm getting
> some numbers that I think should be considerably better.
>
> # rados --pool rbd bench 10 write -t 16
>
> Total writes made:      1964
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     755.435
>
> Stddev Bandwidth:       90.3288
> Max bandwidth (MB/sec): 884
> Min bandwidth (MB/sec): 612
> Average IOPS:           188
> Stddev IOPS:            22
> Max IOPS:               221
> Min IOPS:               153
> Average Latency(s):     0.0836802
> Stddev Latency(s):      0.147561
> Max latency(s):         1.50925
> Min latency(s):         0.0192736
>
>
> Then I connect to another server (this one is running QDR, so I would expect
> something between 2-3 Gb/s), map an RBD on that host, create an ext4 fs,
> mount it, and finally run a fio test:
>
> # fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22
> --time_based --size=10G --loops=1 --ioengine=libaio --direct=1
> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
> --group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1
>
> fio-2.1.3
> Starting 8 processes
> cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
> Jobs: 7 (f=7): [wwwwww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0 iops]
> [eta 00m:00s]
> cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr  7 15:24:12 2016
>   write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
>     slat (msec): min=1, max=480, avg=46.15, stdev=63.68
>     clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
>      lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
>     clat percentiles (msec):
>      |  1.00th=[  235],  5.00th=[  478], 10.00th=[  611], 20.00th=[  766],
>      | 30.00th=[  889], 40.00th=[  988], 50.00th=[ 1106], 60.00th=[ 1237],
>      | 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
>      | 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
>      | 99.99th=[ 8979]
>     bw (KB  /s): min= 3091, max=209877, per=12.31%, avg=83280.51,
> stdev=35226.98
>     lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
>     lat (msec) : 2000=45.04%, >=2000=13.69%
>   cpu          : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
>   IO depths    : 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=3821/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s, maxb=676411KB/s,
> mint=23138msec, maxt=23138msec
>
> Disk stats (read/write):
>   rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996,
> util=99.08%
>
>
> Does it look acceptable? I mean, for an InfiniBand network I would expect the
> throughput to be better. How much more can I expect to achieve by tuning the
> servers? The MTU on the OSD servers is:
>
> MTU: 65520
> No dropped packets found
> txqueuelen: 256
>
> Also, I've set the following in the openib.conf file:
> ...
> SET_IPOIB_CM=yes
> IPOIB_MTU=65520
> ...
>
> And in the mlnx.conf file:
> ...
>
> options mlx4_core enable_sys_tune=1
> options mlx4_core log_num_mgm_entry_size=-7
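
(Those settings only help if they actually took effect. A quick way to
sanity-check them at runtime, assuming the IPoIB interface is ib0 and
the mlx4 driver is in use; sysfs paths can differ between OFED
versions:

cat /sys/class/net/ib0/mode        # "connected" if IPOIB_CM applied
cat /sys/class/net/ib0/mtu         # should print 65520
cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size
cat /sys/module/mlx4_core/parameters/enable_sys_tune

If ib0 reports "datagram" or an MTU of 2044, the connected-mode/MTU
settings were not picked up.)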
>
>
> Can anyone here with experience with InfiniBand setups give me any hints on
> how to 'improve' performance? I'm getting similar numbers with another
> cluster on a 10GbE network :S
>
>
> Thanks,
>
> German
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


