Re: Ceph InfiniBand Cluster - Jewel - Performance

Also, isn't Jewel supposed to get more 'performance', since it uses BlueStore to store metadata? Or do I need to specify during install to use BlueStore?
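
From what I have read, BlueStore in Jewel is still experimental and has to
be enabled explicitly, so my (untested) understanding is something like:

# ceph.conf - exact experimental feature string may differ, please correct me
[global]
enable experimental unrecoverable data corrupting features = bluestore rocksdb
osd objectstore = bluestore

# and prepare the OSDs with bluestore instead of filestore:
ceph-disk prepare --bluestore /dev/sdX    # /dev/sdX is a placeholder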

Thanks,


German

2016-04-07 16:55 GMT-03:00 Robert LeBlanc <robert@xxxxxxxxxxxxx>:

Ceph is not able to use native Infiniband protocols yet and so it is
only leveraging IPoIB at the moment. The most likely reason you are
only getting ~10 Gb performance is that IPoIB heavily leverages
multicast in Infiniband (if you do some research in this area you will
understand why unicast IP still uses multicast on an Infiniband
network). For maximum compatibility across adapters, the subnet
manager will set the speed of multicast to 10 Gb/s so that SDR
adapters can be used and not drop packets. If you know that you will
never have adapters under a certain speed, you can configure the
subnet manager to use a higher speed. This does not change IPoIB
networks that are already configured (I had to down all the IPoIB
adapters at the same time and bring them back up to pick up the new speed).
Even after that, there still wasn't similar performance to native
Infiniband, but I got at least a 2x improvement (along with setting
the MTU to 64K) on the FDR adapters. There is still a ton of overhead
in IPoIB, so it is not an ideal transport for getting performance out of
Infiniband; I think of it as a compatibility feature. Hopefully, that
will give you enough information to perform the research. If you
search the OFED mailing list, you will see some posts from me 2-3
years ago regarding this very topic.
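
If you are running OpenSM, the knob is the partition definition; from
memory it looks something like the following (double-check the rate and
mtu codes against the opensm man page, and the file path varies by
distro):

# /etc/opensm/partitions.conf
# rate=3 is the 10 Gb/s default; rate=7 = 40 Gb/s (QDR), rate=12 = 56 Gb/s (FDR)
# mtu=5 = 4096-byte IB MTU
Default=0x7fff, ipoib, mtu=5, rate=7 : ALL=full;

Then restart opensm and bounce every IPoIB interface at the same time so
the multicast group is re-created at the new rate.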

Good luck and keep holding out for Ceph with XIO.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Apr 7, 2016 at 1:43 PM, German Anders <ganders@xxxxxxxxxxxx> wrote:
> Hi Cephers,
>
> I've setup a production environment Ceph cluster with the Jewel release
> (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)) consisting of 3 MON
> Servers and 6 OSD Servers:
>
> 3x MON Servers:
> 2x Intel Xeon E5-2630v3@2.40Ghz
> 384GB RAM
> 2x 200G Intel DC3700 in RAID-1 for OS
> 1x InfiniBand ConnectX-3 ADPT DP
>
> 6x OSD Servers:
> 2x Intel Xeon E5-2650v2@2.60Ghz
> 128GB RAM
> 2x 200G Intel DC3700 in RAID-1 for OS
> 12x 800G Intel DC3510 (osd & journal) on same device
> 1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other on
> the CLUS network)
>
> ceph.conf file is:
>
> [global]
> fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxx
> mon_initial_members = cibm01, cibm02, cibm03
> mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> public_network = xx.xx.16.0/20
> cluster_network = xx.xx.32.0/20
>
> [mon]
>
> [mon.cibm01]
> host = cibm01
> mon_addr = xx.xx.xx.1:6789
>
> [mon.cibm02]
> host = cibm02
> mon_addr = xx.xx.xx.2:6789
>
> [mon.cibm03]
> host = cibm03
> mon_addr = xx.xx.xx.3:6789
>
> [osd]
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
>
> ## OSD Configuration ##
> [osd.0]
> host = cibn01
> public_addr = xx.xx.17.1
> cluster_addr = xx.xx.32.1
>
> [osd.1]
> host = cibn01
> public_addr = xx.xx.17.1
> cluster_addr = xx.xx.32.1
>
> ...
>
>
>
> They are all running Ubuntu 14.04.4 LTS. Journals are 5GB partitions on each
> disk, since all the OSDs are on SSDs (Intel DC3510 800G). For
> example:
>
> sdc                              8:32   0 745.2G  0 disk
> |-sdc1                           8:33   0 740.2G  0 part
> /var/lib/ceph/osd/ceph-0
> `-sdc2                           8:34   0     5G  0 part
>
> The purpose of this cluster will be to serve as a backend storage for Cinder
> volumes (RBD) and Glance images in an OpenStack cloud, most of the clusters
> on OpenStack will be non-relational databases like Cassandra with many
> instances each.
>
> All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with
> Mellanox Technologies MT27500 Family [ConnectX-3] adapters.
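>
> (The negotiated link rate can be double-checked per node with ibstat
> from infiniband-diags; the HCA name mlx4_0 and port 1 below are just
> the usual ConnectX-3 defaults:
>
> ibstat mlx4_0 1 | grep -i rate    # should report "Rate: 56" on an FDR link)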
>
>
> So I assumed that performance would be really nice, right?... but I'm getting
> some numbers that I think could be much better.
>
> # rados --pool rbd bench 10 write -t 16
>
> Total writes made:      1964
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     755.435
>
> Stddev Bandwidth:       90.3288
> Max bandwidth (MB/sec): 884
> Min bandwidth (MB/sec): 612
> Average IOPS:           188
> Stddev IOPS:            22
> Max IOPS:               221
> Min IOPS:               153
> Average Latency(s):     0.0836802
> Stddev Latency(s):      0.147561
> Max latency(s):         1.50925
> Min latency(s):         0.0192736
>
>
> Then I connect to another server (this one is running on QDR - so I would
> expect something between 2-3Gb/s), map an RBD on the host, create an
> ext4 fs, mount it, and finally run a fio test:
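>
> Roughly what I did to get there (image name and size are just examples;
> depending on the kernel you may need to limit the image features):
>
> rbd create rbd/host01v1 --size 102400 --image-feature layering   # 100 GB image
> rbd map rbd/host01v1              # shows up as /dev/rbd0
> mkfs.ext4 /dev/rbd0
> mkdir -p /mnt/host01v1 && mount /dev/rbd0 /mnt/host01v1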
>
> # fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22
> --time_based --size=10G --loops=1 --ioengine=libaio --direct=1
> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
> --group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1
>
> fio-2.1.3
> Starting 8 processes
> cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
> Jobs: 7 (f=7): [wwwwww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0 iops]
> [eta 00m:00s]
> cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr  7 15:24:12 2016
>   write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
>     slat (msec): min=1, max=480, avg=46.15, stdev=63.68
>     clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
>      lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
>     clat percentiles (msec):
>      |  1.00th=[  235],  5.00th=[  478], 10.00th=[  611], 20.00th=[  766],
>      | 30.00th=[  889], 40.00th=[  988], 50.00th=[ 1106], 60.00th=[ 1237],
>      | 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
>      | 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
>      | 99.99th=[ 8979]
>     bw (KB  /s): min= 3091, max=209877, per=12.31%, avg=83280.51,
> stdev=35226.98
>     lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
>     lat (msec) : 2000=45.04%, >=2000=13.69%
>   cpu          : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
>   IO depths    : 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%,
>>=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
>      complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%,
>>=64=0.0%
>      issued    : total=r=0/w=3821/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s, maxb=676411KB/s,
> mint=23138msec, maxt=23138msec
>
> Disk stats (read/write):
>   rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996,
> util=99.08%
>
>
> Does it look acceptable? I mean, for an InfiniBand network I guess the
> throughput needs to be better. How much more can I expect to achieve by
> tuning the servers? The MTU on the OSD servers is:
>
> MTU: 65520
> No dropped packets found
> txqueuelen:256
>
> Also, I've set this up in the openib.conf file:
> ...
> SET_IPOIB_CM=yes
> IPOIB_MTU=65520
> ...
>
> And in the mlnx.conf file:
> ...
>
> options mlx4_core enable_sys_tune=1
> options mlx4_core log_num_mgm_entry_size=-7
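>
> (A quick sanity check that connected mode and the big MTU actually took
> effect - ib0 here is just whatever your IPoIB interface is called:
>
> cat /sys/class/net/ib0/mode                # should print "connected"
> ip link show ib0 | grep -o 'mtu [0-9]*'    # should show mtu 65520)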
>
>
> Can anyone here with experience on InfiniBand setups give me any hints to
> 'improve' performance? I'm getting similar numbers with another
> cluster on a 10GbE network :S
>
>
> Thanks,
>
> German
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
