Hi Cephers,
I've set up a production Ceph cluster with the Jewel release (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)), consisting of 3 MON servers and 6 OSD servers:
3x MON Servers:
2x Intel Xeon E5-2630v3@2.40Ghz
384GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
1x InfiniBand ConnectX-3 ADPT DP
6x OSD Servers:
2x Intel Xeon E5-2650v2@2.60Ghz
128GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
12x 800G Intel DC3510 (osd & journal) on same device
1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other on the CLUS network)
My ceph.conf file is:
[global]
fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxx
mon_initial_members = cibm01, cibm02, cibm03
mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = xx.xx.16.0/20
cluster_network = xx.xx.32.0/20
[mon]
[mon.cibm01]
host = cibm01
mon_addr = xx.xx.xx.1:6789
[mon.cibm02]
host = cibm02
mon_addr = xx.xx.xx.2:6789
[mon.cibm03]
host = cibm03
mon_addr = xx.xx.xx.3:6789
[osd]
osd_pool_default_size = 2
osd_pool_default_min_size = 1
## OSD Configuration ##
[osd.0]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1
[osd.1]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1
...
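For reference, the values the daemons actually picked up can be double-checked at runtime through the admin socket on any OSD node (a quick sanity check, assuming the default socket location):
# ceph daemon osd.0 config show | egrep 'public_network|cluster_network|osd_pool_default'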
They are all running Ubuntu 14.04.4 LTS. The journals are 5GB partitions on each disk, since all the OSD drives are SSDs (Intel DC3510 800G). For example:
sdc 8:32 0 745.2G 0 disk
|-sdc1 8:33 0 740.2G 0 part /var/lib/ceph/osd/ceph-0
`-sdc2 8:34 0 5G 0 part
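Each OSD's journal symlink should point at that second partition (assuming ceph-disk created the layout), which can be confirmed with:
# ls -l /var/lib/ceph/osd/ceph-0/journal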
The purpose of this cluster is to serve as backend storage for Cinder volumes (RBD) and Glance images in an OpenStack cloud; most of the clusters on OpenStack will be non-relational databases like Cassandra, with many instances each.
All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with Mellanox Technologies MT27500 Family [ConnectX-3] adapters.
So I assumed that performance would be really nice, right? ...but I'm getting some numbers that I think should be much better.
# rados --pool rbd bench 10 write -t 16
Total writes made: 1964
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 755.435
Stddev Bandwidth: 90.3288
Max bandwidth (MB/sec): 884
Min bandwidth (MB/sec): 612
Average IOPS: 188
Stddev IOPS: 22
Max IOPS: 221
Min IOPS: 153
Average Latency(s): 0.0836802
Stddev Latency(s): 0.147561
Max latency(s): 1.50925
Min latency(s): 0.0192736
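For reference, raw TCP throughput over IPoIB between two OSD nodes can be sanity-checked independently of Ceph with something like this (assuming iperf is installed on both ends; the address is the cluster-network one from ceph.conf):
On the first node:
# iperf -s
On the second node:
# iperf -c xx.xx.32.1 -P 4 -t 30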
Then I connect to another server (this one is running on QDR, so I would expect something between 2-3Gb/s), map an RBD on the host, create an ext4 filesystem and mount it, and finally run a fio test.
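The map-and-mount steps look roughly like this (a sketch; the image name and size are illustrative, chosen to match the /mnt/host01v1 mount point used below):
# rbd create host01v1 --size 102400
# rbd map host01v1
# mkfs.ext4 /dev/rbd1
# mkdir -p /mnt/host01v1
# mount /dev/rbd1 /mnt/host01v1
The fio run: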
# fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22 --time_based --size=10G --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1
fio-2.1.3
Starting 8 processes
cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
Jobs: 7 (f=7): [wwwwww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0 iops] [eta 00m:00s]
cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr 7 15:24:12 2016
write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
slat (msec): min=1, max=480, avg=46.15, stdev=63.68
clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
clat percentiles (msec):
| 1.00th=[ 235], 5.00th=[ 478], 10.00th=[ 611], 20.00th=[ 766],
| 30.00th=[ 889], 40.00th=[ 988], 50.00th=[ 1106], 60.00th=[ 1237],
| 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
| 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
| 99.99th=[ 8979]
bw (KB /s): min= 3091, max=209877, per=12.31%, avg=83280.51, stdev=35226.98
lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
lat (msec) : 2000=45.04%, >=2000=13.69%
cpu : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
IO depths : 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=3821/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s, maxb=676411KB/s, mint=23138msec, maxt=23138msec
Disk stats (read/write):
rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996, util=99.08%
Does this look acceptable? I mean, for an InfiniBand network I would expect the throughput to be better. How much more can I expect to achieve by tuning the servers? The MTU on the OSD servers is:
MTU: 65520
txqueuelen: 256
No dropped packets were found on the interfaces.
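For reference, those values can be re-checked on each OSD node with something like this (assuming the IPoIB interface is ib0):
# ip -s link show ib0
# ifconfig ib0 | egrep 'MTU|dropped|txqueuelen'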
Also, I've set the following in the openib.conf file:
...
SET_IPOIB_CM=yes
IPOIB_MTU=65520
...
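Whether connected mode and the large MTU actually took effect can be verified per interface via sysfs (assuming the IPoIB interface is ib0; the first file should read "connected" and the second 65520):
# cat /sys/class/net/ib0/mode
# cat /sys/class/net/ib0/mtu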
And in the mlnx.conf file:
...
options mlx4_core enable_sys_tune=1
options mlx4_core log_num_mgm_entry_size=-7
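The running mlx4_core parameters can be confirmed the same way (a quick check; the module must have been reloaded, or the box rebooted, for these to match the file):
# cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size
# cat /sys/module/mlx4_core/parameters/enable_sys_tune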
Can anyone here with experience with InfiniBand setups give me any hints on how to improve performance? I'm getting similar numbers with another cluster on a 10GbE network :S
Thanks,
German