Re: Ceph InfiniBand Cluster - Jewel - Performance

On 04/07/2016 02:43 PM, German Anders wrote:
Hi Cephers,

I've set up a production Ceph cluster with the Jewel release
(10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)), consisting of 3 MON
servers and 6 OSD servers:

3x MON Servers:
2x Intel Xeon E5-2630v3@2.40GHz
384GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
1x InfiniBand ConnectX-3 ADPT DP

6x OSD Servers:
2x Intel Xeon E5-2650v2@2.60GHz
128GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
12x 800G Intel DC3510 (OSD & journal on the same device)
1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other
on the CLUS network)

ceph.conf file is:

[global]
fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxx
mon_initial_members = cibm01, cibm02, cibm03
mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = xx.xx.16.0/20
cluster_network = xx.xx.32.0/20

[mon]

[mon.cibm01]
host = cibm01
mon_addr = xx.xx.xx.1:6789

[mon.cibm02]
host = cibm02
mon_addr = xx.xx.xx.2:6789

[mon.cibm03]
host = cibm03
mon_addr = xx.xx.xx.3:6789

[osd]
osd_pool_default_size = 2
osd_pool_default_min_size = 1

## OSD Configuration ##
[osd.0]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1

[osd.1]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1

...



They are all running Ubuntu 14.04.4 LTS. Journals are 5GB partitions
on each disk, since all the OSDs are on SSDs (Intel DC3510 800G). For
example:

sdc                              8:32   0 745.2G  0 disk
|-sdc1                           8:33   0 740.2G  0 part
/var/lib/ceph/osd/ceph-0
`-sdc2                           8:34   0     5G  0 part
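
For reference, a layout like this is roughly what a plain ceph-disk
prepare run would produce (5GB is the Jewel default journal size; exact
partition numbering can differ depending on how the disk was prepared):

# ceph-disk prepare /dev/sdc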

This cluster will serve as backend storage for Cinder volumes (RBD) and
Glance images in an OpenStack cloud; most of the workloads on OpenStack
will be non-relational databases like Cassandra, with many instances
each.

All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with
Mellanox Technologies MT27500 Family [ConnectX-3] adapters.


So I assumed that performance would be really nice, right? ...but I'm
getting some numbers that I think could be quite a bit better.

# rados --pool rbd bench 10 write -t 16

Total writes made:      1964
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     755.435

Stddev Bandwidth:       90.3288
Max bandwidth (MB/sec): 884
Min bandwidth (MB/sec): 612
Average IOPS:           188
Stddev IOPS:            22
Max IOPS:               221
Min IOPS:               153
Average Latency(s):     0.0836802
Stddev Latency(s):      0.147561
Max latency(s):         1.50925
Min latency(s):         0.0192736


Then I connect to another server (this one is running on QDR, so I
would expect something between 2-3GB/s), map an RBD on that host, create
an ext4 fs on it and mount it, and finally run a fio test:

# fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22
--time_based --size=10G --loops=1 --ioengine=libaio --direct=1
--invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
--group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1

fio-2.1.3
Starting 8 processes
cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
Jobs: 7 (f=7): [wwwwww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0
iops] [eta 00m:00s]
cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr  7 15:24:12 2016
   write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
     slat (msec): min=1, max=480, avg=46.15, stdev=63.68
     clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
      lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
     clat percentiles (msec):
      |  1.00th=[  235],  5.00th=[  478], 10.00th=[  611], 20.00th=[  766],
      | 30.00th=[  889], 40.00th=[  988], 50.00th=[ 1106], 60.00th=[ 1237],
      | 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
      | 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
      | 99.99th=[ 8979]
     bw (KB  /s): min= 3091, max=209877, per=12.31%, avg=83280.51,
stdev=35226.98
     lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
     lat (msec) : 2000=45.04%, >=2000=13.69%
   cpu          : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
   IO depths    : 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%,
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >=64=0.0%
      complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%,
 >=64=0.0%
      issued    : total=r=0/w=3821/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s,
maxb=676411KB/s, mint=23138msec, maxt=23138msec

Disk stats (read/write):
   rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996,
util=99.08%


Does this look acceptable? I mean, for an InfiniBand network I would
expect the throughput to be better. How much more can I expect to
achieve by tuning the servers? The MTU on the OSD servers is:

MTU: 65520
No dropped packets found
txqueuelen: 256

Also, I've set the following in the openib.conf file:
...
SET_IPOIB_CM=yes
IPOIB_MTU=65520
...

And in the mlnx.conf file:
...

options mlx4_core enable_sys_tune=1
options mlx4_core log_num_mgm_entry_size=-7
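
To double-check that connected mode and the MTU actually took effect on
each node, I look at (assuming the IPoIB interface is named ib0; adjust
as needed):

# cat /sys/class/net/ib0/mode
# ip link show ib0 | grep -o 'mtu [0-9]*'
# ibstat | grep -i rate

which should report 'connected', 'mtu 65520' and a rate of 56 for FDR.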


Can anyone here with experience on InfiniBand setups give me a hint on
how to improve performance? I'm getting similar numbers with another
cluster on a 10GbE network :S

Couple of thoughts:

1) First I'd use something like iperf/iperf3/netperf to do some point-to-point and potentially all-to-all tests on the network outside of Ceph, just to get a baseline for what to expect both in terms of throughput and in terms of latency. You may find that interrupt affinity tuning helps, but without a baseline it's hard to say. I believe the guys out at ORNL have a fair amount of experience with this, so perhaps someone will chime in. Probably 2.5-2.8GB/s is a reasonable target to shoot for with QDR. See:

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
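
A minimal point-to-point check over the IPoIB addresses might look
something like this (the address and stream count below are just
placeholders, pick whichever nodes you want to test between):

On one node:
# iperf3 -s

From another node (30 second run, 4 parallel streams):
# iperf3 -c xx.xx.32.1 -t 30 -P 4

And for a rough latency baseline:
# ping -c 100 xx.xx.32.1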

2) Verify that you have tcmalloc 2.4 and that ceph is being run with 128MB of threadcache. See:

https://bugs.launchpad.net/ubuntu/+source/google-perftools/+bug/1439277
https://drive.google.com/open?id=0B2gTBZrkrnpZek0zWlE5aVVuRlk

Given what we've seen, though, I'd expect this to matter far more for small random write workloads.
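
As a rough sketch of what to check (whether your init scripts actually
source /etc/default/ceph depends on the packaging, so verify that first):

# dpkg -l | grep perftools
# grep TCMALLOC /etc/default/ceph
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

i.e. look for a 2.4 tcmalloc and a 128MB (134217728 byte) thread cache,
then restart the OSDs so they pick it up.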

3) Increase the concurrency. You've got 72 OSDs and a concurrency of only 16 in one case and 32 in the other. With enough replication you might be able to saturate the OSDs but it'd be worth seeing if higher concurrency helped here.
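
For example (the numbers here are just illustrative starting points, not
tuned values):

# rados --pool rbd bench 30 write -t 64
# rados --pool rbd bench 30 write -t 128

and on the fio side, try raising --numjobs/--iodepth (e.g. --numjobs=16
--iodepth=64) and compare.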



Thanks,

German


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
