Hi Cephers,
I've set up a production Ceph cluster with the Jewel release (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)), consisting of 3 MON servers and 6 OSD servers:
3x MON Servers:
2x Intel Xeon E5-2630v3@2.40Ghz
384GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
1x InfiniBand ConnectX-3 ADPT DP
6x OSD Servers:
2x Intel Xeon E5-2650v2@2.60Ghz
128GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
12x 800G Intel DC3510 (osd & journal) on same device
1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other on the CLUS network)
My ceph.conf file is:
[global]
fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxx
mon_initial_members = cibm01, cibm02, cibm03
mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = xx.xx.16.0/20
cluster_network = xx.xx.32.0/20
[mon]
[mon.cibm01]
host = cibm01
mon_addr = xx.xx.xx.1:6789
[mon.cibm02]
host = cibm02
mon_addr = xx.xx.xx.2:6789
[mon.cibm03]
host = cibm03
mon_addr = xx.xx.xx.3:6789
[osd]
osd_pool_default_size = 2
osd_pool_default_min_size = 1
## OSD Configuration ##
[osd.0]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1
[osd.1]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1
...
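For reference, the values the daemons actually picked up can be double-checked at runtime through the admin socket on any OSD node (a quick sanity check, assuming the default socket location):
# ceph daemon osd.0 config show | egrep 'public_network|cluster_network|osd_pool_default'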
They are all running Ubuntu 14.04.4 LTS. The journals are 5GB partitions on each disk, since all the OSD drives are SSDs (Intel DC3510 800G). For example:
sdc 8:32 0 745.2G 0 disk
|-sdc1 8:33 0 740.2G 0 part /var/lib/ceph/osd/ceph-0
`-sdc2 8:34 0 5G 0 part
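Each OSD's journal symlink should point at that second partition (assuming ceph-disk created the layout), which can be confirmed with:
# ls -l /var/lib/ceph/osd/ceph-0/journal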
The purpose of this cluster is to serve as backend storage for Cinder volumes (RBD) and Glance images in an OpenStack cloud; most of the clusters on OpenStack will be non-relational databases like Cassandra, with many instances each.
All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with Mellanox Technologies MT27500 Family [ConnectX-3] adapters.
So I assumed that performance would be really nice, right? ...but I'm getting some numbers that I think should be much better.
# rados --pool rbd bench 10 write -t 16
Total writes made: 1964
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 755.435
Stddev Bandwidth: 90.3288
Max bandwidth (MB/sec): 884
Min bandwidth (MB/sec): 612
Average IOPS: 188
Stddev IOPS: 22
Max IOPS: 221
Min IOPS: 153
Average Latency(s): 0.0836802
Stddev Latency(s): 0.147561
Max latency(s): 1.50925
Min latency(s): 0.0192736
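For reference, raw TCP throughput over IPoIB between two OSD nodes can be sanity-checked independently of Ceph with something like this (assuming iperf is installed on both ends; the address is the cluster-network one from ceph.conf):
On the first node:
# iperf -s
On the second node:
# iperf -c xx.xx.32.1 -P 4 -t 30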
Then I connect to another server (this one is running on QDR, so I would expect something between 2-3Gb/s), map an RBD on the host, create an ext4 filesystem and mount it, and finally run a fio test.
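The map-and-mount steps look roughly like this (a sketch; the image name and size are illustrative, chosen to match the /mnt/host01v1 mount point used below):
# rbd create host01v1 --size 102400
# rbd map host01v1
# mkfs.ext4 /dev/rbd1
# mkdir -p /mnt/host01v1
# mount /dev/rbd1 /mnt/host01v1
The fio run: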
# fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22 --time_based --size=10G --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1
fio-2.1.3
Starting 8 processes
cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
Jobs: 7 (f=7): [wwwwww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0 iops] [eta 00m:00s]
cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr 7 15:24:12 2016
write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
slat (msec): min=1, max=480, avg=46.15, stdev=63.68
clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
clat percentiles (msec):
| 1.00th=[ 235], 5.00th=[ 478], 10.00th=[ 611], 20.00th=[ 766],
| 30.00th=[ 889], 40.00th=[ 988], 50.00th=[ 1106], 60.00th=[ 1237],
| 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
| 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
| 99.99th=[ 8979]
bw (KB /s): min= 3091, max=209877, per=12.31%, avg=83280.51, stdev=35226.98
lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
lat (msec) : 2000=45.04%, >=2000=13.69%
cpu : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
IO depths : 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=3821/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s, maxb=676411KB/s, mint=23138msec, maxt=23138msec
Disk stats (read/write):
rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996, util=99.08%
Does this look acceptable? I mean, for an InfiniBand network I would expect the throughput to be better. How much more can I expect to achieve by tuning the servers? The MTU on the OSD servers is:
MTU: 65520
txqueuelen: 256
No dropped packets were found on the interfaces.
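For reference, those values can be re-checked on each OSD node with something like this (assuming the IPoIB interface is ib0):
# ip -s link show ib0
# ifconfig ib0 | egrep 'MTU|dropped|txqueuelen'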
Also, I've set the following in the openib.conf file:
...
SET_IPOIB_CM=yes
IPOIB_MTU=65520
...
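Whether connected mode and the large MTU actually took effect can be verified per interface via sysfs (assuming the IPoIB interface is ib0; the first file should read "connected" and the second 65520):
# cat /sys/class/net/ib0/mode
# cat /sys/class/net/ib0/mtu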
And in the mlnx.conf file:
...
options mlx4_core enable_sys_tune=1
options mlx4_core log_num_mgm_entry_size=-7
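The running mlx4_core parameters can be confirmed the same way (a quick check; the module must have been reloaded, or the box rebooted, for these to match the file):
# cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size
# cat /sys/module/mlx4_core/parameters/enable_sys_tune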
Can anyone here with experience with InfiniBand setups give me any hints on how to improve performance? I'm getting similar numbers with another cluster on a 10GbE network :S
Thanks,
German