Hi All,
I have a performance question. We recently installed a brand new Ceph cluster with all-NVMe disks, running Ceph version 12.2.0 with BlueStore. The back-end of the cluster uses an IPoIB bond (active/passive), and the front-end uses an active/active bond (20GbE) to communicate with the clients.
The cluster configuration is the following:
MON Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
2x Intel Xeon E5-2630v4 @ 2.2GHz
128G RAM
2x Intel SSD DC S3520 150G (in RAID-1 for OS)
2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
OSD Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
2x Intel Xeon E5-2640v4 @ 2.4GHz
128G RAM
2x Intel SSD DC S3520 150G (in RAID-1 for OS)
1x Ethernet Controller 10G X550T
1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
Here's the tree:
ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
-7       48.00000 root root
-5       24.00000     rack rack1
-1       12.00000         node cpn01
 0  nvme  1.00000             osd.0       up  1.00000 1.00000
 1  nvme  1.00000             osd.1       up  1.00000 1.00000
 2  nvme  1.00000             osd.2       up  1.00000 1.00000
 3  nvme  1.00000             osd.3       up  1.00000 1.00000
 4  nvme  1.00000             osd.4       up  1.00000 1.00000
 5  nvme  1.00000             osd.5       up  1.00000 1.00000
 6  nvme  1.00000             osd.6       up  1.00000 1.00000
 7  nvme  1.00000             osd.7       up  1.00000 1.00000
 8  nvme  1.00000             osd.8       up  1.00000 1.00000
 9  nvme  1.00000             osd.9       up  1.00000 1.00000
10  nvme  1.00000             osd.10      up  1.00000 1.00000
11  nvme  1.00000             osd.11      up  1.00000 1.00000
-3       12.00000         node cpn03
24  nvme  1.00000             osd.24      up  1.00000 1.00000
25  nvme  1.00000             osd.25      up  1.00000 1.00000
26  nvme  1.00000             osd.26      up  1.00000 1.00000
27  nvme  1.00000             osd.27      up  1.00000 1.00000
28  nvme  1.00000             osd.28      up  1.00000 1.00000
29  nvme  1.00000             osd.29      up  1.00000 1.00000
30  nvme  1.00000             osd.30      up  1.00000 1.00000
31  nvme  1.00000             osd.31      up  1.00000 1.00000
32  nvme  1.00000             osd.32      up  1.00000 1.00000
33  nvme  1.00000             osd.33      up  1.00000 1.00000
34  nvme  1.00000             osd.34      up  1.00000 1.00000
35  nvme  1.00000             osd.35      up  1.00000 1.00000
-6       24.00000     rack rack2
-2       12.00000         node cpn02
12  nvme  1.00000             osd.12      up  1.00000 1.00000
13  nvme  1.00000             osd.13      up  1.00000 1.00000
14  nvme  1.00000             osd.14      up  1.00000 1.00000
15  nvme  1.00000             osd.15      up  1.00000 1.00000
16  nvme  1.00000             osd.16      up  1.00000 1.00000
17  nvme  1.00000             osd.17      up  1.00000 1.00000
18  nvme  1.00000             osd.18      up  1.00000 1.00000
19  nvme  1.00000             osd.19      up  1.00000 1.00000
20  nvme  1.00000             osd.20      up  1.00000 1.00000
21  nvme  1.00000             osd.21      up  1.00000 1.00000
22  nvme  1.00000             osd.22      up  1.00000 1.00000
23  nvme  1.00000             osd.23      up  1.00000 1.00000
-4       12.00000         node cpn04
36  nvme  1.00000             osd.36      up  1.00000 1.00000
37  nvme  1.00000             osd.37      up  1.00000 1.00000
38  nvme  1.00000             osd.38      up  1.00000 1.00000
39  nvme  1.00000             osd.39      up  1.00000 1.00000
40  nvme  1.00000             osd.40      up  1.00000 1.00000
41  nvme  1.00000             osd.41      up  1.00000 1.00000
42  nvme  1.00000             osd.42      up  1.00000 1.00000
43  nvme  1.00000             osd.43      up  1.00000 1.00000
44  nvme  1.00000             osd.44      up  1.00000 1.00000
45  nvme  1.00000             osd.45      up  1.00000 1.00000
46  nvme  1.00000             osd.46      up  1.00000 1.00000
47  nvme  1.00000             osd.47      up  1.00000 1.00000
The disk partition of one of the OSD nodes:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme6n1 259:1 0 1.1T 0 disk
├─nvme6n1p2 259:15 0 1.1T 0 part
└─nvme6n1p1 259:13 0 100M 0 part /var/lib/ceph/osd/ceph-6
nvme9n1 259:0 0 1.1T 0 disk
├─nvme9n1p2 259:8 0 1.1T 0 part
└─nvme9n1p1 259:7 0 100M 0 part /var/lib/ceph/osd/ceph-9
sdb 8:16 0 139.8G 0 disk
└─sdb1 8:17 0 139.8G 0 part
└─md0 9:0 0 139.6G 0 raid1
├─md0p2 259:31 0 1K 0 md
├─md0p5 259:32 0 139.1G 0 md
│ ├─cpn01--vg-swap 253:1 0 27.4G 0 lvm [SWAP]
│ └─cpn01--vg-root 253:0 0 111.8G 0 lvm /
└─md0p1 259:30 0 486.3M 0 md /boot
nvme11n1 259:2 0 1.1T 0 disk
├─nvme11n1p1 259:12 0 100M 0 part /var/lib/ceph/osd/ceph-11
└─nvme11n1p2 259:14 0 1.1T 0 part
nvme2n1 259:6 0 1.1T 0 disk
├─nvme2n1p2 259:21 0 1.1T 0 part
└─nvme2n1p1 259:20 0 100M 0 part /var/lib/ceph/osd/ceph-2
nvme5n1 259:3 0 1.1T 0 disk
├─nvme5n1p1 259:9 0 100M 0 part /var/lib/ceph/osd/ceph-5
└─nvme5n1p2 259:10 0 1.1T 0 part
nvme8n1 259:24 0 1.1T 0 disk
├─nvme8n1p1 259:26 0 100M 0 part /var/lib/ceph/osd/ceph-8
└─nvme8n1p2 259:28 0 1.1T 0 part
nvme10n1 259:11 0 1.1T 0 disk
├─nvme10n1p1 259:22 0 100M 0 part /var/lib/ceph/osd/ceph-10
└─nvme10n1p2 259:23 0 1.1T 0 part
nvme1n1 259:33 0 1.1T 0 disk
├─nvme1n1p1 259:34 0 100M 0 part /var/lib/ceph/osd/ceph-1
└─nvme1n1p2 259:35 0 1.1T 0 part
nvme4n1 259:5 0 1.1T 0 disk
├─nvme4n1p1 259:18 0 100M 0 part /var/lib/ceph/osd/ceph-4
└─nvme4n1p2 259:19 0 1.1T 0 part
nvme7n1 259:25 0 1.1T 0 disk
├─nvme7n1p1 259:27 0 100M 0 part /var/lib/ceph/osd/ceph-7
└─nvme7n1p2 259:29 0 1.1T 0 part
sda 8:0 0 139.8G 0 disk
└─sda1 8:1 0 139.8G 0 part
└─md0 9:0 0 139.6G 0 raid1
├─md0p2 259:31 0 1K 0 md
├─md0p5 259:32 0 139.1G 0 md
│ ├─cpn01--vg-swap 253:1 0 27.4G 0 lvm [SWAP]
│ └─cpn01--vg-root 253:0 0 111.8G 0 lvm /
└─md0p1 259:30 0 486.3M 0 md /boot
nvme0n1 259:36 0 1.1T 0 disk
├─nvme0n1p1 259:37 0 100M 0 part /var/lib/ceph/osd/ceph-0
└─nvme0n1p2 259:38 0 1.1T 0 part
nvme3n1 259:4 0 1.1T 0 disk
├─nvme3n1p1 259:16 0 100M 0 part /var/lib/ceph/osd/ceph-3
└─nvme3n1p2 259:17 0 1.1T 0 part
For the disk scheduler we're using [kyber]. For read_ahead_kb we tried different values (0, 128 and 2048). rq_affinity is set to 2 and the rotational parameter is set to 0.
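To double-check that those queue settings are really applied on every NVMe device, something like the following could be run on each OSD node. This is only a rough sketch: the helper names (active_value, check_queue_settings) and the "nvme*n1" glob are my own, and read_ahead_kb is left out on purpose since we tried several values for it.

```python
from pathlib import Path

# Values described in this post; read_ahead_kb is omitted because it was varied.
EXPECTED = {"scheduler": "kyber", "rq_affinity": "2", "rotational": "0"}


def active_value(raw: str) -> str:
    """Return the active choice from a sysfs value.

    The scheduler file lists all choices and brackets the active one,
    e.g. 'mq-deadline [kyber] none'. Plain values are returned stripped.
    """
    raw = raw.strip()
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return raw


def check_queue_settings(sys_block: Path = Path("/sys/block"),
                         pattern: str = "nvme*n1") -> dict:
    """Map each matching device name to the settings that differ from EXPECTED."""
    mismatches = {}
    for dev in sorted(sys_block.glob(pattern)):
        diffs = {}
        for name, want in EXPECTED.items():
            path = dev / "queue" / name
            got = active_value(path.read_text()) if path.exists() else None
            if got != want:
                diffs[name] = got
        if diffs:
            mismatches[dev.name] = diffs
    return mismatches
```

Calling check_queue_settings() with no arguments reads the live /sys/block tree and returns an empty dict when every matching device has the expected values.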
We've also set the CPU governor to performance on all cores and tuned some sysctl parameters:
# for Ceph
net.ipv4.ip_forward=0
net.ipv4.conf.default.rp_filter=1
kernel.sysrq=0
kernel.core_uses_pid=1
net.ipv4.tcp_syncookies=0
#net.netfilter.nf_conntrack_max=2621440
#net.netfilter.nf_conntrack_tcp_timeout_established = 1800
# disable netfilter on bridges
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
vm.min_free_kbytes=1000000
# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 65536
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the total amount of shared memory, in pages
kernel.shmall = 4294967296
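Since these settings use three different units (KiB, bytes and pages), here is a quick conversion of the memory-related ones to GiB. It is just arithmetic on the values above and assumes 4 KiB pages, which is the usual x86_64 default but is not something I verified with getconf.

```python
PAGE_SIZE = 4096      # assumption: 4 KiB pages (x86_64 default)
GIB = 1024 ** 3

# Memory-related values copied from the sysctl list above.
settings = {
    "vm.min_free_kbytes": 1000000,    # unit: KiB
    "kernel.shmmax": 68719476736,     # unit: bytes
    "kernel.shmall": 4294967296,      # unit: pages
}


def to_bytes(name: str, value: int) -> int:
    """Convert a sysctl value to bytes according to its unit."""
    if name.endswith("_kbytes"):
        return value * 1024
    if name == "kernel.shmall":
        return value * PAGE_SIZE
    return value


sizes_gib = {name: to_bytes(name, value) / GIB for name, value in settings.items()}

for name, gib in sizes_gib.items():
    print(f"{name}: {gib:.2f} GiB")
```

So vm.min_free_kbytes reserves a bit under 1 GiB of the 128G of RAM, kernel.shmmax is 64 GiB, and kernel.shmall works out to 16 TiB under the 4 KiB page assumption.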
The ceph.conf file is:
...
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 1600
osd_pool_default_pgp_num = 1600
debug_crush = 1/1
debug_buffer = 0/1
debug_timer = 0/0
debug_filer = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_ms = 0/5
debug_throttle = 1/1
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_journal = 0/0
debug_filestore = 0/0
debug_mon = 0/0
debug_paxos = 0/0
osd_crush_chooseleaf_type = 0
filestore_xattr_use_omap = true
rbd_cache = true
mon_compact_on_trim = false
[osd]
osd_crush_update_on_start = false
[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_default_features = 1
admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log_file = /var/log/ceph/
The cluster has two production pools: one for OpenStack (volumes) with a replication factor of 3, and another for databases (db) with a replication factor of 2. The DBA team has performed several tests with a volume mounted on the DB server (via RBD). The DB server has the following configuration:
OS: CentOS 6.9 | kernel 4.14.1
DB: MySQL
ProLiant BL685c G7
4x AMD Opteron Processor 6376 (total of 64 cores)
128G RAM
1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 vlans
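Coming back to the pool layout for a moment, here is a rough estimate of the PG density per OSD. It assumes that both pools were created with the default pg_num of 1600 from the ceph.conf above, which I have not confirmed against the actual pools, and it uses the 48 OSDs shown in the tree. The function names are only for illustration.

```python
def pg_replicas_per_osd(pools: dict, num_osds: int) -> float:
    """Average number of PG replicas each OSD holds.

    pools maps pool name -> (pg_num, replication size).
    """
    total_replicas = sum(pg_num * size for pg_num, size in pools.values())
    return total_replicas / num_osds


def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two."""
    return n > 0 and n & (n - 1) == 0


# Assumption: both pools use the default pg_num (1600) from ceph.conf.
pools = {"volumes": (1600, 3), "db": (1600, 2)}
per_osd = pg_replicas_per_osd(pools, num_osds=48)

print(f"PG replicas per OSD: {per_osd:.1f}")
print(f"pg_num 1600 is a power of two: {is_power_of_two(1600)}")
```

Under that assumption each OSD carries roughly 167 PG replicas on average. Also note that 1600 is not a power of two, which is what the Ceph documentation generally suggests for pg_num (the nearest values would be 1024 or 2048).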
We also did some tests with sysbench on different storage types:
Is there any specific tuning that I can apply to the Ceph cluster to improve those numbers? Or are those numbers OK for the type and size of cluster that we have? Any advice would be really appreciated.
Thanks,
German