Hi Maged,
Thanks a lot for the response. We tried with different numbers of threads and we're getting almost the same kind of difference between the storage types. We're going to try different rbd stripe size and object size values and see if we get more competitive numbers. I'll get back with more tests and parameter changes to see if we get better results :)
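Probably something along these lines for the striped test image (the pool/image name and size below are just placeholders):

rbd create db/stripe-test --size 100G \
    --object-size 4M --stripe-unit 16K --stripe-count 16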
Thanks,
Best,
German
2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokhtar@xxxxxxxxxxx>:
On 2017-11-27 15:02, German Anders wrote:
Hi All,

I've a performance question. We recently installed a brand new Ceph cluster with all-NVMe disks, using ceph version 12.2.0 with bluestore configured. The back-end of the cluster uses a bonded IPoIB link (active/passive), and on the front-end we use a bonding config with active/active (20GbE) to communicate with the clients. The cluster configuration is the following:

MON nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
  2x Intel Xeon E5-2630v4 @2.2Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

OSD nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
  2x Intel Xeon E5-2640v4 @2.4Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  1x Ethernet Controller 10G X550T
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

Here's the tree:

ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-7       48.00000 root root
-5       24.00000     rack rack1
-1       12.00000         node cpn01
 0  nvme  1.00000             osd.0      up  1.00000 1.00000
 1  nvme  1.00000             osd.1      up  1.00000 1.00000
 2  nvme  1.00000             osd.2      up  1.00000 1.00000
 3  nvme  1.00000             osd.3      up  1.00000 1.00000
 4  nvme  1.00000             osd.4      up  1.00000 1.00000
 5  nvme  1.00000             osd.5      up  1.00000 1.00000
 6  nvme  1.00000             osd.6      up  1.00000 1.00000
 7  nvme  1.00000             osd.7      up  1.00000 1.00000
 8  nvme  1.00000             osd.8      up  1.00000 1.00000
 9  nvme  1.00000             osd.9      up  1.00000 1.00000
10  nvme  1.00000             osd.10     up  1.00000 1.00000
11  nvme  1.00000             osd.11     up  1.00000 1.00000
-3       12.00000         node cpn03
24  nvme  1.00000             osd.24     up  1.00000 1.00000
25  nvme  1.00000             osd.25     up  1.00000 1.00000
26  nvme  1.00000             osd.26     up  1.00000 1.00000
27  nvme  1.00000             osd.27     up  1.00000 1.00000
28  nvme  1.00000             osd.28     up  1.00000 1.00000
29  nvme  1.00000             osd.29     up  1.00000 1.00000
30  nvme  1.00000             osd.30     up  1.00000 1.00000
31  nvme  1.00000             osd.31     up  1.00000 1.00000
32  nvme  1.00000             osd.32     up  1.00000 1.00000
33  nvme  1.00000             osd.33     up  1.00000 1.00000
34  nvme  1.00000             osd.34     up  1.00000 1.00000
35  nvme  1.00000             osd.35     up  1.00000 1.00000
-6       24.00000     rack rack2
-2       12.00000         node cpn02
12  nvme  1.00000             osd.12     up  1.00000 1.00000
13  nvme  1.00000             osd.13     up  1.00000 1.00000
14  nvme  1.00000             osd.14     up  1.00000 1.00000
15  nvme  1.00000             osd.15     up  1.00000 1.00000
16  nvme  1.00000             osd.16     up  1.00000 1.00000
17  nvme  1.00000             osd.17     up  1.00000 1.00000
18  nvme  1.00000             osd.18     up  1.00000 1.00000
19  nvme  1.00000             osd.19     up  1.00000 1.00000
20  nvme  1.00000             osd.20     up  1.00000 1.00000
21  nvme  1.00000             osd.21     up  1.00000 1.00000
22  nvme  1.00000             osd.22     up  1.00000 1.00000
23  nvme  1.00000             osd.23     up  1.00000 1.00000
-4       12.00000         node cpn04
36  nvme  1.00000             osd.36     up  1.00000 1.00000
37  nvme  1.00000             osd.37     up  1.00000 1.00000
38  nvme  1.00000             osd.38     up  1.00000 1.00000
39  nvme  1.00000             osd.39     up  1.00000 1.00000
40  nvme  1.00000             osd.40     up  1.00000 1.00000
41  nvme  1.00000             osd.41     up  1.00000 1.00000
42  nvme  1.00000             osd.42     up  1.00000 1.00000
43  nvme  1.00000             osd.43     up  1.00000 1.00000
44  nvme  1.00000             osd.44     up  1.00000 1.00000
45  nvme  1.00000             osd.45     up  1.00000 1.00000
46  nvme  1.00000             osd.46     up  1.00000 1.00000
47  nvme  1.00000             osd.47     up  1.00000 1.00000

The disk partitioning of one of the OSD nodes:

NAME                 MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme6n1              259:1    0   1.1T  0 disk
├─nvme6n1p2          259:15   0   1.1T  0 part
└─nvme6n1p1          259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
nvme9n1              259:0    0   1.1T  0 disk
├─nvme9n1p2          259:8    0   1.1T  0 part
└─nvme9n1p1          259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
sdb                    8:16   0 139.8G  0 disk
└─sdb1                 8:17   0 139.8G  0 part
  └─md0                9:0    0 139.6G  0 raid1
    ├─md0p2          259:31   0     1K  0 md
    ├─md0p5          259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1  0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0  0 111.8G  0 lvm   /
    └─md0p1          259:30   0 486.3M  0 md    /boot
nvme11n1             259:2    0   1.1T  0 disk
├─nvme11n1p1         259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
└─nvme11n1p2         259:14   0   1.1T  0 part
nvme2n1              259:6    0   1.1T  0 disk
├─nvme2n1p2          259:21   0   1.1T  0 part
└─nvme2n1p1          259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
nvme5n1              259:3    0   1.1T  0 disk
├─nvme5n1p1          259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
└─nvme5n1p2          259:10   0   1.1T  0 part
nvme8n1              259:24   0   1.1T  0 disk
├─nvme8n1p1          259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
└─nvme8n1p2          259:28   0   1.1T  0 part
nvme10n1             259:11   0   1.1T  0 disk
├─nvme10n1p1         259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
└─nvme10n1p2         259:23   0   1.1T  0 part
nvme1n1              259:33   0   1.1T  0 disk
├─nvme1n1p1          259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
└─nvme1n1p2          259:35   0   1.1T  0 part
nvme4n1              259:5    0   1.1T  0 disk
├─nvme4n1p1          259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
└─nvme4n1p2          259:19   0   1.1T  0 part
nvme7n1              259:25   0   1.1T  0 disk
├─nvme7n1p1          259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
└─nvme7n1p2          259:29   0   1.1T  0 part
sda                    8:0    0 139.8G  0 disk
└─sda1                 8:1    0 139.8G  0 part
  └─md0                9:0    0 139.6G  0 raid1
    ├─md0p2          259:31   0     1K  0 md
    ├─md0p5          259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1  0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0  0 111.8G  0 lvm   /
    └─md0p1          259:30   0 486.3M  0 md    /boot
nvme0n1              259:36   0   1.1T  0 disk
├─nvme0n1p1          259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
└─nvme0n1p2          259:38   0   1.1T  0 part
nvme3n1              259:4    0   1.1T  0 disk
├─nvme3n1p1          259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
└─nvme3n1p2          259:17   0   1.1T  0 part

For the disk scheduler we're using [kyber]; for read_ahead_kb we tried different values (0, 128 and 2048); rq_affinity is set to 2, and the rotational parameter is set to 0. We've also set the CPU governor to performance on all cores, and tuned some sysctl parameters:

# for Ceph
net.ipv4.ip_forward=0
net.ipv4.conf.default.rp_filter=1
kernel.sysrq=0
kernel.core_uses_pid=1
net.ipv4.tcp_syncookies=0
#net.netfilter.nf_conntrack_max=2621440
#net.netfilter.nf_conntrack_tcp_timeout_established = 1800
# disable netfilter on bridges
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
vm.min_free_kbytes=1000000
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536
# Controls the default maximum size of a message queue
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

The ceph.conf file is:

...
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 1600
osd_pool_default_pgp_num = 1600
debug_crush = 1/1
debug_buffer = 0/1
debug_timer = 0/0
debug_filer = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_ms = 0/5
debug_throttle = 1/1
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_journal = 0/0
debug_filestore = 0/0
debug_mon = 0/0
debug_paxos = 0/0
osd_crush_chooseleaf_type = 0
filestore_xattr_use_omap = true
rbd_cache = true
mon_compact_on_trim = false

[osd]
osd_crush_update_on_start = false

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_default_features = 1
admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log_file = /var/log/ceph/

The cluster has two production pools: one for OpenStack (volumes) with a replication factor of 3, and another pool for the database (db) with a replication factor of 2. The DBA team has performed several tests with a volume mounted on the DB server (over RBD).
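For reference, the per-device queue settings mentioned above boil down to sysfs writes along these lines (the loop below is just a sketch; read_ahead_kb is shown with one of the values we tried):

for q in /sys/block/nvme*n1/queue; do
    echo kyber > $q/scheduler      # blk-mq I/O scheduler
    echo 128   > $q/read_ahead_kb  # also tried 0 and 2048
    echo 2     > $q/rq_affinity
    echo 0     > $q/rotational
done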
The DB server has the following configuration:

OS: CentOS 6.9 | kernel 4.14.1
DB: MySQL
ProLiant BL685c G7
4x AMD Opteron Processor 6376 (total of 64 cores)
128G RAM
1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 VLANs

We also did some tests with sysbench on different storage types:
Is there any specific tuning that I can apply to the ceph cluster in order to improve those numbers? Or are those numbers ok for the type and size of the cluster that we have? Any advice would be really appreciated.

Thanks,
German
Hi,
What is the value of --num-threads (default is 1)? Ceph will perform better with more threads: try 32 or 64.
What are the values of --file-block-size (default 16k) and --file-test-mode? If you are using the sequential modes (seqwr/seqrd) you will be hitting the same OSD, so maybe try random (rndrd/rndwr), or better, use an rbd stripe size of 16k (the default rbd stripe is 4M). rbd striping is ideal for the small-block sequential I/O pattern typical in databases.

/Maged
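PS: for example, a random-read run with a 16k block size and more threads would look something like this (the file size and runtime below are just placeholders):

sysbench --test=fileio --file-total-size=20G prepare
sysbench --test=fileio --file-total-size=20G --file-test-mode=rndrd \
         --file-block-size=16384 --num-threads=64 --max-time=120 \
         --max-requests=0 run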
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com