> On 27 November 2017 at 14:14, German Anders <ganders@xxxxxxxxxxxx> wrote:
>
> Hi Jason,
>
> We are using librbd (librbd1-0.80.5-9.el6.x86_64). OK, I will change those
> parameters and see if that changes something.
>

0.80? Is that a typo? You should really use 12.2.1 on the client.

Wido

> thanks a lot
>
> best,
>
> *German*
>
> 2017-11-27 10:09 GMT-03:00 Jason Dillaman <jdillama@xxxxxxxxxx>:
>
> > Are you using krbd or librbd? You might want to consider "debug_ms = 0/0"
> > as well, since per-message log gathering takes a large hit on small I/O
> > performance.
> >
> > On Mon, Nov 27, 2017 at 8:02 AM, German Anders <ganders@xxxxxxxxxxxx> wrote:
> >
> >> Hi All,
> >>
> >> I have a performance question. We recently installed a brand-new Ceph
> >> cluster with all-NVMe disks, using Ceph version 12.2.0 with BlueStore
> >> configured. The back-end of the cluster uses an IPoIB bond
> >> (active/passive), and the front-end uses an active/active bond (20GbE)
> >> to communicate with the clients.
> >>
> >> The cluster configuration is the following:
> >>
> >> *MON Nodes:*
> >> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> >> 3x 1U servers:
> >> 2x Intel Xeon E5-2630v4 @2.2GHz
> >> 128G RAM
> >> 2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> >> 2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >>
> >> *OSD Nodes:*
> >> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> >> 4x 2U servers:
> >> 2x Intel Xeon E5-2640v4 @2.4GHz
> >> 128G RAM
> >> 2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> >> 1x Ethernet Controller 10G X550T
> >> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >> 12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
> >> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
> >>
> >> Here's the tree:
> >>
> >> ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
> >> -7       48.00000 root root
> >> -5       24.00000     rack rack1
> >> -1       12.00000         node cpn01
> >>  0  nvme  1.00000             osd.0      up  1.00000 1.00000
> >>  1  nvme  1.00000             osd.1      up  1.00000 1.00000
> >>  2  nvme  1.00000             osd.2      up  1.00000 1.00000
> >>  3  nvme  1.00000             osd.3      up  1.00000 1.00000
> >>  4  nvme  1.00000             osd.4      up  1.00000 1.00000
> >>  5  nvme  1.00000             osd.5      up  1.00000 1.00000
> >>  6  nvme  1.00000             osd.6      up  1.00000 1.00000
> >>  7  nvme  1.00000             osd.7      up  1.00000 1.00000
> >>  8  nvme  1.00000             osd.8      up  1.00000 1.00000
> >>  9  nvme  1.00000             osd.9      up  1.00000 1.00000
> >> 10  nvme  1.00000             osd.10     up  1.00000 1.00000
> >> 11  nvme  1.00000             osd.11     up  1.00000 1.00000
> >> -3       12.00000         node cpn03
> >> 24  nvme  1.00000             osd.24     up  1.00000 1.00000
> >> 25  nvme  1.00000             osd.25     up  1.00000 1.00000
> >> 26  nvme  1.00000             osd.26     up  1.00000 1.00000
> >> 27  nvme  1.00000             osd.27     up  1.00000 1.00000
> >> 28  nvme  1.00000             osd.28     up  1.00000 1.00000
> >> 29  nvme  1.00000             osd.29     up  1.00000 1.00000
> >> 30  nvme  1.00000             osd.30     up  1.00000 1.00000
> >> 31  nvme  1.00000             osd.31     up  1.00000 1.00000
> >> 32  nvme  1.00000             osd.32     up  1.00000 1.00000
> >> 33  nvme  1.00000             osd.33     up  1.00000 1.00000
> >> 34  nvme  1.00000             osd.34     up  1.00000 1.00000
> >> 35  nvme  1.00000             osd.35     up  1.00000 1.00000
> >> -6       24.00000     rack rack2
> >> -2       12.00000         node cpn02
> >> 12  nvme  1.00000             osd.12     up  1.00000 1.00000
> >> 13  nvme  1.00000             osd.13     up  1.00000 1.00000
> >> 14  nvme  1.00000             osd.14     up  1.00000 1.00000
> >> 15  nvme  1.00000             osd.15     up  1.00000 1.00000
> >> 16  nvme  1.00000             osd.16     up  1.00000 1.00000
> >> 17  nvme  1.00000             osd.17     up  1.00000 1.00000
> >> 18  nvme  1.00000             osd.18     up  1.00000 1.00000
> >> 19  nvme  1.00000             osd.19     up  1.00000 1.00000
> >> 20  nvme  1.00000             osd.20     up  1.00000 1.00000
> >> 21  nvme  1.00000             osd.21     up  1.00000 1.00000
> >> 22  nvme  1.00000             osd.22     up  1.00000 1.00000
> >> 23  nvme  1.00000             osd.23     up  1.00000 1.00000
> >> -4       12.00000         node cpn04
> >> 36  nvme  1.00000             osd.36     up  1.00000 1.00000
> >> 37  nvme  1.00000             osd.37     up  1.00000 1.00000
> >> 38  nvme  1.00000             osd.38     up  1.00000 1.00000
> >> 39  nvme  1.00000             osd.39     up  1.00000 1.00000
> >> 40  nvme  1.00000             osd.40     up  1.00000 1.00000
> >> 41  nvme  1.00000             osd.41     up  1.00000 1.00000
> >> 42  nvme  1.00000             osd.42     up  1.00000 1.00000
> >> 43  nvme  1.00000             osd.43     up  1.00000 1.00000
> >> 44  nvme  1.00000             osd.44     up  1.00000 1.00000
> >> 45  nvme  1.00000             osd.45     up  1.00000 1.00000
> >> 46  nvme  1.00000             osd.46     up  1.00000 1.00000
> >> 47  nvme  1.00000             osd.47     up  1.00000 1.00000
> >>
> >> The disk partition layout of one of the OSD nodes:
> >>
> >> NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> >> nvme6n1                259:1    0   1.1T  0 disk
> >> ├─nvme6n1p2            259:15   0   1.1T  0 part
> >> └─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
> >> nvme9n1                259:0    0   1.1T  0 disk
> >> ├─nvme9n1p2            259:8    0   1.1T  0 part
> >> └─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
> >> sdb                      8:16   0 139.8G  0 disk
> >> └─sdb1                   8:17   0 139.8G  0 part
> >>   └─md0                  9:0    0 139.6G  0 raid1
> >>     ├─md0p2            259:31   0     1K  0 md
> >>     ├─md0p5            259:32   0 139.1G  0 md
> >>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
> >>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
> >>     └─md0p1            259:30   0 486.3M  0 md    /boot
> >> nvme11n1               259:2    0   1.1T  0 disk
> >> ├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
> >> └─nvme11n1p2           259:14   0   1.1T  0 part
> >> nvme2n1                259:6    0   1.1T  0 disk
> >> ├─nvme2n1p2            259:21   0   1.1T  0 part
> >> └─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
> >> nvme5n1                259:3    0   1.1T  0 disk
> >> ├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
> >> └─nvme5n1p2            259:10   0   1.1T  0 part
> >> nvme8n1                259:24   0   1.1T  0 disk
> >> ├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
> >> └─nvme8n1p2            259:28   0   1.1T  0 part
> >> nvme10n1               259:11   0   1.1T  0 disk
> >> ├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
> >> └─nvme10n1p2           259:23   0   1.1T  0 part
> >> nvme1n1                259:33   0   1.1T  0 disk
> >> ├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
> >> └─nvme1n1p2            259:35   0   1.1T  0 part
> >> nvme4n1                259:5    0   1.1T  0 disk
> >> ├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
> >> └─nvme4n1p2            259:19   0   1.1T  0 part
> >> nvme7n1                259:25   0   1.1T  0 disk
> >> ├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
> >> └─nvme7n1p2            259:29   0   1.1T  0 part
> >> sda                      8:0    0 139.8G  0 disk
> >> └─sda1                   8:1    0 139.8G  0 part
> >>   └─md0                  9:0    0 139.6G  0 raid1
> >>     ├─md0p2            259:31   0     1K  0 md
> >>     ├─md0p5            259:32   0 139.1G  0 md
> >>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
> >>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
> >>     └─md0p1            259:30   0 486.3M  0 md    /boot
> >> nvme0n1                259:36   0   1.1T  0 disk
> >> ├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
> >> └─nvme0n1p2            259:38   0   1.1T  0 part
> >> nvme3n1                259:4    0   1.1T  0 disk
> >> ├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
> >> └─nvme3n1p2            259:17   0   1.1T  0 part
> >>
> >> For the disk scheduler we're using [kyber]; for read_ahead_kb we tried
> >> different values (0, 128 and 2048); rq_affinity is set to 2, and the
> >> rotational parameter is set to 0.
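
For illustration, a minimal sketch of how those block-device settings map to
the corresponding sysfs knobs (the device name nvme0n1 is only an example, and
these runtime values are not persistent; they would normally be reapplied via a
udev rule or a boot script):

    # apply per NVMe OSD device; nvme0n1 is an example device name
    echo kyber > /sys/block/nvme0n1/queue/scheduler      # I/O scheduler (needs kernel >= 4.12)
    echo 128   > /sys/block/nvme0n1/queue/read_ahead_kb  # readahead in KiB (0/128/2048 were tried)
    echo 2     > /sys/block/nvme0n1/queue/rq_affinity    # complete I/O on the submitting CPU
    echo 0     > /sys/block/nvme0n1/queue/rotational     # mark the device as non-rotational
    cat /sys/block/nvme0n1/queue/scheduler               # verify: active scheduler shown in [brackets]
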
> >> We've also set the CPU governor to performance on all cores and tuned
> >> some sysctl parameters:
> >>
> >> # for Ceph
> >> net.ipv4.ip_forward=0
> >> net.ipv4.conf.default.rp_filter=1
> >> kernel.sysrq=0
> >> kernel.core_uses_pid=1
> >> net.ipv4.tcp_syncookies=0
> >> #net.netfilter.nf_conntrack_max=2621440
> >> #net.netfilter.nf_conntrack_tcp_timeout_established = 1800
> >> # disable netfilter on bridges
> >> #net.bridge.bridge-nf-call-ip6tables = 0
> >> #net.bridge.bridge-nf-call-iptables = 0
> >> #net.bridge.bridge-nf-call-arptables = 0
> >> vm.min_free_kbytes=1000000
> >>
> >> # Controls the default maximum size of a message queue, in bytes
> >> kernel.msgmnb = 65536
> >>
> >> # Controls the maximum size of a message, in bytes
> >> kernel.msgmax = 65536
> >>
> >> # Controls the maximum shared segment size, in bytes
> >> kernel.shmmax = 68719476736
> >>
> >> # Controls the maximum number of shared memory segments, in pages
> >> kernel.shmall = 4294967296
> >>
> >> The ceph.conf file is:
> >>
> >> ...
> >> osd_pool_default_size = 3
> >> osd_pool_default_min_size = 2
> >> osd_pool_default_pg_num = 1600
> >> osd_pool_default_pgp_num = 1600
> >>
> >> debug_crush = 1/1
> >> debug_buffer = 0/1
> >> debug_timer = 0/0
> >> debug_filer = 0/1
> >> debug_objecter = 0/1
> >> debug_rados = 0/5
> >> debug_rbd = 0/5
> >> debug_ms = 0/5
> >> debug_throttle = 1/1
> >>
> >> debug_journaler = 0/0
> >> debug_objectcatcher = 0/0
> >> debug_client = 0/0
> >> debug_osd = 0/0
> >> debug_optracker = 0/0
> >> debug_objclass = 0/0
> >> debug_journal = 0/0
> >> debug_filestore = 0/0
> >> debug_mon = 0/0
> >> debug_paxos = 0/0
> >>
> >> osd_crush_chooseleaf_type = 0
> >> filestore_xattr_use_omap = true
> >>
> >> rbd_cache = true
> >> mon_compact_on_trim = false
> >>
> >> [osd]
> >> osd_crush_update_on_start = false
> >>
> >> [client]
> >> rbd_cache = true
> >> rbd_cache_writethrough_until_flush = true
> >> rbd_default_features = 1
> >> admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> >> log_file = /var/log/ceph/
> >>
> >> The cluster has two production pools: one for OpenStack (volumes) with a
> >> replication factor of 3, and another pool for the database (db) with a
> >> replication factor of 2. The DBA team has performed several tests with a
> >> volume mounted on the DB server (via RBD). The DB server has the
> >> following configuration:
> >>
> >> OS: CentOS 6.9 | kernel 4.14.1
> >> DB: MySQL
> >> ProLiant BL685c G7
> >> 4x AMD Opteron Processor 6376 (total of 64 cores)
> >> 128G RAM
> >> 1x OneConnect 10Gb NIC (quad-port) - in a bond configuration
> >>    (active/active) with 3 vlans
> >>
> >> We also did some tests with *sysbench* on different storage types:
> >>
> >> disk           tps      qps        latency (ms), 95th percentile
> >> Local SSD      261,28   5.225,61    5,18
> >> Ceph NVMe       95,18   1.903,53   12,3
> >> Pure Storage   196,49   3.929,71    6,32
> >> NetApp FAS     189,83   3.796,59    6,67
> >> EMC VMAX       196,14   3.922,82    6,32
> >>
> >> Is there any specific tuning that I can apply to the Ceph cluster in
> >> order to improve those numbers? Or are those numbers OK for the type and
> >> size of cluster that we have? Any advice would be really appreciated.
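
The exact sysbench invocation isn't given in the thread; the following is only
a sketch of the kind of OLTP run that produces tps/qps/95th-percentile numbers
like those in the table above, assuming sysbench 1.0 with its bundled
oltp_read_write workload and placeholder hostname, credentials and table sizes:

    # prepare the test tables (placeholder connection parameters)
    sysbench oltp_read_write --db-driver=mysql \
        --mysql-host=db-server --mysql-user=sbtest --mysql-password=secret \
        --tables=10 --table-size=1000000 prepare

    # run the mixed read/write OLTP workload; transactions/s, queries/s and
    # latency percentiles are printed in the final report
    sysbench oltp_read_write --db-driver=mysql \
        --mysql-host=db-server --mysql-user=sbtest --mysql-password=secret \
        --tables=10 --table-size=1000000 \
        --threads=16 --time=300 --report-interval=10 run
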
> >>
> >> Thanks,
> >>
> >> *German*
> >
> > --
> > Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com