Re: ceph all-nvme mysql performance tuning

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Wed, 29 Nov 2017 10:24:44 +0200

Hi German,
I would personally benchmark the cluster first with rados bench or fio, which are the more common tools, and only later run MySQL-specific tests with sysbench. Another thing to try is running the client test simultaneously on more than one machine and adding up the performance numbers from each. The limit may be on the client side, where resources can be stressed differently depending on which storage backend you are testing.
Maged
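
A minimal sketch of the aggregation step, assuming each client run has already been reduced to a small record of IOPS, bandwidth and mean latency. The field names and the numbers are made up for illustration; they are not the output format of rados bench or fio. Throughput figures add up across clients, while latency does not, so it is averaged weighted by each client's IOPS.

```python
from dataclasses import dataclass


@dataclass
class ClientResult:
    """One client's benchmark summary (hypothetical fields, not a tool format)."""
    host: str
    iops: float
    bandwidth_mb_s: float
    mean_latency_ms: float


def aggregate(results):
    """Combine results from clients that ran at the same time."""
    total_iops = sum(r.iops for r in results)
    total_bw = sum(r.bandwidth_mb_s for r in results)
    # Weight each client's latency by the number of ops it contributed.
    weighted_latency = (
        sum(r.mean_latency_ms * r.iops for r in results) / total_iops
        if total_iops else 0.0
    )
    return {
        "clients": len(results),
        "iops": total_iops,
        "bandwidth_mb_s": total_bw,
        "mean_latency_ms": weighted_latency,
    }


if __name__ == "__main__":
    # Placeholder numbers purely to show the call; not measurements.
    runs = [
        ClientResult("client-a", 10000.0, 156.0, 3.0),
        ClientResult("client-b", 30000.0, 468.0, 1.0),
    ]
    print(aggregate(runs))
```

If the aggregate keeps growing as clients are added, the single-client figure was limited by the client rather than by the cluster.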

On 2017-11-28 21:20, German Anders wrote:

I don't know if there are any statistics available, really, but I'm running some sysbench tests with MySQL before the changes, and the idea is to run those tests again after the 'tuning' and see whether the numbers improve in any way. I'm also gathering numbers from the collectd and statsd collectors running on the OSD nodes, so I hope to get some info from those as well :)

German

2017-11-28 16:12 GMT-03:00 Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx>:

 I was wondering if there are any statistics available that show the
 performance increase of doing such things?

 -----Original Message-----
 From: German Anders [mailto:ganders@xxxxxxxxxxxx]
 Sent: Tuesday 28 November 2017 19:34
 To: Luis Periquito
 Cc: ceph-users
 Subject: Re:  ceph all-nvme mysql performance tuning

 Thanks a lot Luis, I agree with you regarding the CPUs, but
 unfortunately that was the best CPU model we could afford :S

 For the NUMA part, I managed to pin the OSDs by changing the
 /usr/lib/systemd/system/ceph-osd@.service file and adding a
 CPUAffinity list to it. But that pins ALL the OSDs to the same nodes
 or the same CPU list, and I can't find a way to specify a list for
 only a specific subset of the OSDs.
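
One possible way to get per-OSD affinity, sketched under assumptions: systemd reads drop-in files from an instance-specific directory such as /etc/systemd/system/ceph-osd@3.service.d/, so each OSD instance can carry its own CPUAffinity= line without editing the shared template. The helper below only builds the paths and file contents. The OSD-to-CPU mapping, the CPU numbering and the file name numa.conf are hypothetical; the real CPU layout should be checked first, for example with lscpu.

```python
import tempfile
from pathlib import Path

DROPIN_ROOT = Path("/etc/systemd/system")


def dropin_path(osd_id, root=DROPIN_ROOT):
    """Instance-specific drop-in location for one ceph-osd@<id> unit."""
    return root / f"ceph-osd@{osd_id}.service.d" / "numa.conf"


def dropin_text(cpus):
    cpu_list = " ".join(str(c) for c in cpus)
    # The empty assignment resets any mask inherited from the template unit.
    return f"[Service]\nCPUAffinity=\nCPUAffinity={cpu_list}\n"


def write_dropins(mapping, root=DROPIN_ROOT):
    """Write one drop-in per OSD id; mapping is {osd_id: [cpu, ...]}."""
    written = []
    for osd_id, cpus in sorted(mapping.items()):
        path = dropin_path(osd_id, root)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(dropin_text(cpus))
        written.append(path)
    return written


if __name__ == "__main__":
    # Hypothetical layout: OSDs 0-11 on CPUs 0-9, OSDs 12-23 on CPUs 10-19.
    mapping = {osd: list(range(0, 10)) for osd in range(0, 12)}
    mapping.update({osd: list(range(10, 20)) for osd in range(12, 24)})
    # Written to a scratch directory here so nothing under /etc is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_dropins(mapping, root=Path(tmp)):
            print(path.parent.name, "->", path.read_text().splitlines()[-1])
```

After writing real drop-ins, a `systemctl daemon-reload` and a restart of the affected OSDs would be needed for the new affinity to take effect.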

 Also, I noticed that the NVMe disks are all on the same NUMA node (since
 I'm using half of the shelf; the other half will be attached to the
 other node), so the PCIe lanes of the NVMe disks are all on the same CPU
 (in this case CPU 0). I also found that the IB adapter used for the OSD
 network (OSD replication) is attached to CPU 1, so that traffic will
 cross the QPI path.
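
To check which socket each device hangs off, Linux exposes a numa_node attribute for PCI devices in sysfs and a cpulist file per NUMA node. A rough sketch of reading both follows; the helper names are made up, and a value of -1 means the kernel reports no NUMA affinity for that device.

```python
from pathlib import Path

SYSFS = Path("/sys")


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-9,20-29' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def device_numa_node(class_name, dev, sysfs=SYSFS):
    """NUMA node of a device, e.g. ('nvme', 'nvme0') or ('net', 'ib0')."""
    path = sysfs / "class" / class_name / dev / "device" / "numa_node"
    return int(path.read_text().strip())


def node_cpus(node, sysfs=SYSFS):
    """CPUs that belong to one NUMA node."""
    path = sysfs / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text())


if __name__ == "__main__":
    class_dir = SYSFS / "class" / "nvme"
    if class_dir.is_dir():
        for ctrl in sorted(p.name for p in class_dir.iterdir()):
            node = device_numa_node("nvme", ctrl)
            cpus = node_cpus(node) if node >= 0 else []
            print(ctrl, "node", node, "cpus", cpus)
```

The same lookup with the `net` class and the interface name of the replication network shows whether the adapter sits on the other socket.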

 As for the memory, as mentioned in the other email, we are already
 using the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value
 of 134217728 (128 MiB).

 For now I can pin all the current OSDs to CPU 0, but in the near
 future, when I add more NVMe disks to the OSD nodes, I'll definitely need
 to pin the other half of the OSDs to CPU 1. Has anyone already done this?

 Thanks a lot,

 Best,

 German

 2017-11-28 6:36 GMT-03:00 Luis Periquito <periquito@xxxxxxxxx>:

         There are a few things I don't like about your machines... If you
 want latency/IOPS (as you seemingly do) you really want the highest
 frequency CPUs, even over number of cores. These are not too bad, but
 not great either.

         Also, you have 2x CPUs, meaning NUMA. Have you pinned the OSDs to NUMA
 nodes? Ideally each OSD is pinned to the same NUMA node that its NVMe
 device is connected to. Each NVMe device will be running on PCIe lanes
 provided by one of the CPUs...

         What versions of TCMalloc (or jemalloc) are you running? Have you
 tuned them to have a bigger cache?

         These are from what I've learned using filestore - I've yet to run
 full tests on bluestore - but they should still apply...

         On Mon, Nov 27, 2017 at 5:10 PM, German Anders
 <ganders@xxxxxxxxxxxx> wrote:

                 Hi Nick,

                 Yeah, we are using the same NVMe disk, with an additional
 partition used as journal/WAL. We double-checked the C-state and it was
 not configured to use C1, so we changed that on all the OSD nodes and MON
 nodes. We're going to run some new tests and see how it goes. I'll
 get back as soon as we've got those tests running.

                 Thanks a lot,

                 Best,

                 German

                 2017-11-27 12:16 GMT-03:00 Nick Fisk <nick@xxxxxxxxxx>:

                         From: ceph-users
 [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of German Anders
                         Sent: 27 November 2017 14:44
                         To: Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
                         Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
                         Subject: Re:  ceph all-nvme mysql performance tuning

                         Hi Maged,

                         Thanks a lot for the response. We tried different
 numbers of threads and we're getting almost the same kind of difference
 between the storage types. We're going to try different RBD stripe size
 and object size values and see if we get more competitive numbers. We'll
 get back with more tests and parameter changes to see if things improve :)

                         Just to echo a couple of comments: Ceph will always
 struggle to match the performance of a traditional array, for two main
 reasons.

                         1. You are replacing some sort of dual-ported SAS or
 internally RDMA-connected device with a network for Ceph replication
 traffic. This instantly has a large impact on write latency.
                         2. Ceph locks at the PG level, and a PG will most
 likely cover at least one 4MB object, so lots of small accesses to the
 same blocks (on a block device) will wait on each other and proceed
 effectively at a single-threaded rate.
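
To make the second point concrete, a back-of-the-envelope sketch: if writes that hit the same PG are serialized, the rate for that hot spot is bounded by one over the per-write latency, however many client threads are queued behind it. The latency figures below are illustrative assumptions, not measurements from this cluster.

```python
def serialized_iops(latency_ms):
    """Upper bound on ops/s when each op must wait for the previous one."""
    return 1000.0 / latency_ms


def parallel_iops(latency_ms, independent_streams):
    """Upper bound when ops are spread over streams that do not block each other."""
    return independent_streams * serialized_iops(latency_ms)


if __name__ == "__main__":
    for latency in (0.5, 1.0, 2.0):  # assumed per-write latencies in ms
        print(
            f"{latency} ms per write: "
            f"one PG <= {serialized_iops(latency):.0f} ops/s, "
            f"32 independent PGs <= {parallel_iops(latency, 32):.0f} ops/s"
        )
```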

                         The best things you can do to mitigate these are to run
 the fastest journal/WAL devices you can and the fastest network
 connections (e.g. 25Gb/s), and to run your CPUs at max C and P states.

                         You stated that you are running the performance profile
 on the CPUs. Could you also just double-check that the C-states are
 being held at C1(e)? There are a few utilities that can show this in
 real time.
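
Besides the interactive utilities, the cpuidle counters in sysfs can be read directly. A small sketch that lists, for one CPU, each idle state's name, how many times it has been entered and the total time spent in it; sampling it twice shows whether states deeper than C1/C1E are still being used. The function name is made up, and the state names depend on the idle driver in use.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_cstates(cpu=0, root=CPU_ROOT):
    """Return [(name, entries, time_us), ...] for one CPU, in state order."""
    base = root / f"cpu{cpu}" / "cpuidle"
    states = []
    if not base.is_dir():
        return states  # cpuidle not exposed, e.g. inside some VMs
    dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for d in dirs:
        name = (d / "name").read_text().strip()
        usage = int((d / "usage").read_text())    # times the state was entered
        time_us = int((d / "time").read_text())   # total residency, microseconds
        states.append((name, usage, time_us))
    return states


if __name__ == "__main__":
    for name, usage, time_us in read_cstates(0):
        print(f"{name:<8} entered {usage} times, {time_us} us total")
```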

                         Other than that, although there could be some minor
 tweaks, you are probably nearing the limit of what you can hope to
 achieve.

                         Nick

                         Thanks,

                         Best,

                         German

                         2017-11-27 11:36 GMT-03:00 Maged Mokhtar
 <mmokhtar@xxxxxxxxxxx>:

                                 On 2017-11-27 15:02, German Anders wrote:

                                         Hi All,

                                         I have a performance question. We recently
 installed a brand-new Ceph cluster with all-NVMe disks, using Ceph version
 12.2.0 with BlueStore configured. The back end of the cluster uses an
 IPoIB bond (active/passive), and for the front end we use an
 active/active bonding config (20GbE) to communicate with the
 clients.

                                         The cluster configuration is the following:

                                         MON Nodes:
                                         OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
                                         3x 1U servers:
                                           2x Intel Xeon E5-2630v4 @2.2GHz
                                           128G RAM
                                           2x Intel SSD DC S3520 150G (in RAID-1 for OS)
                                           2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

                                         OSD Nodes:
                                         OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
                                         4x 2U servers:
                                           2x Intel Xeon E5-2640v4 @2.4GHz
                                           128G RAM
                                           2x Intel SSD DC S3520 150G (in RAID-1 for OS)
                                           1x Ethernet Controller 10G X550T
                                           1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
                                           12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
                                           1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

                                         Here's the tree:

                                         ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
                                         -7       48.00000 root root
                                         -5       24.00000     rack rack1
                                         -1       12.00000         node cpn01
                                          0  nvme  1.00000             osd.0      up  1.00000 1.00000
                                          1  nvme  1.00000             osd.1      up  1.00000 1.00000
                                          2  nvme  1.00000             osd.2      up  1.00000 1.00000
                                          3  nvme  1.00000             osd.3      up  1.00000 1.00000
                                          4  nvme  1.00000             osd.4      up  1.00000 1.00000
                                          5  nvme  1.00000             osd.5      up  1.00000 1.00000
                                          6  nvme  1.00000             osd.6      up  1.00000 1.00000
                                          7  nvme  1.00000             osd.7      up  1.00000 1.00000
                                          8  nvme  1.00000             osd.8      up  1.00000 1.00000
                                          9  nvme  1.00000             osd.9      up  1.00000 1.00000
                                         10  nvme  1.00000             osd.10     up  1.00000 1.00000
                                         11  nvme  1.00000             osd.11     up  1.00000 1.00000
                                         -3       12.00000         node cpn03
                                         24  nvme  1.00000             osd.24     up  1.00000 1.00000
                                         25  nvme  1.00000             osd.25     up  1.00000 1.00000
                                         26  nvme  1.00000             osd.26     up  1.00000 1.00000
                                         27  nvme  1.00000             osd.27     up  1.00000 1.00000
                                         28  nvme  1.00000             osd.28     up  1.00000 1.00000
                                         29  nvme  1.00000             osd.29     up  1.00000 1.00000
                                         30  nvme  1.00000             osd.30     up  1.00000 1.00000
                                         31  nvme  1.00000             osd.31     up  1.00000 1.00000
                                         32  nvme  1.00000             osd.32     up  1.00000 1.00000
                                         33  nvme  1.00000             osd.33     up  1.00000 1.00000
                                         34  nvme  1.00000             osd.34     up  1.00000 1.00000
                                         35  nvme  1.00000             osd.35     up  1.00000 1.00000
                                         -6       24.00000     rack rack2
                                         -2       12.00000         node cpn02
                                         12  nvme  1.00000             osd.12     up  1.00000 1.00000
                                         13  nvme  1.00000             osd.13     up  1.00000 1.00000
                                         14  nvme  1.00000             osd.14     up  1.00000 1.00000
                                         15  nvme  1.00000             osd.15     up  1.00000 1.00000
                                         16  nvme  1.00000             osd.16     up  1.00000 1.00000
                                         17  nvme  1.00000             osd.17     up  1.00000 1.00000
                                         18  nvme  1.00000             osd.18     up  1.00000 1.00000
                                         19  nvme  1.00000             osd.19     up  1.00000 1.00000
                                         20  nvme  1.00000             osd.20     up  1.00000 1.00000
                                         21  nvme  1.00000             osd.21     up  1.00000 1.00000
                                         22  nvme  1.00000             osd.22     up  1.00000 1.00000
                                         23  nvme  1.00000             osd.23     up  1.00000 1.00000
                                         -4       12.00000         node cpn04
                                         36  nvme  1.00000             osd.36     up  1.00000 1.00000
                                         37  nvme  1.00000             osd.37     up  1.00000 1.00000
                                         38  nvme  1.00000             osd.38     up  1.00000 1.00000
                                         39  nvme  1.00000             osd.39     up  1.00000 1.00000
                                         40  nvme  1.00000             osd.40     up  1.00000 1.00000
                                         41  nvme  1.00000             osd.41     up  1.00000 1.00000
                                         42  nvme  1.00000             osd.42     up  1.00000 1.00000
                                         43  nvme  1.00000             osd.43     up  1.00000 1.00000
                                         44  nvme  1.00000             osd.44     up  1.00000 1.00000
                                         45  nvme  1.00000             osd.45     up  1.00000 1.00000
                                         46  nvme  1.00000             osd.46     up  1.00000 1.00000
                                         47  nvme  1.00000             osd.47     up  1.00000 1.00000

                                         The disk partition of one of the OSD nodes:

                                         NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
                                         nvme6n1                259:1    0   1.1T  0 disk
                                         ├─nvme6n1p2            259:15   0   1.1T  0 part
                                         └─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
                                         nvme9n1                259:0    0   1.1T  0 disk
                                         ├─nvme9n1p2            259:8    0   1.1T  0 part
                                         └─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
                                         sdb                      8:16   0 139.8G  0 disk
                                         └─sdb1                   8:17   0 139.8G  0 part
                                           └─md0                  9:0    0 139.6G  0 raid1
                                             ├─md0p2            259:31   0     1K  0 md
                                             ├─md0p5            259:32   0 139.1G  0 md
                                             │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
                                             │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
                                             └─md0p1            259:30   0 486.3M  0 md    /boot
                                         nvme11n1               259:2    0   1.1T  0 disk
                                         ├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
                                         └─nvme11n1p2           259:14   0   1.1T  0 part
                                         nvme2n1                259:6    0   1.1T  0 disk
                                         ├─nvme2n1p2            259:21   0   1.1T  0 part
                                         └─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
                                         nvme5n1                259:3    0   1.1T  0 disk
                                         ├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
                                         └─nvme5n1p2            259:10   0   1.1T  0 part
                                         nvme8n1                259:24   0   1.1T  0 disk
                                         ├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
                                         └─nvme8n1p2            259:28   0   1.1T  0 part
                                         nvme10n1               259:11   0   1.1T  0 disk
                                         ├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
                                         └─nvme10n1p2           259:23   0   1.1T  0 part
                                         nvme1n1                259:33   0   1.1T  0 disk
                                         ├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
                                         └─nvme1n1p2            259:35   0   1.1T  0 part
                                         nvme4n1                259:5    0   1.1T  0 disk
                                         ├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
                                         └─nvme4n1p2            259:19   0   1.1T  0 part
                                         nvme7n1                259:25   0   1.1T  0 disk
                                         ├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
                                         └─nvme7n1p2            259:29   0   1.1T  0 part
                                         sda                      8:0    0 139.8G  0 disk
                                         └─sda1                   8:1    0 139.8G  0 part
                                           └─md0                  9:0    0 139.6G  0 raid1
                                             ├─md0p2            259:31   0     1K  0 md
                                             ├─md0p5            259:32   0 139.1G  0 md
                                             │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
                                             │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
                                             └─md0p1            259:30   0 486.3M  0 md    /boot
                                         nvme0n1                259:36   0   1.1T  0 disk
                                         ├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
                                         └─nvme0n1p2            259:38   0   1.1T  0 part
                                         nvme3n1                259:4    0   1.1T  0 disk
                                         ├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
                                         └─nvme3n1p2            259:17   0   1.1T  0 part

                                         For the disk scheduler we're using [kyber]; for
 read_ahead_kb we tried different values (0, 128 and 2048); rq_affinity
 is set to 2 and the rotational parameter is set to 0.
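
A small sketch for confirming that these queue settings are actually in effect on every NVMe device, by reading them back from /sys/block/<dev>/queue. The helper names are made up. The scheduler file lists all available schedulers with the active one in square brackets, which is why it is parsed separately.

```python
from pathlib import Path

QUEUE_KEYS = ("scheduler", "read_ahead_kb", "rq_affinity", "rotational")


def active_scheduler(raw):
    """Pick the bracketed entry out of e.g. 'mq-deadline [kyber] none'."""
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return raw.strip()


def queue_settings(dev, sysfs=Path("/sys")):
    """Read the four queue attributes mentioned above for one block device."""
    queue = sysfs / "block" / dev / "queue"
    values = {}
    for key in QUEUE_KEYS:
        raw = (queue / key).read_text().strip()
        values[key] = active_scheduler(raw) if key == "scheduler" else raw
    return values


if __name__ == "__main__":
    block = Path("/sys/block")
    if block.is_dir():
        names = sorted(p.name for p in block.iterdir() if p.name.startswith("nvme"))
        for dev in names:
            print(dev, queue_settings(dev))
```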

                                         We've also set the CPU governor to performance
 on all the cores, and tuned some sysctl parameters:

                                         # for Ceph
                                         net.ipv4.ip_forward=0
                                         net.ipv4.conf.default.rp_filter=1
                                         kernel.sysrq=0
                                         kernel.core_uses_pid=1
                                         net.ipv4.tcp_syncookies=0
                                         #net.netfilter.nf_conntrack_max=2621440
                                         #net.netfilter.nf_conntrack_tcp_timeout_established = 1800
                                         # disable netfilter on bridges
                                         #net.bridge.bridge-nf-call-ip6tables = 0
                                         #net.bridge.bridge-nf-call-iptables = 0
                                         #net.bridge.bridge-nf-call-arptables = 0
                                         vm.min_free_kbytes=1000000
                                         # Controls the default maximum size of a message queue, in bytes
                                         kernel.msgmnb = 65536
                                         # Controls the maximum size of a message, in bytes
                                         kernel.msgmax = 65536
                                         # Controls the maximum shared segment size, in bytes
                                         kernel.shmmax = 68719476736
                                         # Controls the maximum total amount of shared memory, in pages
                                         kernel.shmall = 4294967296

                                         The ceph.conf file is:

                                         ...
                                         osd_pool_default_size = 3
                                         osd_pool_default_min_size = 2
                                         osd_pool_default_pg_num = 1600
                                         osd_pool_default_pgp_num = 1600
                                         debug_crush = 1/1
                                         debug_buffer = 0/1
                                         debug_timer = 0/0
                                         debug_filer = 0/1
                                         debug_objecter = 0/1
                                         debug_rados = 0/5
                                         debug_rbd = 0/5
                                         debug_ms = 0/5
                                         debug_throttle = 1/1
                                         debug_journaler = 0/0
                                         debug_objectcatcher = 0/0
                                         debug_client = 0/0
                                         debug_osd = 0/0
                                         debug_optracker = 0/0
                                         debug_objclass = 0/0
                                         debug_journal = 0/0
                                         debug_filestore = 0/0
                                         debug_mon = 0/0
                                         debug_paxos = 0/0
                                         osd_crush_chooseleaf_type = 0
                                         filestore_xattr_use_omap = true
                                         rbd_cache = true
                                         mon_compact_on_trim = false

                                         [osd]
                                         osd_crush_update_on_start = false

                                         [client]
                                         rbd_cache = true
                                         rbd_cache_writethrough_until_flush = true
                                         rbd_default_features = 1
                                         admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
                                         log_file = /var/log/ceph/

                                         The cluster has two production pools, one for
 OpenStack (volumes) with an RF of 3 and another for databases (db) with
 an RF of 2. The DBA team has performed several tests with a volume
 mounted on the DB server (with RBD). The DB server has the following
 configuration:

                                         OS: CentOS 6.9 | kernel 4.14.1
                                         DB: MySQL
                                         ProLiant BL685c G7
                                         4x AMD Opteron Processor 6376 (total of 64 cores)
                                         128G RAM
                                         1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 vlans

                                         We also did some tests with sysbench on
 different storage types:

                                         sysbench results:

                                         disk             tps       qps    latency (ms) 95th percentile
                                         Local SSD     261.28   5225.61     5.18
                                         Ceph NVMe      95.18   1903.53    12.3
                                         Pure Storage  196.49   3929.71     6.32
                                         NetApp FAS    189.83   3796.59     6.67
                                         EMC VMAX      196.14   3922.82     6.32

                                         Is there any specific tuning that I can apply
 to the ceph cluster, in order to improve those numbers? Or are those
 numbers ok for the type and size of the cluster that we have? Any advice
 would be really appreciated.
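
For comparing the backends at a glance, a short sketch that expresses each sysbench result relative to a chosen baseline. The figures are transcribed from the table in this message (tps, qps, 95th percentile latency in ms); the function name and output layout are just for illustration.

```python
# Figures transcribed from the sysbench table in this message:
# (tps, qps, 95th percentile latency in ms).
RESULTS = {
    "Local SSD": (261.28, 5225.61, 5.18),
    "Ceph NVMe": (95.18, 1903.53, 12.3),
    "Pure Storage": (196.49, 3929.71, 6.32),
    "NetApp FAS": (189.83, 3796.59, 6.67),
    "EMC VMAX": (196.14, 3922.82, 6.32),
}


def relative_to(results, baseline):
    """Return {name: (tps ratio, latency ratio)} against the baseline row."""
    base_tps, _base_qps, base_lat = results[baseline]
    rows = {}
    for name, (tps, _qps, lat) in results.items():
        rows[name] = (tps / base_tps, lat / base_lat)
    return rows


if __name__ == "__main__":
    for name, (tps_ratio, lat_ratio) in relative_to(RESULTS, "Ceph NVMe").items():
        print(f"{name:<13} tps x{tps_ratio:.2f}  p95 latency x{lat_ratio:.2f}")
```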

                                         Thanks,

                                         German

                                         _______________________________________________
                                         ceph-users mailing list
                                         ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>

                                 Hi,

                                 What is the value of --num-threads (the default is 1)?
 Ceph will do better with more threads: 32 or 64.
                                 What are the values of --file-block-size (default 16k)
 and --file-test-mode? If you are using sequential seqwr/seqrd you will be
 hitting the same OSD, so maybe try random (rndrd/rndwr) or, better, use an
 RBD stripe size of 16KB (the default RBD stripe is 4M). RBD striping is
 ideal for the small-block sequential I/O pattern typical of databases.

                                 /Maged
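
To see why the stripe settings matter for small sequential I/O, here is a sketch of the usual Ceph striping arithmetic (stripe unit, stripe count, object size) that maps an image offset to an object number. With the default layout (4M stripe unit, stripe count 1), consecutive 16k blocks land in the same object; with a 16k stripe unit and a stripe count above 1, they rotate across objects. The stripe count of 16 is an assumed value chosen for illustration, not a recommendation from this thread.

```python
KIB = 1024
MIB = 1024 * KIB


def object_for_offset(offset, stripe_unit, stripe_count, object_size):
    """Return the object number that holds the byte at `offset`."""
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit        # which stripe unit overall
    stripe_no = block_no // stripe_count    # which row of stripe units
    stripe_pos = block_no % stripe_count    # column within that row
    object_set = stripe_no // stripes_per_object
    return object_set * stripe_count + stripe_pos


def objects_touched(io_size, io_count, **layout):
    """Object numbers hit by `io_count` sequential I/Os of `io_size` bytes."""
    return [object_for_offset(i * io_size, **layout) for i in range(io_count)]


if __name__ == "__main__":
    default = dict(stripe_unit=4 * MIB, stripe_count=1, object_size=4 * MIB)
    striped = dict(stripe_unit=16 * KIB, stripe_count=16, object_size=4 * MIB)
    print("default:", sorted(set(objects_touched(16 * KIB, 64, **default))))
    print("striped:", sorted(set(objects_touched(16 * KIB, 64, **striped))))
```

Different objects generally map to different PGs, so the striped layout spreads a sequential stream over more OSDs at once.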
