Re: ceph all-nvme mysql performance tuning

There are a few things I don't like about your machines. If you want low latency and high IOPS (as you seemingly do), you really want the highest-frequency CPUs, even at the expense of core count. These are not too bad, but not great either.

Also, you have two CPUs per node, which means NUMA. Have you pinned the OSDs to NUMA nodes? Ideally each OSD is pinned to the same NUMA node its NVMe device is connected to, since each NVMe device runs on PCIe lanes provided by one of the CPUs.
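
To see which NUMA node each NVMe controller hangs off, something like the following can help. It is only a sketch: it assumes the usual Linux sysfs layout (`/sys/class/nvme/<ctrl>/device/numa_node` and `/sys/devices/system/node/node<N>/cpulist`), and the function names are made up for illustration.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-9,20-29' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def nvme_numa_map(sys_root="/sys"):
    """Return {controller: (numa_node, [cpu ids])} for every NVMe controller found."""
    result = {}
    for ctrl in sorted(Path(sys_root, "class/nvme").glob("nvme*")):
        node_file = ctrl / "device" / "numa_node"
        if not node_file.exists():
            continue
        node = int(node_file.read_text().strip())  # -1 means the kernel does not know
        cpulist = Path(sys_root, f"devices/system/node/node{node}/cpulist")
        cpus = parse_cpulist(cpulist.read_text()) if cpulist.exists() else []
        result[ctrl.name] = (node, cpus)
    return result


if __name__ == "__main__":
    for name, (node, cpus) in nvme_numa_map().items():
        print(f"{name}: NUMA node {node}, local CPUs {cpus}")
```

The CPU list per controller is what you would feed into whatever pinning mechanism you prefer (for example `numactl --cpunodebind/--membind` or a systemd CPU affinity setting for the OSD unit).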

What versions of TCMalloc (or jemalloc) are you running? Have you tuned them to have a bigger cache?
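
For TCMalloc, the cache size is normally raised through the gperftools environment variable `TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES` (on Debian/Ubuntu packages this is typically set in `/etc/default/ceph`). A rough way to check what the running OSDs actually picked up is sketched below; it assumes the processes are named `ceph-osd` and that you run it as root so `/proc/<pid>/environ` is readable.

```python
import os

VAR = "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"


def parse_environ(raw):
    """Parse the NUL-separated contents of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode()] = value.decode(errors="replace")
    return env


def osd_cache_settings(proc_root="/proc"):
    """Return {pid: value or None} of the TCMalloc cache variable for each ceph-osd."""
    settings = {}
    if not os.path.isdir(proc_root):
        return settings
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        try:
            with open(f"{proc_root}/{pid}/comm") as fh:
                if fh.read().strip() != "ceph-osd":
                    continue
            with open(f"{proc_root}/{pid}/environ", "rb") as fh:
                settings[int(pid)] = parse_environ(fh.read()).get(VAR)
        except OSError:
            continue  # process exited in the meantime, or permission denied
    return settings


if __name__ == "__main__":
    for pid, value in sorted(osd_cache_settings().items()):
        print(f"ceph-osd pid {pid}: {VAR}={value}")
```

A value of `None` means the variable is not set for that OSD, so TCMalloc is running with its built-in default.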

These are from what I've learned using filestore - I've yet to run full tests on bluestore - but they should still apply...

On Mon, Nov 27, 2017 at 5:10 PM, German Anders <ganders@xxxxxxxxxxxx> wrote:
Hi Nick, 

Yeah, we are using the same NVMe disk with an additional partition as the journal/WAL. We double-checked the C-states and they were not configured to stay at C1, so we changed that on all the OSD and MON nodes. We're going to run some new tests and see how it goes. I'll get back as soon as we've got those tests running.

Thanks a lot,

Best,


German

2017-11-27 12:16 GMT-03:00 Nick Fisk <nick@xxxxxxxxxx>:

From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of German Anders
Sent: 27 November 2017 14:44
To: Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: ceph all-nvme mysql performance tuning

 

Hi Maged,

 

Thanks a lot for the response. We tried different numbers of threads and we're getting almost the same difference between the storage types. We're going to try different rbd stripe size and object size values and see if we get more competitive numbers. I'll get back with more tests and parameter changes to see if things improve :)

 

 

Just to echo a couple of comments: Ceph will always struggle to match the performance of a traditional array, for two main reasons.

 

  1. You are replacing some sort of dual-ported SAS or internally RDMA-connected device with a network for Ceph replication traffic. This will instantly have a large impact on write latency.
  2. Ceph locks at the PG level and a PG will most likely cover at least one 4MB object, so lots of small accesses to the same blocks (on a block device) will wait on each other and effectively run at a single-threaded rate.
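
Point 2 can be put in rough numbers with Little's law (throughput = concurrency / latency). This is only an illustration: the 1 ms per replicated write and the queue depth of 32 are hypothetical figures, not measurements from this cluster.

```python
def max_iops(latency_ms, in_flight=1):
    """Upper bound on IOPS from Little's law: throughput = concurrency / latency."""
    return in_flight * 1000.0 / latency_ms


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    serialized = max_iops(latency_ms=1.0, in_flight=1)   # ops queue behind one PG lock
    spread_out = max_iops(latency_ms=1.0, in_flight=32)  # ops land on independent PGs
    print(f"serialized on one PG: {serialized:.0f} IOPS")
    print(f"spread over 32 PGs:   {spread_out:.0f} IOPS")
```

With everything funnelled through one PG, per-operation latency is the only lever; once the I/O spreads across PGs, concurrency starts to help.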

 

The best thing you can do to mitigate these is to run the fastest journal/WAL devices you can, the fastest network connections (e.g. 25Gb/s), and keep your CPUs at their highest-performance P-state with C-states limited to C1(e).

 

You stated that you are running the performance profile on the CPUs. Could you also just double-check that the C-states are being held at C1(e)? There are a few utilities that can show this in real time.
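
Besides the usual utilities, the kernel exposes the idle states directly in sysfs. The sketch below assumes the standard cpuidle layout (`/sys/devices/system/cpu/cpu<N>/cpuidle/state<M>/{name,latency,disable,usage}`); treating only POLL, C1 and C1E as "shallow" is my own simplification for illustration.

```python
from pathlib import Path

SHALLOW = {"POLL", "C1", "C1E"}  # assumption: anything else counts as a deep state


def idle_states(cpu=0, sys_root="/sys"):
    """List the cpuidle states of one CPU with their exit latency and usage count."""
    base = Path(sys_root, f"devices/system/cpu/cpu{cpu}/cpuidle")
    states = []
    for d in sorted(base.glob("state*"), key=lambda p: int(p.name[5:])):
        states.append({
            "name": (d / "name").read_text().strip(),
            "latency_us": int((d / "latency").read_text()),
            "disabled": (d / "disable").read_text().strip() != "0",
            "usage": int((d / "usage").read_text()),
        })
    return states


def deeper_than_c1(states):
    """Names of states deeper than C1(E) that are still enabled."""
    deep = []
    for s in states:
        base_name = s["name"].split("-")[0].upper()  # 'C1-BDW' -> 'C1' on older kernels
        if base_name not in SHALLOW and not s["disabled"]:
            deep.append(s["name"])
    return deep


if __name__ == "__main__":
    states = idle_states(cpu=0)
    for s in states:
        print(s)
    print("enabled states deeper than C1(E):", deeper_than_c1(states))
```

If the last line prints a non-empty list and the `usage` counters of those states keep growing between runs, the cores are still dropping into deep sleep.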

 

Other than that, although there could be some minor tweaks, you are probably nearing the limit of what you can hope to achieve.

 

Nick

 

 

Thanks,

 

Best,


German

 

2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokhtar@xxxxxxxxxxx>:

On 2017-11-27 15:02, German Anders wrote:

Hi All,

 

I have a performance question. We recently installed a brand new Ceph cluster with all-NVMe disks, running Ceph version 12.2.0 with bluestore. The back-end of the cluster uses an IPoIB bond (active/passive), and the front-end uses an active/active bond (20GbE) to communicate with the clients.

 

The cluster configuration is the following:

 

MON Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
  2x Intel Xeon E5-2630v4 @2.2GHz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

OSD Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
  2x Intel Xeon E5-2640v4 @2.4GHz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  1x Ethernet Controller 10G X550T
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

 

 

Here's the tree:

 

ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
-7       48.00000 root root
-5       24.00000     rack rack1
-1       12.00000         node cpn01
 0  nvme  1.00000             osd.0      up  1.00000 1.00000
 1  nvme  1.00000             osd.1      up  1.00000 1.00000
 2  nvme  1.00000             osd.2      up  1.00000 1.00000
 3  nvme  1.00000             osd.3      up  1.00000 1.00000
 4  nvme  1.00000             osd.4      up  1.00000 1.00000
 5  nvme  1.00000             osd.5      up  1.00000 1.00000
 6  nvme  1.00000             osd.6      up  1.00000 1.00000
 7  nvme  1.00000             osd.7      up  1.00000 1.00000
 8  nvme  1.00000             osd.8      up  1.00000 1.00000
 9  nvme  1.00000             osd.9      up  1.00000 1.00000
10  nvme  1.00000             osd.10     up  1.00000 1.00000
11  nvme  1.00000             osd.11     up  1.00000 1.00000
-3       12.00000         node cpn03
24  nvme  1.00000             osd.24     up  1.00000 1.00000
25  nvme  1.00000             osd.25     up  1.00000 1.00000
26  nvme  1.00000             osd.26     up  1.00000 1.00000
27  nvme  1.00000             osd.27     up  1.00000 1.00000
28  nvme  1.00000             osd.28     up  1.00000 1.00000
29  nvme  1.00000             osd.29     up  1.00000 1.00000
30  nvme  1.00000             osd.30     up  1.00000 1.00000
31  nvme  1.00000             osd.31     up  1.00000 1.00000
32  nvme  1.00000             osd.32     up  1.00000 1.00000
33  nvme  1.00000             osd.33     up  1.00000 1.00000
34  nvme  1.00000             osd.34     up  1.00000 1.00000
35  nvme  1.00000             osd.35     up  1.00000 1.00000
-6       24.00000     rack rack2
-2       12.00000         node cpn02
12  nvme  1.00000             osd.12     up  1.00000 1.00000
13  nvme  1.00000             osd.13     up  1.00000 1.00000
14  nvme  1.00000             osd.14     up  1.00000 1.00000
15  nvme  1.00000             osd.15     up  1.00000 1.00000
16  nvme  1.00000             osd.16     up  1.00000 1.00000
17  nvme  1.00000             osd.17     up  1.00000 1.00000
18  nvme  1.00000             osd.18     up  1.00000 1.00000
19  nvme  1.00000             osd.19     up  1.00000 1.00000
20  nvme  1.00000             osd.20     up  1.00000 1.00000
21  nvme  1.00000             osd.21     up  1.00000 1.00000
22  nvme  1.00000             osd.22     up  1.00000 1.00000
23  nvme  1.00000             osd.23     up  1.00000 1.00000
-4       12.00000         node cpn04
36  nvme  1.00000             osd.36     up  1.00000 1.00000
37  nvme  1.00000             osd.37     up  1.00000 1.00000
38  nvme  1.00000             osd.38     up  1.00000 1.00000
39  nvme  1.00000             osd.39     up  1.00000 1.00000
40  nvme  1.00000             osd.40     up  1.00000 1.00000
41  nvme  1.00000             osd.41     up  1.00000 1.00000
42  nvme  1.00000             osd.42     up  1.00000 1.00000
43  nvme  1.00000             osd.43     up  1.00000 1.00000
44  nvme  1.00000             osd.44     up  1.00000 1.00000
45  nvme  1.00000             osd.45     up  1.00000 1.00000
46  nvme  1.00000             osd.46     up  1.00000 1.00000
47  nvme  1.00000             osd.47     up  1.00000 1.00000

 

The disk partition of one of the OSD nodes:

 

NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme6n1                259:1    0   1.1T  0 disk
├─nvme6n1p2            259:15   0   1.1T  0 part
└─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
nvme9n1                259:0    0   1.1T  0 disk
├─nvme9n1p2            259:8    0   1.1T  0 part
└─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
sdb                      8:16   0 139.8G  0 disk
└─sdb1                   8:17   0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
    └─md0p1            259:30   0 486.3M  0 md    /boot
nvme11n1               259:2    0   1.1T  0 disk
├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
└─nvme11n1p2           259:14   0   1.1T  0 part
nvme2n1                259:6    0   1.1T  0 disk
├─nvme2n1p2            259:21   0   1.1T  0 part
└─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
nvme5n1                259:3    0   1.1T  0 disk
├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
└─nvme5n1p2            259:10   0   1.1T  0 part
nvme8n1                259:24   0   1.1T  0 disk
├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
└─nvme8n1p2            259:28   0   1.1T  0 part
nvme10n1               259:11   0   1.1T  0 disk
├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
└─nvme10n1p2           259:23   0   1.1T  0 part
nvme1n1                259:33   0   1.1T  0 disk
├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
└─nvme1n1p2            259:35   0   1.1T  0 part
nvme4n1                259:5    0   1.1T  0 disk
├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
└─nvme4n1p2            259:19   0   1.1T  0 part
nvme7n1                259:25   0   1.1T  0 disk
├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
└─nvme7n1p2            259:29   0   1.1T  0 part
sda                      8:0    0 139.8G  0 disk
└─sda1                   8:1    0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
    └─md0p1            259:30   0 486.3M  0 md    /boot
nvme0n1                259:36   0   1.1T  0 disk
├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
└─nvme0n1p2            259:38   0   1.1T  0 part
nvme3n1                259:4    0   1.1T  0 disk
├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
└─nvme3n1p2            259:17   0   1.1T  0 part

 

 

For the disk scheduler we're using [kyber]; for read_ahead_kb we tried different values (0, 128 and 2048); rq_affinity is set to 2 and the rotational parameter to 0.
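
To confirm that these queue settings are applied on every NVMe device and not just the one last tuned by hand, the values can be read back from sysfs. A minimal sketch, assuming the standard `/sys/block/<dev>/queue/` attributes; the helper names are made up for illustration.

```python
from pathlib import Path


def active_scheduler(text):
    """Return the scheduler shown in brackets, e.g. 'mq-deadline [kyber] none' -> 'kyber'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def queue_settings(dev, sys_root="/sys"):
    """Read the block-queue tunables mentioned above for one device."""
    q = Path(sys_root, "block", dev, "queue")
    return {
        "scheduler": active_scheduler((q / "scheduler").read_text()),
        "read_ahead_kb": int((q / "read_ahead_kb").read_text()),
        "rq_affinity": int((q / "rq_affinity").read_text()),
        "rotational": int((q / "rotational").read_text()),
    }


if __name__ == "__main__":
    for dev in sorted(Path("/sys/block").glob("nvme*")):
        try:
            print(dev.name, queue_settings(dev.name))
        except OSError as exc:
            print(dev.name, "unreadable:", exc)
```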

We've also set the CPU governor to performance on all cores and tuned some sysctl parameters:

 

# for Ceph
net.ipv4.ip_forward=0
net.ipv4.conf.default.rp_filter=1
kernel.sysrq=0
kernel.core_uses_pid=1
net.ipv4.tcp_syncookies=0
#net.netfilter.nf_conntrack_max=2621440
#net.netfilter.nf_conntrack_tcp_timeout_established = 1800
# disable netfilter on bridges
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
vm.min_free_kbytes=1000000

# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 65536

# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the total amount of shared memory, in pages
kernel.shmall = 4294967296
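
For scale, the memory-related values above can be converted to GiB. This is just arithmetic on the posted numbers; the 4 KiB page size is an assumption (the x86-64 default, which `getconf PAGE_SIZE` will confirm).

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

SHMMAX_BYTES = 68719476736     # kernel.shmmax, in bytes
SHMALL_PAGES = 4294967296      # kernel.shmall, in pages
MIN_FREE_KBYTES = 1000000      # vm.min_free_kbytes, in KiB


def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


shmmax_gib = gib(SHMMAX_BYTES)
shmall_gib = gib(SHMALL_PAGES * PAGE_SIZE)
min_free_gib = gib(MIN_FREE_KBYTES * 1024)

if __name__ == "__main__":
    print(f"kernel.shmmax      = {shmmax_gib:.0f} GiB per segment")
    print(f"kernel.shmall      = {shmall_gib:.0f} GiB in total")
    print(f"vm.min_free_kbytes = {min_free_gib:.2f} GiB kept free")
```

On a 128G node, the `vm.min_free_kbytes` reservation works out to a bit under 1 GiB, i.e. less than 1% of RAM.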

 

 

The ceph.conf file is:

 

...
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 1600
osd_pool_default_pgp_num = 1600

debug_crush = 1/1
debug_buffer = 0/1
debug_timer = 0/0
debug_filer = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_ms = 0/5
debug_throttle = 1/1

debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_journal = 0/0
debug_filestore = 0/0
debug_mon = 0/0
debug_paxos = 0/0

osd_crush_chooseleaf_type = 0
filestore_xattr_use_omap = true

rbd_cache = true
mon_compact_on_trim = false

[osd]
osd_crush_update_on_start = false

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_default_features = 1
admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log_file = /var/log/ceph/
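
The `osd_pool_default_pg_num = 1600` above appears to match the common rule of thumb of roughly 100 PGs per OSD divided by the replica count (48 OSDs, size 3). The usual guidance is to round that figure to a power of two, and it is meant as a total across all pools rather than a per-pool default. A small sketch of that calculation, with the 100-per-OSD target as the assumed input:

```python
def suggested_pg_total(num_osds, replicas, target_per_osd=100):
    """Rule-of-thumb PG total: (OSDs * target) / replicas, rounded up to a power of two."""
    raw = num_osds * target_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return raw, power


if __name__ == "__main__":
    raw, rounded = suggested_pg_total(num_osds=48, replicas=3)
    print(f"raw estimate: {raw:.0f}, rounded up to a power of two: {rounded}")
```

With two pools sharing the 48 OSDs, that total would be split between them according to how much data each is expected to hold.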

 

 

The cluster has two production pools: one for OpenStack (volumes) with a replication factor of 3, and another for databases (db) with a replication factor of 2. The DBA team has performed several tests with a volume mounted on the DB server (via RBD). The DB server has the following configuration:

 

OS: CentOS 6.9 | kernel 4.14.1
DB: MySQL
ProLiant BL685c G7
4x AMD Opteron Processor 6376 (total of 64 cores)
128G RAM
1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 vlans

 

 

 

We also did some tests with sysbench on different storage types:

 

sysbench

disk            tps       qps        latency (ms) 95th percentile
Local SSD       261,28    5.225,61    5,18
Ceph NVMe        95,18    1.903,53   12,3
Pure Storage    196,49    3.929,71    6,32
NetApp FAS      189,83    3.796,59    6,67
EMC VMAX        196,14    3.922,82    6,32
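
To make the gap easier to read, the figures above can be expressed relative to the local SSD. The sketch below only re-uses the numbers from the table (which use a decimal comma and a dot as thousands separator); nothing in it is a new measurement.

```python
def parse_eu(text):
    """Parse a number written with '.' for thousands and ',' for decimals."""
    return float(text.replace(".", "").replace(",", "."))


# (tps, qps, 95th percentile latency in ms), copied from the table above
RESULTS = {
    "Local SSD": ("261,28", "5.225,61", "5,18"),
    "Ceph NVMe": ("95,18", "1.903,53", "12,3"),
    "Pure Storage": ("196,49", "3.929,71", "6,32"),
    "NetApp FAS": ("189,83", "3.796,59", "6,67"),
    "EMC VMAX": ("196,14", "3.922,82", "6,32"),
}


def relative_to(baseline, results=RESULTS):
    """Return {disk: (tps ratio, qps ratio, latency ratio)} against the baseline row."""
    base = [parse_eu(v) for v in results[baseline]]
    return {
        disk: tuple(parse_eu(v) / b for v, b in zip(values, base))
        for disk, values in results.items()
    }


if __name__ == "__main__":
    for disk, (tps, qps, lat) in relative_to("Local SSD").items():
        print(f"{disk:<13} tps x{tps:.2f}  qps x{qps:.2f}  latency x{lat:.2f}")
```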

 

 

Is there any specific tuning that I can apply to the ceph cluster, in order to improve those numbers? Or are those numbers ok for the type and size of the cluster that we have? Any advice would be really appreciated.

 

Thanks,

 

 

 

German

 

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi,

What is the value of --num-threads (the default is 1)? Ceph does better with more threads: try 32 or 64.
What are the values of --file-block-size (default 16k) and --file-test-mode? If you are using the sequential modes (seqwr/seqrd) you will be hitting the same OSD, so maybe try the random modes (rndrd/rndwr), or better, use an rbd stripe size of 16KB (the default rbd stripe is 4MB). rbd striping is ideal for the small-block sequential I/O pattern typical of databases.
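
To illustrate why a smaller stripe unit spreads sequential small-block I/O, here is a sketch of the offset-to-object mapping used by Ceph striping. The stripe count of 8 is an assumption picked for the example (it was not given above), and the function name is made up.

```python
def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a byte offset in the image to (object number, offset inside that object)."""
    stripes_per_object = object_size // stripe_unit
    blockno = offset // stripe_unit          # which stripe unit overall
    stripeno = blockno // stripe_count       # which row of stripe units
    stripepos = blockno % stripe_count       # which object within the object set
    objectsetno = stripeno // stripes_per_object
    objectno = objectsetno * stripe_count + stripepos
    obj_offset = (stripeno % stripes_per_object) * stripe_unit + offset % stripe_unit
    return objectno, obj_offset


if __name__ == "__main__":
    KB, MB = 1024, 1024 * 1024
    for off in range(0, 8 * 16 * KB, 16 * KB):
        default = locate(off, 4 * MB, 1, 4 * MB)   # default layout: one 4MB unit per object
        striped = locate(off, 16 * KB, 8, 4 * MB)  # 16KB units over 8 objects (assumed count)
        print(f"offset {off // KB:>4}K  default -> object {default[0]}  striped -> object {striped[0]}")
```

With the default layout, eight consecutive 16KB writes all land in object 0 (one PG), whereas with 16KB stripe units they rotate over eight objects. Note that non-default striping needs the striping image feature, while the ceph.conf above sets rbd_default_features = 1 (layering only), and client support for it should be checked before relying on it.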

/Maged

 



