I was wondering if there are any statistics available that show the performance increase of doing such things?

-----Original Message-----
From: German Anders [mailto:ganders@xxxxxxxxxxxx]
Sent: Tuesday, 28 November 2017 19:34
To: Luis Periquito
Cc: ceph-users
Subject: Re: ceph all-nvme mysql performance tuning

Thanks a lot Luis,

I agree with you regarding the CPUs, but unfortunately those were the best CPU model that we could afford :S

For the NUMA part, I managed to pin the OSDs by changing the /usr/lib/systemd/system/ceph-osd@.service file and adding the CPUAffinity list to it. But that pins ALL the OSDs to a specific node or CPU list; I can't find a way to specify a list for only a specific set of OSDs. Also, I noticed that the NVMe disks are all on the same node (since I'm using half of the shelf - the other half will be connected to the other node), so the PCIe lanes of the NVMe disks all belong to the same CPU (in this case 0). I also found that the IB adapter that is mapped to the OSD network (OSD replication) is attached to CPU 1, so that traffic will cross the QPI path. And for the memory, from the other email, we are already using the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of 134217728.

For now I can pin all the current OSDs to CPU 0, but in the near future, when I add more NVMe disks to the OSD nodes, I'll definitely need to pin the other half of the OSDs to CPU 1. Has someone already done this?

Thanks a lot,

Best,

German
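One way to pin only some of the OSDs, rather than editing the packaged template unit, should be a per-instance systemd drop-in. A rough, untested sketch - the OSD id and CPU lists are placeholders, the device names are examples, and the node each device hangs off can be read from sysfs:

# Which NUMA node a given NVMe device or IB port is attached to (-1 = unknown);
# exact sysfs paths can vary a bit by kernel:
cat /sys/block/nvme0n1/device/device/numa_node
cat /sys/class/infiniband/mlx4_0/device/numa_node

# Per-instance drop-in, e.g. pin osd.3 to socket 0 (CPU ids are examples only,
# check "lscpu -e" for the real enumeration on the box):
mkdir -p /etc/systemd/system/ceph-osd@3.service.d
cat > /etc/systemd/system/ceph-osd@3.service.d/cpuaffinity.conf <<'EOF'
[Service]
CPUAffinity=0 1 2 3 4 5 6 7 8 9
EOF

systemctl daemon-reload
systemctl restart ceph-osd@3

The same kind of drop-in, with the socket-1 CPU list, would then cover the second half of the OSDs once the other half of the shelf is populated.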
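On the tcmalloc side, with the stock Ubuntu packaging that value is normally read from /etc/default/ceph (the ceph-osd@.service units pull it in as an EnvironmentFile), so it's worth confirming the running daemons actually picked it up - a quick check, assuming the standard packaging:

grep TCMALLOC /etc/default/ceph
# TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   (128 MiB)

# Verify it is present in a running OSD's environment:
tr '\0' '\n' < /proc/$(pidof -s ceph-osd)/environ | grep TCMALLOC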
2017-11-28 6:36 GMT-03:00 Luis Periquito <periquito@xxxxxxxxx>:

There are a few things I don't like about your machines... If you want latency/IOPS (as you seemingly do) you really want the highest-frequency CPUs, even over number of cores. These are not too bad, but not great either.

Also you have 2x CPUs, meaning NUMA. Have you pinned OSDs to NUMA nodes? Ideally each OSD is pinned to the same NUMA node the NVMe device is connected to. Each NVMe device will be running on PCIe lanes provided by one of the CPUs...

What versions of TCMalloc (or jemalloc) are you running? Have you tuned them to have a bigger cache?

These are from what I've learned using filestore - I've yet to run full tests on bluestore - but they should still apply...

On Mon, Nov 27, 2017 at 5:10 PM, German Anders <ganders@xxxxxxxxxxxx> wrote:

Hi Nick,

Yeah, we are using the same NVMe disk with an additional partition to use as journal/WAL. We double-checked the C-states and they were not configured to stay at C1, so we changed that on all the OSD and MON nodes and we're going to run some new tests and see how it goes. I'll get back as soon as I've got those tests running.

Thanks a lot,

Best,

German

2017-11-27 12:16 GMT-03:00 Nick Fisk <nick@xxxxxxxxxx>:

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of German Anders
Sent: 27 November 2017 14:44
To: Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: ceph all-nvme mysql performance tuning

Hi Maged,

Thanks a lot for the response. We tried different numbers of threads and we're getting almost the same kind of difference between the storage types. Going to try different rbd stripe sizes and object size values and see if we get more competitive numbers. Will get back with more tests and parameter changes to see if we do better :)

Just to echo a couple of comments. Ceph will always struggle to match the performance of a traditional array, mainly for two reasons.

1. You are replacing some sort of dual-ported SAS or internally RDMA-connected device with a network for Ceph replication traffic. This will instantly have a large impact on write latency.

2. Ceph locks at the PG level, and a PG will most likely cover at least one 4MB object, so lots of small accesses to the same blocks (on a block device) will wait on each other and effectively run at a single-threaded rate.

The best things you can do to mitigate these are to run the fastest journal/WAL devices you can, the fastest network connections (i.e. 25Gb/s), and keep your CPUs at their highest-performance C- and P-states. You stated that you are running the performance profile on the CPUs. Could you also just double-check that the C-states are being held at C1(e)? There are a few utilities that can show this in real time.

Other than that, although there could be some minor tweaks, you are probably nearing the limit of what you can hope to achieve.

Nick
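A couple of ways to watch that in real time, as a rough sketch (package names and exact option syntax differ a little between distros, so treat these as starting points rather than exact commands):

# Per-core C-state residency while a benchmark is running:
turbostat
cpupower monitor

# To hold the cores at C1, the usual options are the kernel command line
#   intel_idle.max_cstate=1 processor.max_cstate=1
# or keeping /dev/cpu_dma_latency open with a value of 0 from userspace,
# which is what tuned's latency-performance profile does.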
Thanks,

Best,

German

2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokhtar@xxxxxxxxxxx>:

On 2017-11-27 15:02, German Anders wrote:

Hi All,

I have a performance question. We recently installed a brand new Ceph cluster with all-NVMe disks, using ceph version 12.2.0 with bluestore configured. The back-end of the cluster uses a bonded IPoIB link (active/passive), and on the front-end we use a bonding config with active/active (20GbE) to communicate with the clients.

The cluster configuration is the following:

MON Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
  2x Intel Xeon E5-2630v4 @2.2Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

OSD Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
  2x Intel Xeon E5-2640v4 @2.4Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  1x Ethernet Controller 10G X550T
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

Here's the tree:

ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-7       48.00000 root root
-5       24.00000     rack rack1
-1       12.00000         node cpn01
 0  nvme  1.00000             osd.0    up  1.00000 1.00000
 1  nvme  1.00000             osd.1    up  1.00000 1.00000
 2  nvme  1.00000             osd.2    up  1.00000 1.00000
 3  nvme  1.00000             osd.3    up  1.00000 1.00000
 4  nvme  1.00000             osd.4    up  1.00000 1.00000
 5  nvme  1.00000             osd.5    up  1.00000 1.00000
 6  nvme  1.00000             osd.6    up  1.00000 1.00000
 7  nvme  1.00000             osd.7    up  1.00000 1.00000
 8  nvme  1.00000             osd.8    up  1.00000 1.00000
 9  nvme  1.00000             osd.9    up  1.00000 1.00000
10  nvme  1.00000             osd.10   up  1.00000 1.00000
11  nvme  1.00000             osd.11   up  1.00000 1.00000
-3       12.00000         node cpn03
24  nvme  1.00000             osd.24   up  1.00000 1.00000
25  nvme  1.00000             osd.25   up  1.00000 1.00000
26  nvme  1.00000             osd.26   up  1.00000 1.00000
27  nvme  1.00000             osd.27   up  1.00000 1.00000
28  nvme  1.00000             osd.28   up  1.00000 1.00000
29  nvme  1.00000             osd.29   up  1.00000 1.00000
30  nvme  1.00000             osd.30   up  1.00000 1.00000
31  nvme  1.00000             osd.31   up  1.00000 1.00000
32  nvme  1.00000             osd.32   up  1.00000 1.00000
33  nvme  1.00000             osd.33   up  1.00000 1.00000
34  nvme  1.00000             osd.34   up  1.00000 1.00000
35  nvme  1.00000             osd.35   up  1.00000 1.00000
-6       24.00000     rack rack2
-2       12.00000         node cpn02
12  nvme  1.00000             osd.12   up  1.00000 1.00000
13  nvme  1.00000             osd.13   up  1.00000 1.00000
14  nvme  1.00000             osd.14   up  1.00000 1.00000
15  nvme  1.00000             osd.15   up  1.00000 1.00000
16  nvme  1.00000             osd.16   up  1.00000 1.00000
17  nvme  1.00000             osd.17   up  1.00000 1.00000
18  nvme  1.00000             osd.18   up  1.00000 1.00000
19  nvme  1.00000             osd.19   up  1.00000 1.00000
20  nvme  1.00000             osd.20   up  1.00000 1.00000
21  nvme  1.00000             osd.21   up  1.00000 1.00000
22  nvme  1.00000             osd.22   up  1.00000 1.00000
23  nvme  1.00000             osd.23   up  1.00000 1.00000
-4       12.00000         node cpn04
36  nvme  1.00000             osd.36   up  1.00000 1.00000
37  nvme  1.00000             osd.37   up  1.00000 1.00000
38  nvme  1.00000             osd.38   up  1.00000 1.00000
39  nvme  1.00000             osd.39   up  1.00000 1.00000
40  nvme  1.00000             osd.40   up  1.00000 1.00000
41  nvme  1.00000             osd.41   up  1.00000 1.00000
42  nvme  1.00000             osd.42   up  1.00000 1.00000
43  nvme  1.00000             osd.43   up  1.00000 1.00000
44  nvme  1.00000             osd.44   up  1.00000 1.00000
45  nvme  1.00000             osd.45   up  1.00000 1.00000
46  nvme  1.00000             osd.46   up  1.00000 1.00000
47  nvme  1.00000             osd.47   up  1.00000 1.00000

The disk partition layout of one of the OSD nodes:

NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme6n1                259:1    0   1.1T  0 disk
├─nvme6n1p2            259:15   0   1.1T  0 part
└─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
nvme9n1                259:0    0   1.1T  0 disk
├─nvme9n1p2            259:8    0   1.1T  0 part
└─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
sdb                      8:16   0 139.8G  0 disk
└─sdb1                   8:17   0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
    └─md0p1            259:30   0 486.3M  0 md    /boot
nvme11n1               259:2    0   1.1T  0 disk
├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
└─nvme11n1p2           259:14   0   1.1T  0 part
nvme2n1                259:6    0   1.1T  0 disk
├─nvme2n1p2            259:21   0   1.1T  0 part
└─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
nvme5n1                259:3    0   1.1T  0 disk
├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
└─nvme5n1p2            259:10   0   1.1T  0 part
nvme8n1                259:24   0   1.1T  0 disk
├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
└─nvme8n1p2            259:28   0   1.1T  0 part
nvme10n1               259:11   0   1.1T  0 disk
├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
└─nvme10n1p2           259:23   0   1.1T  0 part
nvme1n1                259:33   0   1.1T  0 disk
├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
└─nvme1n1p2            259:35   0   1.1T  0 part
nvme4n1                259:5    0   1.1T  0 disk
├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
└─nvme4n1p2            259:19   0   1.1T  0 part
nvme7n1                259:25   0   1.1T  0 disk
├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
└─nvme7n1p2            259:29   0   1.1T  0 part
sda                      8:0    0 139.8G  0 disk
└─sda1                   8:1    0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
    └─md0p1            259:30   0 486.3M  0 md    /boot
nvme0n1                259:36   0   1.1T  0 disk
├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
└─nvme0n1p2            259:38   0   1.1T  0 part
nvme3n1                259:4    0   1.1T  0 disk
├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
└─nvme3n1p2            259:17   0   1.1T  0 part

For the disk scheduler we're using [kyber]; for read_ahead_kb we tried different values (0, 128 and 2048); rq_affinity is set to 2, and the rotational parameter is set to 0.
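As a rough sketch of how settings like these are typically applied per device (the loop below just mirrors the values mentioned above; a udev rule is the usual way to make them survive reboots, and kyber needs a blk-mq capable kernel such as the 4.12 used here):

for dev in /sys/block/nvme*n1; do
    echo kyber > "$dev/queue/scheduler"      # pick the kyber I/O scheduler
    echo 0     > "$dev/queue/read_ahead_kb"  # 0 / 128 / 2048 were the values tried
    echo 2     > "$dev/queue/rq_affinity"    # complete I/O on the submitting CPU
    echo 0     > "$dev/queue/rotational"
done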
We've also set the CPU governor to performance on all the cores, and tuned some sysctl parameters as well:

# for Ceph
net.ipv4.ip_forward=0
net.ipv4.conf.default.rp_filter=1
kernel.sysrq=0
kernel.core_uses_pid=1
net.ipv4.tcp_syncookies=0
#net.netfilter.nf_conntrack_max=2621440
#net.netfilter.nf_conntrack_tcp_timeout_established = 1800
# disable netfilter on bridges
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
vm.min_free_kbytes=1000000
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536
# Controls the default maximum size of a message queue
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

The ceph.conf file is:

...
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 1600
osd_pool_default_pgp_num = 1600
debug_crush = 1/1
debug_buffer = 0/1
debug_timer = 0/0
debug_filer = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_ms = 0/5
debug_throttle = 1/1
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_journal = 0/0
debug_filestore = 0/0
debug_mon = 0/0
debug_paxos = 0/0
osd_crush_chooseleaf_type = 0
filestore_xattr_use_omap = true
rbd_cache = true
mon_compact_on_trim = false

[osd]
osd_crush_update_on_start = false

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_default_features = 1
admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log_file = /var/log/ceph/

The cluster has two production pools: one for OpenStack (volumes) with a replication factor of 3, and another pool for databases (db) with a replication factor of 2. The DBA team has performed several tests with a volume mounted on the DB server (via RBD). The DB server has the following configuration:

OS: CentOS 6.9 | kernel 4.14.1
DB: MySQL
ProLiant BL685c G7
4x AMD Opteron Processor 6376 (total of 64 cores)
128G RAM
1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 vlans

We also did some tests with sysbench on different storage types:

sysbench disk    tps      qps        latency (ms), 95th percentile
Local SSD        261,28   5.225,61    5,18
Ceph NVMe         95,18   1.903,53   12,3
Pure Storage     196,49   3.929,71    6,32
NetApp FAS       189,83   3.796,59    6,67
EMC VMAX         196,14   3.922,82    6,32

Is there any specific tuning that I can apply to the ceph cluster in order to improve those numbers? Or are those numbers OK for the type and size of cluster that we have? Any advice would be really appreciated.

Thanks,

German

Hi,

What is the value of --num-threads (the default is 1)? Ceph will do better with more threads: 32 or 64.

What is the value of --file-block-size (default 16k) and file-test-mode? If you are using sequential seqwr/seqrd you will be hitting the same OSD, so maybe try random (rndrd/rndwr), or better, use an rbd stripe size of 16kb (the default rbd stripe is 4M). rbd striping is ideal for the small-block sequential IO pattern typical in databases.
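For illustration, a sketch of both suggestions; the pool/image names and sizes are made up, --stripe-unit is given in bytes to stay compatible with older rbd releases, and note that non-default striping only applies to librbd clients (the kernel RBD driver does not support fancy striping):

# Create an image with a 16 KiB stripe unit spread across 16 objects
# (the default is effectively one 4 MiB stripe per object):
rbd create db/mysql-bench --size 500G \
    --object-size 4M --stripe-unit 16384 --stripe-count 16

# sysbench fileio with more threads and a random pattern (0.4/0.5 syntax):
sysbench --test=fileio --file-total-size=16G prepare
sysbench --test=fileio --file-total-size=16G --file-test-mode=rndrw \
         --file-block-size=16K --num-threads=64 --max-time=300 \
         --max-requests=0 run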
/Maged

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com