I was wondering if there are any statistics available that show the performance increase of doing such things?

-----Original Message-----
From: German Anders [mailto:ganders@xxxxxxxxxxxx]
Sent: Tuesday, 28 November 2017 19:34
To: Luis Periquito
Cc: ceph-users
Subject: Re: ceph all-nvme mysql performance tuning

Thanks a lot Luis,

I agree with you regarding the CPUs, but unfortunately those were the best CPU model that we could afford :S

For the NUMA part, I managed to pin the OSDs by changing the /usr/lib/systemd/system/ceph-osd@.service file and adding the CPUAffinity list to it. But that pins ALL the OSDs to a specific node or CPU list; I can't find a way to specify a list for only a specific set of OSDs. Also, I noticed that the NVMe disks are all on the same node (since I'm using half of the shelf - the other half will be connected to the other node), so the PCIe lanes of the NVMe disks all belong to the same CPU (in this case 0). I also found that the IB adapter that is mapped to the OSD network (OSD replication) is attached to CPU 1, so that traffic will cross the QPI path. And for the memory, from the other email, we are already using the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of 134217728.

For now I can pin all the current OSDs to CPU 0, but in the near future, when I add more NVMe disks to the OSD nodes, I'll definitely need to pin the other half of the OSDs to CPU 1. Has someone already done this?

Thanks a lot,

Best,

German
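One way to pin only some of the OSDs, rather than editing the packaged template unit, should be a per-instance systemd drop-in. A rough, untested sketch - the OSD id and CPU lists are placeholders, the device names are examples, and the node each device hangs off can be read from sysfs:

# Which NUMA node a given NVMe device or IB port is attached to (-1 = unknown);
# exact sysfs paths can vary a bit by kernel:
cat /sys/block/nvme0n1/device/device/numa_node
cat /sys/class/infiniband/mlx4_0/device/numa_node

# Per-instance drop-in, e.g. pin osd.3 to socket 0 (CPU ids are examples only,
# check "lscpu -e" for the real enumeration on the box):
mkdir -p /etc/systemd/system/ceph-osd@3.service.d
cat > /etc/systemd/system/ceph-osd@3.service.d/cpuaffinity.conf <<'EOF'
[Service]
CPUAffinity=0 1 2 3 4 5 6 7 8 9
EOF

systemctl daemon-reload
systemctl restart ceph-osd@3

The same kind of drop-in, with the socket-1 CPU list, would then cover the second half of the OSDs once the other half of the shelf is populated.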
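On the tcmalloc side, with the stock Ubuntu packaging that value is normally read from /etc/default/ceph (the ceph-osd@.service units pull it in as an EnvironmentFile), so it's worth confirming the running daemons actually picked it up - a quick check, assuming the standard packaging:

grep TCMALLOC /etc/default/ceph
# TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   (128 MiB)

# Verify it is present in a running OSD's environment:
tr '\0' '\n' < /proc/$(pidof -s ceph-osd)/environ | grep TCMALLOC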
2017-11-28 6:36 GMT-03:00 Luis Periquito <periquito@xxxxxxxxx>:

There are a few things I don't like about your machines... If you want latency/IOPS (as you seemingly do) you really want the highest-frequency CPUs, even over number of cores. These are not too bad, but not great either.

Also you have 2x CPUs, meaning NUMA. Have you pinned OSDs to NUMA nodes? Ideally each OSD is pinned to the same NUMA node the NVMe device is connected to. Each NVMe device will be running on PCIe lanes provided by one of the CPUs...

What versions of TCMalloc (or jemalloc) are you running? Have you tuned them to have a bigger cache?

These are from what I've learned using filestore - I've yet to run full tests on bluestore - but they should still apply...

On Mon, Nov 27, 2017 at 5:10 PM, German Anders <ganders@xxxxxxxxxxxx> wrote:

Hi Nick,

Yeah, we are using the same NVMe disk with an additional partition to use as journal/WAL. We double-checked the C-states and they were not configured to stay at C1, so we changed that on all the OSD and MON nodes and we're going to run some new tests and see how it goes. I'll get back as soon as I've got those tests running.

Thanks a lot,

Best,

German

2017-11-27 12:16 GMT-03:00 Nick Fisk <nick@xxxxxxxxxx>:

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of German Anders
Sent: 27 November 2017 14:44
To: Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: ceph all-nvme mysql performance tuning

Hi Maged,

Thanks a lot for the response. We tried different numbers of threads and we're getting almost the same kind of difference between the storage types. Going to try different rbd stripe sizes and object size values and see if we get more competitive numbers. Will get back with more tests and parameter changes to see if we do better :)

Just to echo a couple of comments. Ceph will always struggle to match the performance of a traditional array, mainly for two reasons.

1. You are replacing some sort of dual-ported SAS or internally RDMA-connected device with a network for Ceph replication traffic. This will instantly have a large impact on write latency.

2. Ceph locks at the PG level, and a PG will most likely cover at least one 4MB object, so lots of small accesses to the same blocks (on a block device) will wait on each other and effectively run at a single-threaded rate.

The best things you can do to mitigate these are to run the fastest journal/WAL devices you can, the fastest network connections (i.e. 25Gb/s), and keep your CPUs at their highest-performance C- and P-states. You stated that you are running the performance profile on the CPUs. Could you also just double-check that the C-states are being held at C1(e)? There are a few utilities that can show this in real time.

Other than that, although there could be some minor tweaks, you are probably nearing the limit of what you can hope to achieve.

Nick
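A couple of ways to watch that in real time, as a rough sketch (package names and exact option syntax differ a little between distros, so treat these as starting points rather than exact commands):

# Per-core C-state residency while a benchmark is running:
turbostat
cpupower monitor

# To hold the cores at C1, the usual options are the kernel command line
#   intel_idle.max_cstate=1 processor.max_cstate=1
# or keeping /dev/cpu_dma_latency open with a value of 0 from userspace,
# which is what tuned's latency-performance profile does.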
Thanks,

Best,

German

2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokhtar@xxxxxxxxxxx>:

On 2017-11-27 15:02, German Anders wrote:

Hi All,

I have a performance question. We recently installed a brand new Ceph cluster with all-NVMe disks, using ceph version 12.2.0 with bluestore configured. The back-end of the cluster uses a bonded IPoIB link (active/passive), and on the front-end we use a bonding config with active/active (20GbE) to communicate with the clients.

The cluster configuration is the following:

MON Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
  2x Intel Xeon E5-2630v4 @2.2Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

OSD Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
  2x Intel Xeon E5-2640v4 @2.4Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  1x Ethernet Controller 10G X550T
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

Here's the tree:

ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-7       48.00000 root root
-5       24.00000     rack rack1
-1       12.00000         node cpn01
 0  nvme  1.00000             osd.0    up  1.00000 1.00000
 1  nvme  1.00000             osd.1    up  1.00000 1.00000
 2  nvme  1.00000             osd.2    up  1.00000 1.00000
 3  nvme  1.00000             osd.3    up  1.00000 1.00000
 4  nvme  1.00000             osd.4    up  1.00000 1.00000
 5  nvme  1.00000             osd.5    up  1.00000 1.00000
 6  nvme  1.00000             osd.6    up  1.00000 1.00000
 7  nvme  1.00000             osd.7    up  1.00000 1.00000
 8  nvme  1.00000             osd.8    up  1.00000 1.00000
 9  nvme  1.00000             osd.9    up  1.00000 1.00000
10  nvme  1.00000             osd.10   up  1.00000 1.00000
11  nvme  1.00000             osd.11   up  1.00000 1.00000
-3       12.00000         node cpn03
24  nvme  1.00000             osd.24   up  1.00000 1.00000
25  nvme  1.00000             osd.25   up  1.00000 1.00000
26  nvme  1.00000             osd.26   up  1.00000 1.00000
27  nvme  1.00000             osd.27   up  1.00000 1.00000
28  nvme  1.00000             osd.28   up  1.00000 1.00000
29  nvme  1.00000             osd.29   up  1.00000 1.00000
30  nvme  1.00000             osd.30   up  1.00000 1.00000
31  nvme  1.00000             osd.31   up  1.00000 1.00000
32  nvme  1.00000             osd.32   up  1.00000 1.00000
33  nvme  1.00000             osd.33   up  1.00000 1.00000
34  nvme  1.00000             osd.34   up  1.00000 1.00000
35  nvme  1.00000             osd.35   up  1.00000 1.00000
-6       24.00000     rack rack2
-2       12.00000         node cpn02
12  nvme  1.00000             osd.12   up  1.00000 1.00000
13  nvme  1.00000             osd.13   up  1.00000 1.00000
14  nvme  1.00000             osd.14   up  1.00000 1.00000
15  nvme  1.00000             osd.15   up  1.00000 1.00000
16  nvme  1.00000             osd.16   up  1.00000 1.00000
17  nvme  1.00000             osd.17   up  1.00000 1.00000
18  nvme  1.00000             osd.18   up  1.00000 1.00000
19  nvme  1.00000             osd.19   up  1.00000 1.00000
20  nvme  1.00000             osd.20   up  1.00000 1.00000
21  nvme  1.00000             osd.21   up  1.00000 1.00000
22  nvme  1.00000             osd.22   up  1.00000 1.00000
23  nvme  1.00000             osd.23   up  1.00000 1.00000
-4       12.00000         node cpn04
36  nvme  1.00000             osd.36   up  1.00000 1.00000
37  nvme  1.00000             osd.37   up  1.00000 1.00000
38  nvme  1.00000             osd.38   up  1.00000 1.00000
39  nvme  1.00000             osd.39   up  1.00000 1.00000
40  nvme  1.00000             osd.40   up  1.00000 1.00000
41  nvme  1.00000             osd.41   up  1.00000 1.00000
42  nvme  1.00000             osd.42   up  1.00000 1.00000
43  nvme  1.00000             osd.43   up  1.00000 1.00000
44  nvme  1.00000             osd.44   up  1.00000 1.00000
45  nvme  1.00000             osd.45   up  1.00000 1.00000
46  nvme  1.00000             osd.46   up  1.00000 1.00000
47  nvme  1.00000             osd.47   up  1.00000 1.00000

The disk partition layout of one of the OSD nodes:

NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme6n1                259:1    0   1.1T  0 disk
├─nvme6n1p2            259:15   0   1.1T  0 part
└─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
nvme9n1                259:0    0   1.1T  0 disk
├─nvme9n1p2            259:8    0   1.1T  0 part
└─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
sdb                      8:16   0 139.8G  0 disk
└─sdb1                   8:17   0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
    └─md0p1            259:30   0 486.3M  0 md    /boot
nvme11n1               259:2    0   1.1T  0 disk
├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
└─nvme11n1p2           259:14   0   1.1T  0 part
nvme2n1                259:6    0   1.1T  0 disk
├─nvme2n1p2            259:21   0   1.1T  0 part
└─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
nvme5n1                259:3    0   1.1T  0 disk
├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
└─nvme5n1p2            259:10   0   1.1T  0 part
nvme8n1                259:24   0   1.1T  0 disk
├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
└─nvme8n1p2            259:28   0   1.1T  0 part
nvme10n1               259:11   0   1.1T  0 disk
├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
└─nvme10n1p2           259:23   0   1.1T  0 part
nvme1n1                259:33   0   1.1T  0 disk
├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
└─nvme1n1p2            259:35   0   1.1T  0 part
nvme4n1                259:5    0   1.1T  0 disk
├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
└─nvme4n1p2            259:19   0   1.1T  0 part
nvme7n1                259:25   0   1.1T  0 disk
├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
└─nvme7n1p2            259:29   0   1.1T  0 part
sda                      8:0    0 139.8G  0 disk
└─sda1                   8:1    0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
    └─md0p1            259:30   0 486.3M  0 md    /boot
nvme0n1                259:36   0   1.1T  0 disk
├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
└─nvme0n1p2            259:38   0   1.1T  0 part
nvme3n1                259:4    0   1.1T  0 disk
├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
└─nvme3n1p2            259:17   0   1.1T  0 part

For the disk scheduler we're using [kyber]; for read_ahead_kb we tried different values (0, 128 and 2048); rq_affinity is set to 2, and the rotational parameter is set to 0.
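As a rough sketch of how settings like these are typically applied per device (the loop below just mirrors the values mentioned above; a udev rule is the usual way to make them survive reboots, and kyber needs a blk-mq capable kernel such as the 4.12 used here):

for dev in /sys/block/nvme*n1; do
    echo kyber > "$dev/queue/scheduler"      # pick the kyber I/O scheduler
    echo 0     > "$dev/queue/read_ahead_kb"  # 0 / 128 / 2048 were the values tried
    echo 2     > "$dev/queue/rq_affinity"    # complete I/O on the submitting CPU
    echo 0     > "$dev/queue/rotational"
done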
We've also set the CPU governor to performance on all the cores, and tuned some sysctl parameters as well:

# for Ceph
net.ipv4.ip_forward=0
net.ipv4.conf.default.rp_filter=1
kernel.sysrq=0
kernel.core_uses_pid=1
net.ipv4.tcp_syncookies=0
#net.netfilter.nf_conntrack_max=2621440
#net.netfilter.nf_conntrack_tcp_timeout_established = 1800
# disable netfilter on bridges
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
vm.min_free_kbytes=1000000
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536
# Controls the default maximum size of a message queue
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

The ceph.conf file is:

...
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 1600
osd_pool_default_pgp_num = 1600
debug_crush = 1/1
debug_buffer = 0/1
debug_timer = 0/0
debug_filer = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_ms = 0/5
debug_throttle = 1/1
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_journal = 0/0
debug_filestore = 0/0
debug_mon = 0/0
debug_paxos = 0/0
osd_crush_chooseleaf_type = 0
filestore_xattr_use_omap = true
rbd_cache = true
mon_compact_on_trim = false

[osd]
osd_crush_update_on_start = false

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_default_features = 1
admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log_file = /var/log/ceph/

The cluster has two production pools: one for OpenStack (volumes) with a replication factor of 3, and another pool for databases (db) with a replication factor of 2. The DBA team has performed several tests with a volume mounted on the DB server (via RBD). The DB server has the following configuration:

OS: CentOS 6.9 | kernel 4.14.1
DB: MySQL
ProLiant BL685c G7
4x AMD Opteron Processor 6376 (total of 64 cores)
128G RAM
1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 vlans

We also did some tests with sysbench on different storage types:

sysbench disk    tps      qps        latency (ms), 95th percentile
Local SSD        261,28   5.225,61    5,18
Ceph NVMe         95,18   1.903,53   12,3
Pure Storage     196,49   3.929,71    6,32
NetApp FAS       189,83   3.796,59    6,67
EMC VMAX         196,14   3.922,82    6,32

Is there any specific tuning that I can apply to the ceph cluster in order to improve those numbers? Or are those numbers OK for the type and size of cluster that we have? Any advice would be really appreciated.

Thanks,

German

Hi,

What is the value of --num-threads (the default is 1)? Ceph will do better with more threads: 32 or 64.

What is the value of --file-block-size (default 16k) and file-test-mode? If you are using sequential seqwr/seqrd you will be hitting the same OSD, so maybe try random (rndrd/rndwr), or better, use an rbd stripe size of 16kb (the default rbd stripe is 4M). rbd striping is ideal for the small-block sequential IO pattern typical in databases.
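For illustration, a sketch of both suggestions; the pool/image names and sizes are made up, --stripe-unit is given in bytes to stay compatible with older rbd releases, and note that non-default striping only applies to librbd clients (the kernel RBD driver does not support fancy striping):

# Create an image with a 16 KiB stripe unit spread across 16 objects
# (the default is effectively one 4 MiB stripe per object):
rbd create db/mysql-bench --size 500G \
    --object-size 4M --stripe-unit 16384 --stripe-count 16

# sysbench fileio with more threads and a random pattern (0.4/0.5 syntax):
sysbench --test=fileio --file-total-size=16G prepare
sysbench --test=fileio --file-total-size=16G --file-test-mode=rndrw \
         --file-block-size=16K --num-threads=64 --max-time=300 \
         --max-requests=0 run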
/Maged

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com