Hi Wido, thanks a lot for the quick response. Regarding your questions:
Have you tried to attach multiple RBD volumes:
- Root for OS (the root partition is on local SSDs)
- MySQL data dir (the idea is to run all the storage tests with the same scheme; the first test uses a single volume holding the data dir, the InnoDB log files and the binary logs)
- MySQL InnoDB Logfile
- MySQL Binary Logging
So 4 disks in total to spread the I/O over. (the following tests will spread this across 3 separate RBD volumes, and we'll run a new comparison between the arrays; see the sketch below)
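For reference, a minimal sketch of how that next round could be laid out (pool "db" is the real pool name, but the image names, sizes and mount points below are placeholders, not our final values):

# one RBD image per MySQL component, mapped with krbd on the DB server
rbd create db/mysql-data   --size 500G
rbd create db/mysql-iblog  --size 50G
rbd create db/mysql-binlog --size 100G
for img in mysql-data mysql-iblog mysql-binlog; do
    dev=$(rbd map db/$img)
    mkfs.xfs -f "$dev"
done
mkdir -p /var/lib/mysql /var/lib/mysql-iblog /var/lib/mysql-binlog
mount /dev/rbd/db/mysql-data   /var/lib/mysql
mount /dev/rbd/db/mysql-iblog  /var/lib/mysql-iblog
mount /dev/rbd/db/mysql-binlog /var/lib/mysql-binlog

# my.cnf fragment pointing each component at its own volume
# (in a real setup adjust the existing config rather than append,
#  and move/recreate the InnoDB log files with mysqld stopped)
cat >> /etc/my.cnf <<'EOF'
[mysqld]
datadir                   = /var/lib/mysql
innodb_log_group_home_dir = /var/lib/mysql-iblog
log_bin                   = /var/lib/mysql-binlog/mysql-bin
EOF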
Regarding the version of librbd: it's not a typo, we also use this server with an old Ceph cluster. We are going to upgrade the version and see if the tests get better.
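(To confirm which librbd/librados the client actually links against before and after the upgrade, a quick sketch; package names depend on the distro:)

ceph --version                          # ceph CLI / cluster packages
rpm -q librbd1 librados2                # on the CentOS DB server
dpkg -l | grep -E 'librbd|librados'     # on the Ubuntu OSD/MON nodes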
Thanks
German
2017-11-27 10:16 GMT-03:00 Wido den Hollander <wido@xxxxxxxx>:
> On 27 November 2017 at 14:02, German Anders <ganders@xxxxxxxxxxxx> wrote:
>
>
> Hi All,
>
> I have a performance question. We recently installed a brand new Ceph cluster
> with all-NVMe disks, using ceph version 12.2.0 with BlueStore configured.
> The back-end of the cluster uses a bonded IPoIB link (active/passive), and
> the front-end uses an active/active bonding config (20GbE) to communicate
> with the clients.
>
> The cluster configuration is the following:
>
> *MON Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 3x 1U servers:
> 2x Intel Xeon E5-2630v4 @2.2Ghz
> 128G RAM
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> 2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
> *OSD Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 4x 2U servers:
> 2x Intel Xeon E5-2640v4 @2.4Ghz
> 128G RAM
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> 1x Ethernet Controller 10G X550T
> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> 12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
> Here's the tree:
>
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -7 48.00000 root root
> -5 24.00000 rack rack1
> -1 12.00000 node cpn01
> 0 nvme 1.00000 osd.0 up 1.00000 1.00000
> 1 nvme 1.00000 osd.1 up 1.00000 1.00000
> 2 nvme 1.00000 osd.2 up 1.00000 1.00000
> 3 nvme 1.00000 osd.3 up 1.00000 1.00000
> 4 nvme 1.00000 osd.4 up 1.00000 1.00000
> 5 nvme 1.00000 osd.5 up 1.00000 1.00000
> 6 nvme 1.00000 osd.6 up 1.00000 1.00000
> 7 nvme 1.00000 osd.7 up 1.00000 1.00000
> 8 nvme 1.00000 osd.8 up 1.00000 1.00000
> 9 nvme 1.00000 osd.9 up 1.00000 1.00000
> 10 nvme 1.00000 osd.10 up 1.00000 1.00000
> 11 nvme 1.00000 osd.11 up 1.00000 1.00000
> -3 12.00000 node cpn03
> 24 nvme 1.00000 osd.24 up 1.00000 1.00000
> 25 nvme 1.00000 osd.25 up 1.00000 1.00000
> 26 nvme 1.00000 osd.26 up 1.00000 1.00000
> 27 nvme 1.00000 osd.27 up 1.00000 1.00000
> 28 nvme 1.00000 osd.28 up 1.00000 1.00000
> 29 nvme 1.00000 osd.29 up 1.00000 1.00000
> 30 nvme 1.00000 osd.30 up 1.00000 1.00000
> 31 nvme 1.00000 osd.31 up 1.00000 1.00000
> 32 nvme 1.00000 osd.32 up 1.00000 1.00000
> 33 nvme 1.00000 osd.33 up 1.00000 1.00000
> 34 nvme 1.00000 osd.34 up 1.00000 1.00000
> 35 nvme 1.00000 osd.35 up 1.00000 1.00000
> -6 24.00000 rack rack2
> -2 12.00000 node cpn02
> 12 nvme 1.00000 osd.12 up 1.00000 1.00000
> 13 nvme 1.00000 osd.13 up 1.00000 1.00000
> 14 nvme 1.00000 osd.14 up 1.00000 1.00000
> 15 nvme 1.00000 osd.15 up 1.00000 1.00000
> 16 nvme 1.00000 osd.16 up 1.00000 1.00000
> 17 nvme 1.00000 osd.17 up 1.00000 1.00000
> 18 nvme 1.00000 osd.18 up 1.00000 1.00000
> 19 nvme 1.00000 osd.19 up 1.00000 1.00000
> 20 nvme 1.00000 osd.20 up 1.00000 1.00000
> 21 nvme 1.00000 osd.21 up 1.00000 1.00000
> 22 nvme 1.00000 osd.22 up 1.00000 1.00000
> 23 nvme 1.00000 osd.23 up 1.00000 1.00000
> -4 12.00000 node cpn04
> 36 nvme 1.00000 osd.36 up 1.00000 1.00000
> 37 nvme 1.00000 osd.37 up 1.00000 1.00000
> 38 nvme 1.00000 osd.38 up 1.00000 1.00000
> 39 nvme 1.00000 osd.39 up 1.00000 1.00000
> 40 nvme 1.00000 osd.40 up 1.00000 1.00000
> 41 nvme 1.00000 osd.41 up 1.00000 1.00000
> 42 nvme 1.00000 osd.42 up 1.00000 1.00000
> 43 nvme 1.00000 osd.43 up 1.00000 1.00000
> 44 nvme 1.00000 osd.44 up 1.00000 1.00000
> 45 nvme 1.00000 osd.45 up 1.00000 1.00000
> 46 nvme 1.00000 osd.46 up 1.00000 1.00000
> 47 nvme 1.00000 osd.47 up 1.00000 1.00000
>
> The disk partition of one of the OSD nodes:
>
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> nvme6n1 259:1 0 1.1T 0 disk
> ├─nvme6n1p2 259:15 0 1.1T 0 part
> └─nvme6n1p1 259:13 0 100M 0 part /var/lib/ceph/osd/ceph-6
> nvme9n1 259:0 0 1.1T 0 disk
> ├─nvme9n1p2 259:8 0 1.1T 0 part
> └─nvme9n1p1 259:7 0 100M 0 part /var/lib/ceph/osd/ceph-9
> sdb 8:16 0 139.8G 0 disk
> └─sdb1 8:17 0 139.8G 0 part
> └─md0 9:0 0 139.6G 0 raid1
> ├─md0p2 259:31 0 1K 0 md
> ├─md0p5 259:32 0 139.1G 0 md
> │ ├─cpn01--vg-swap 253:1 0 27.4G 0 lvm [SWAP]
> │ └─cpn01--vg-root 253:0 0 111.8G 0 lvm /
> └─md0p1 259:30 0 486.3M 0 md /boot
> nvme11n1 259:2 0 1.1T 0 disk
> ├─nvme11n1p1 259:12 0 100M 0 part /var/lib/ceph/osd/ceph-11
> └─nvme11n1p2 259:14 0 1.1T 0 part
> nvme2n1 259:6 0 1.1T 0 disk
> ├─nvme2n1p2 259:21 0 1.1T 0 part
> └─nvme2n1p1 259:20 0 100M 0 part /var/lib/ceph/osd/ceph-2
> nvme5n1 259:3 0 1.1T 0 disk
> ├─nvme5n1p1 259:9 0 100M 0 part /var/lib/ceph/osd/ceph-5
> └─nvme5n1p2 259:10 0 1.1T 0 part
> nvme8n1 259:24 0 1.1T 0 disk
> ├─nvme8n1p1 259:26 0 100M 0 part /var/lib/ceph/osd/ceph-8
> └─nvme8n1p2 259:28 0 1.1T 0 part
> nvme10n1 259:11 0 1.1T 0 disk
> ├─nvme10n1p1 259:22 0 100M 0 part /var/lib/ceph/osd/ceph-10
> └─nvme10n1p2 259:23 0 1.1T 0 part
> nvme1n1 259:33 0 1.1T 0 disk
> ├─nvme1n1p1 259:34 0 100M 0 part /var/lib/ceph/osd/ceph-1
> └─nvme1n1p2 259:35 0 1.1T 0 part
> nvme4n1 259:5 0 1.1T 0 disk
> ├─nvme4n1p1 259:18 0 100M 0 part /var/lib/ceph/osd/ceph-4
> └─nvme4n1p2 259:19 0 1.1T 0 part
> nvme7n1 259:25 0 1.1T 0 disk
> ├─nvme7n1p1 259:27 0 100M 0 part /var/lib/ceph/osd/ceph-7
> └─nvme7n1p2 259:29 0 1.1T 0 part
> sda 8:0 0 139.8G 0 disk
> └─sda1 8:1 0 139.8G 0 part
> └─md0 9:0 0 139.6G 0 raid1
> ├─md0p2 259:31 0 1K 0 md
> ├─md0p5 259:32 0 139.1G 0 md
> │ ├─cpn01--vg-swap 253:1 0 27.4G 0 lvm [SWAP]
> │ └─cpn01--vg-root 253:0 0 111.8G 0 lvm /
> └─md0p1 259:30 0 486.3M 0 md /boot
> nvme0n1 259:36 0 1.1T 0 disk
> ├─nvme0n1p1 259:37 0 100M 0 part /var/lib/ceph/osd/ceph-0
> └─nvme0n1p2 259:38 0 1.1T 0 part
> nvme3n1 259:4 0 1.1T 0 disk
> ├─nvme3n1p1 259:16 0 100M 0 part /var/lib/ceph/osd/ceph-3
> └─nvme3n1p2 259:17 0 1.1T 0 part
>
>
> For the disk scheduler we're using [kyber]; for read_ahead_kb we tried
> different values (0, 128 and 2048), rq_affinity is set to 2, and the
> rotational parameter is set to 0.
> We've also set the CPU governor to performance on all cores, and tuned
> some sysctl parameters as well:
>
> # for Ceph
> net.ipv4.ip_forward=0
> net.ipv4.conf.default.rp_filter=1
> kernel.sysrq=0
> kernel.core_uses_pid=1
> net.ipv4.tcp_syncookies=0
> #net.netfilter.nf_conntrack_max=2621440
> #net.netfilter.nf_conntrack_tcp_timeout_established = 1800
> # disable netfilter on bridges
> #net.bridge.bridge-nf-call-ip6tables = 0
> #net.bridge.bridge-nf-call-iptables = 0
> #net.bridge.bridge-nf-call-arptables = 0
> vm.min_free_kbytes=1000000
>
> # Controls the default maximum size of a message queue, in bytes
> kernel.msgmnb = 65536
>
> # Controls the maximum size of a single message, in bytes
> kernel.msgmax = 65536
>
> # Controls the maximum shared segment size, in bytes
> kernel.shmmax = 68719476736
>
> # Controls the maximum total amount of shared memory, in pages
> kernel.shmall = 4294967296
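>
> For completeness, one way the block-device and CPU settings above can be
> applied at boot (a sketch; device names assumed to be nvme0n1..nvme11n1 as
> in the lsblk output above, and our actual tooling may differ):
>
> for q in /sys/block/nvme*n1/queue; do
>     echo kyber > $q/scheduler        # blk-mq scheduler
>     echo 128   > $q/read_ahead_kb
>     echo 2     > $q/rq_affinity
>     echo 0     > $q/rotational
> done
> for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
>     echo performance > $g
> done
> sysctl --system                      # reload the sysctl values listed above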
>
>
> The ceph.conf file is:
>
> ...
> osd_pool_default_size = 3
> osd_pool_default_min_size = 2
> osd_pool_default_pg_num = 1600
> osd_pool_default_pgp_num = 1600
>
> debug_crush = 1/1
> debug_buffer = 0/1
> debug_timer = 0/0
> debug_filer = 0/1
> debug_objecter = 0/1
> debug_rados = 0/5
> debug_rbd = 0/5
> debug_ms = 0/5
> debug_throttle = 1/1
>
> debug_journaler = 0/0
> debug_objectcacher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_journal = 0/0
> debug_filestore = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
>
> osd_crush_chooseleaf_type = 0
> filestore_xattr_use_omap = true
>
> rbd_cache = true
> mon_compact_on_trim = false
>
> [osd]
> osd_crush_update_on_start = false
>
> [client]
> rbd_cache = true
> rbd_cache_writethrough_until_flush = true
> rbd_default_features = 1
> admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> log_file = /var/log/ceph/
>
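> With the admin socket defined above, the effective settings of a running
> librbd client can be double-checked, for example:
>
> for s in /var/run/ceph/*.asok; do
>     ceph --admin-daemon "$s" config show | grep -E 'rbd_cache|rbd_default'
> done
>
> (Only librbd clients expose such a socket; kernel-mapped krbd volumes do
> not, and krbd ignores rbd_cache entirely.)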
>
> The cluster has two production pools: one for OpenStack (volumes) with an RF
> of 3 and another pool for databases (db) with an RF of 2. The DBA team has
> performed several tests with a volume mounted on the DB server (via RBD).
> The DB server has the following configuration:
>
> OS: CentOS 6.9 | kernel 4.14.1
> DB: MySQL
> ProLiant BL685c G7
> 4x AMD Opteron Processor 6376 (total of 64 cores)
> 128G RAM
> 1x OneConnect 10Gb NIC (quad-port) - in a bond configuration
> (active/active) with 3 vlans
>
>
>
>
> We also did some tests with *sysbench* on different storage types:
> disk           tps       qps        latency, 95th percentile (ms)
> Local SSD      261,28    5.225,61    5,18
> Ceph NVMe       95,18    1.903,53   12,3
> Pure Storage   196,49    3.929,71    6,32
> NetApp FAS     189,83    3.796,59    6,67
> EMC VMAX       196,14    3.922,82    6,32
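>
> The figures above are sysbench results; an OLTP invocation of the kind that
> produces tps/qps/95th-percentile numbers looks roughly like this (all
> parameters here are illustrative, not the exact ones we used):
>
> sysbench oltp_read_write --db-driver=mysql --mysql-user=sbtest \
>     --mysql-db=sbtest --tables=16 --table-size=1000000 prepare
> sysbench oltp_read_write --db-driver=mysql --mysql-user=sbtest \
>     --mysql-db=sbtest --tables=16 --table-size=1000000 \
>     --threads=16 --time=300 --report-interval=10 run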
>
>
> Is there any specific tuning that I can apply to the ceph cluster, in order
> to improve those numbers? Or are those numbers ok for the type and size of
> the cluster that we have? Any advice would be really appreciated.
>
Have you tried to attach multiple RBD volumes:
- Root for OS
- MySQL data dir
- MySQL InnoDB Logfile
- MySQL Binary Logging
So 4 disks in total to spread the I/O over.
Wido
> Thanks,
>
>
>
>
> *German*
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com