Re: ceph all-nvme mysql performance tuning

German Anders <ganders@xxxxxxxxxxxx> · Thu, 30 Nov 2017 14:25:45 -0300

That's correct: IPoIB for the backend (IRQ affinity already configured) and 10GbE on the frontend. I would love to try RDMA but, like you said, it is not stable for production, so I think I'll have to wait for that. Yeah, the thing is that it's not my decision to go for 50GbE or 100GbE... :( so 10GbE for the front-end it will be...

It would be really helpful if someone could run the following sysbench test on a MySQL db so I could make some comparisons:

my.cnf configuration file:

[mysqld_safe]
nice                                    = 0
pid-file                                = /home/test_db/mysql/mysql.pid

[client]
port                                    = 33033
socket                                  = /home/test_db/mysql/mysql.sock

[mysqld]
user                                    = test_db
port                                    = 33033
socket                                  = /home/test_db/mysql/mysql.sock
pid-file                                = /home/test_db/mysql/mysql.pid
log-error                               = /home/test_db/mysql/mysql.err
datadir                                 = /home/test_db/mysql/data
tmpdir                                  = /tmp
server-id                               = 1

# ** Binlogging **
#log-bin                                = /home/test_db/mysql/binlog/mysql-bin
#log_bin_index                          = /home/test_db/mysql/binlog/mysql-bin.index
expire_logs_days                        = 1
max_binlog_size                         = 512M

thread_handling                         = pool-of-threads
thread_pool_max_threads                 = 300

# ** Slow query log **
slow_query_log                          = 1
slow_query_log_file                     = /home/test_db/mysql/mysql-slow.log
long_query_time                         = 10
log_output                              = FILE
log_slow_slave_statements               = 1
log_slow_verbosity                      = query_plan,innodb,explain

# ** INNODB Specific options **
transaction_isolation                   = READ-COMMITTED
innodb_buffer_pool_size                 = 12G
innodb_data_file_path                   = ibdata1:256M:autoextend
innodb_thread_concurrency               = 16
innodb_log_file_size                    = 256M
innodb_log_files_in_group               = 3
innodb_file_per_table
innodb_log_buffer_size                  = 16M
innodb_stats_on_metadata                = 0
innodb_lock_wait_timeout                = 30
# innodb_flush_method                   = O_DSYNC
innodb_flush_method                     = O_DIRECT
max_connections                         = 10000
max_connect_errors                      = 999999
max_allowed_packet                      = 128M
skip-host-cache
skip-name-resolve
explicit_defaults_for_timestamp         = 1
performance_schema                      = OFF
log_warnings                            = 2
event_scheduler                         = ON

# ** Specific Galera Cluster Settings **
binlog_format                           = ROW
default-storage-engine                  = innodb
query_cache_size                        = 0
query_cache_type                        = 0

The volume is just an RBD image (on an RF=3 pool) with the default object order of 22, mounted on /home/test_db/mysql/data
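
For anyone comparing against a volume created with a different order: the order is the base-2 logarithm of the RBD object size, so 22 works out to 4 MiB objects. A minimal sketch of that arithmetic (the function name rbd_object_bytes is just an illustration, not a Ceph tool):

```shell
#!/bin/sh
# Object size in bytes for a given RBD order (object size = 2^order).
# rbd_object_bytes is a hypothetical helper name, not part of Ceph.
rbd_object_bytes() {
    echo $((1 << $1))
}

bytes=$(rbd_object_bytes 22)
echo "order 22 -> $bytes bytes ($((bytes / 1024 / 1024)) MiB per object)"
```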

commands for the test:

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua --mysql-host=<hostname> --mysql-port=33033 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=200000 --threads=10 --rand-type=uniform --rand-init=on cleanup > /dev/null 2>/dev/null

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua --mysql-host=<hostname> --mysql-port=33033 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=200000 --threads=10 --rand-type=uniform --rand-init=on prepare > /dev/null 2>/dev/null

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-host=<hostname> --mysql-port=33033 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=200000 --threads=20 --rand-type=uniform --rand-init=on --time=120 run > result_sysbench_perf_test.out 2>/dev/null

I'm looking for tps, qps and the 95th percentile latency. Could anyone with an all-NVMe cluster run the test and share the results? I would really appreciate the help :)
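
In case it saves someone time, here is a rough sketch for pulling those three numbers out of result_sysbench_perf_test.out. It assumes the report layout printed by sysbench 1.0.x ("transactions: N (X per sec.)", "queries: N (X per sec.)" and a "95th percentile:" line under "Latency (ms)"); other versions print a different layout and would need different patterns. The function name sysbench_summary is made up for this example.

```shell
#!/bin/sh
# Print tps, qps and the 95th percentile latency from a sysbench report file.
# Assumed input lines (sysbench 1.0.x style):
#   transactions:        12345  (102.85 per sec.)
#   queries:             246900 (2057.05 per sec.)
#   95th percentile:     215.44
# sysbench_summary is a hypothetical helper, not part of sysbench.
sysbench_summary() {
    awk '
        # Third field is "(<rate>"; strip the leading parenthesis.
        /^ *transactions:/ { tps = $3; sub(/^\(/, "", tps) }
        /^ *queries:/      { qps = $3; sub(/^\(/, "", qps) }
        # The percentile value is the last field on its line.
        /95th percentile:/ { p95 = $NF }
        END { printf "tps=%s qps=%s p95_ms=%s\n", tps, qps, p95 }
    ' "$1"
}
```

Call it as sysbench_summary result_sysbench_perf_test.out; it prints a single line with the three values, which makes it easy to paste results into a reply.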

Thanks in advance,

Best,

German 

2017-11-29 19:14 GMT-03:00 Zoltan Arnold Nagy <zoltan@xxxxxxxxxxxxxxxxxx>:
On 2017-11-27 14:02, German Anders wrote:

4x 2U servers:

  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection

  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

so I assume you are using IPoIB as the cluster network for the replication...

  1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 vlans

... and the 10GbE network for the front-end network?

With 4k writes your network latency will be very high (see the flame graphs in the Intel NVMe presentation from the Boston OpenStack Summit - not sure if there is a newer deck that somebody could link ;)) and the time will be spent in the kernel. You could give RDMAMessenger a try but it's not stable in the current LTS release.

If I were you I'd be looking at 100GbE - we've recently pulled in a bunch of 100GbE links and it's been wonderful to see 100+GB/s going over the network for just storage.

Some people suggested mounting multiple RBD volumes. Unless I'm mistaken, all IO will still be single-threaded towards librbd, so that gives no speedup unless you're using a very recent qemu/libvirt combination with the proper libvirt disk settings.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com