With good hardware and correct configuration, an all-flash cluster
should give:
- approx 1-2K write IOPS per thread (0.5-1 ms latency)
- approx 2-5K read IOPS per thread (0.2-0.5 ms latency)
This depends on the quality of the drives and the CPU frequency, but is
independent of the number of drives or cores.
Total IOPS should scale as you add threads, linearly at first, then
gradually saturating depending on the number of drives and cores.
With 8 threads, you should easily get 8K write / 16K read IOPS or more.
With only 8 threads/queue depth, you would not need more than 3 drives
in total for performance; any extra drives are there for capacity, as
they will be idle most of the time.
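A quick sanity check on those numbers, assuming queue depth 1 per
thread, so that IOPS per thread ~= 1 / latency:

  writes: 1 / 1.0 ms = 1000 ... 1 / 0.5 ms = 2000  (1-2K IOPS per thread)
  reads:  1 / 0.5 ms = 2000 ... 1 / 0.2 ms = 5000  (2-5K IOPS per thread)

Eight threads at those per-thread rates give the 8K write / 16K read
figures above.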
Some things to consider to get decent speeds:
1) Make sure you use enterprise SSD/NVMe drives. Ceph syncs its writes,
so it is worth testing the raw drive sync write speed using fio with the
direct and sync flags (see the fio sketch after this list). 10K write
IOPS or more is good.
2) Tune your CPU: disable idle/wait states, set the minimum frequency to
100%, and set the governor to performance (see the cpupower commands
after this list). Make sure you disable anything in the BIOS relating
to energy saving :)
3) Disable the volatile write cache on your NVMe drives.
4) Set the I/O scheduler on NVMe drives to "none".
5) Lower read_ahead_kb to 64KB or less, so as not to hurt random reads
(items 3-5 are shown as a command snippet after this list).
6) There is more advanced tuning like NUMA pinning, but you should get
decent speeds without doing anything fancy.
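For item 1, a sketch of such a test (the device path is a placeholder,
and the run parameters are just examples):

  # WARNING: this writes to the raw device and destroys any data on it.
  # Point it at an unused drive.
  fio --name=synctest --filename=/dev/nvme0n1 --rw=randwrite --bs=4k \
      --iodepth=1 --numjobs=1 --direct=1 --sync=1 --runtime=30 --time_based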
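The CPU tuning in item 2 can be done with cpupower, roughly like this
(the frequency value is an example; use your CPU's maximum):

  # set the performance governor and raise the frequency floor to the max
  cpupower frequency-set -g performance
  cpupower frequency-set -d 3000MHz
  # disable deep idle (wait) states
  cpupower idle-set -D 0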
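Items 3-5 map to nvme-cli and sysfs settings roughly like this (device
names are placeholders, and the sysfs settings do not persist across
reboots):

  # 3) disable the volatile write cache (NVMe feature 0x06); this may
  #    fail on drives that have no volatile cache, e.g. many PLP drives
  nvme set-feature /dev/nvme0n1 -f 6 -v 0
  # 4) set the I/O scheduler to none
  echo none > /sys/block/nvme0n1/queue/scheduler
  # 5) cap readahead at 64KB
  echo 64 > /sys/block/nvme0n1/queue/read_ahead_kb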
I would not recommend any drive caching. It can help bursty workloads,
but it typically gives worse results under consistent heavy load.
Ceph is best for consistent high load; if all you need is a single MySQL
db with a few threads that may at times have a bursty load but is mostly
quiet, then Ceph is probably not the best solution. If, however, you
have this MySQL db in one of your VMs among thousands of other VMs, then
Ceph will be ideal.
/Maged
On 07/06/2024 18:32, Mark Lehrer wrote:
I've been using MySQL on Ceph forever, and have been down this road
before but it's been a couple of years so I wanted to see if there is
anything new here.
So the TL;DR version of this email: is there a good way to improve
16K write IOPS with a small number of threads? The OSDs themselves
are idle, so is this just a weakness in the algorithms, or do Ceph
clients need some profiling? Or "other"?
Basically, this is one of the worst possible Ceph workloads, so it is
fun to try to push the limits. I also happen to have a MySQL instance
that is reaching its write IOPS limit, so this is also a last-ditch
effort to keep it on Ceph.
This cluster is as straightforward as it gets... 6 servers with 10
SSDs each, 100 Gb networking. I'm using size=3. During operations,
the OSDs are more or less idle so I don't suspect any hardware
limitations.
MySQL has no parallelism so the number of threads and effective queue
depth stay pretty low. Therefore, as a proxy for MySQL I use rados
bench with 16K writes and 8 threads. The RBD actually gets about 2x
this level - still not so great.
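A roughly equivalent test can also be run against the RBD path directly
with fio's rbd engine (a sketch; the pool, image, and client names are
placeholders, and the image must exist first):

  # rbd create volumes/bench-img --size 10G
  fio --name=rbdtest --ioengine=rbd --clientname=admin --pool=volumes \
      --rbdname=bench-img --rw=randwrite --bs=16k --iodepth=8 \
      --numjobs=1 --direct=1 --runtime=10 --time_based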
I get about 2000 IOPs with this test:
# rados bench -p volumes 10 write -t 8 -b 16K
hints = 1
Maintaining 8 concurrent writes of 16384 bytes to objects of size
16384 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_fstosinfra-5_3652583
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1       8      2050      2042   31.9004   31.9062   0.00247633  0.00390848
    2       8      4306      4298   33.5728     35.25   0.00278488  0.00371784
    3       8      6607      6599   34.3645   35.9531   0.00277546  0.00363139
    4       7      8951      8944   34.9323   36.6406   0.00414908  0.00357249
    5       8     11292     11284    35.257   36.5625   0.00291434  0.00353997
    6       8     13588     13580   35.3588    35.875   0.00306094  0.00353084
    7       7     15933     15926   35.5432   36.6562   0.00308388   0.0035123
    8       8     18361     18353   35.8399   37.9219   0.00314996  0.00348327
    9       8     20629     20621   35.7947   35.4375   0.00352998   0.0034877
   10       5     23010     23005   35.9397     37.25   0.00395566  0.00347376
Total time run: 10.003
Total writes made: 23010
Write size: 16384
Object size: 16384
Bandwidth (MB/sec): 35.9423
Stddev Bandwidth: 1.63433
Max bandwidth (MB/sec): 37.9219
Min bandwidth (MB/sec): 31.9062
Average IOPS: 2300
Stddev IOPS: 104.597
Max IOPS: 2427
Min IOPS: 2042
Average Latency(s): 0.0034737
Stddev Latency(s): 0.00163661
Max latency(s): 0.115932
Min latency(s): 0.00179735
Cleaning up (deleting benchmark objects)
Removed 23010 objects
Clean up completed and total clean up time :7.44664
Are there any good options to improve this? It seems like the client
side is the bottleneck since the OSD servers are at like 15%
utilization.
Thanks,
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx