With good hardware and correct configuration, an all-flash cluster
should give:
- approx 1-2K write IOPS per thread (0.5-1 ms latency)
- approx 2-5K read IOPS per thread (0.2-0.5 ms latency)
This depends on the quality of the drives and the CPU frequency, but is
independent of the number of drives or cores.
Total IOPS should scale as you add threads, linearly at first, then
gradually saturating depending on the number of drives and cores.
With 8 threads, you should easily get 8K write / 16K read IOPS or more.
With only 8 threads/queue depth, you would not need more than 3 drives
in total for performance; any extra drives are there for capacity, as
they will be idle most of the time.
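A quick sanity check on those numbers, assuming queue depth 1 per
thread, so that IOPS per thread ~= 1 / latency:

  writes: 1 / 1.0 ms = 1000 ... 1 / 0.5 ms = 2000  (1-2K IOPS per thread)
  reads:  1 / 0.5 ms = 2000 ... 1 / 0.2 ms = 5000  (2-5K IOPS per thread)

Eight threads at those per-thread rates give the 8K write / 16K read
figures above.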
Some things to consider to get decent speeds:
1) Make sure you use enterprise SSD/NVMe drives. Ceph syncs its writes,
so it is worth testing the raw drive sync write speed using fio with the
direct and sync flags (see the fio sketch after this list). 10K write
IOPS or more is good.
2) Tune your CPU: disable idle/wait states, set the minimum frequency to
100%, and set the governor to performance (see the cpupower commands
after this list). Make sure you disable anything in the BIOS relating
to energy saving :)
3) Disable the volatile write cache on your NVMe drives.
4) Set the I/O scheduler on NVMe drives to "none".
5) Lower read_ahead_kb to 64KB or less, so as not to hurt random reads
(items 3-5 are shown as a command snippet after this list).
6) There is more advanced tuning like NUMA pinning, but you should get
decent speeds without doing anything fancy.
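For item 1, a sketch of such a test (the device path is a placeholder,
and the run parameters are just examples):

  # WARNING: this writes to the raw device and destroys any data on it.
  # Point it at an unused drive.
  fio --name=synctest --filename=/dev/nvme0n1 --rw=randwrite --bs=4k \
      --iodepth=1 --numjobs=1 --direct=1 --sync=1 --runtime=30 --time_based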
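The CPU tuning in item 2 can be done with cpupower, roughly like this
(the frequency value is an example; use your CPU's maximum):

  # set the performance governor and raise the frequency floor to the max
  cpupower frequency-set -g performance
  cpupower frequency-set -d 3000MHz
  # disable deep idle (wait) states
  cpupower idle-set -D 0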
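Items 3-5 map to nvme-cli and sysfs settings roughly like this (device
names are placeholders, and the sysfs settings do not persist across
reboots):

  # 3) disable the volatile write cache (NVMe feature 0x06); this may
  #    fail on drives that have no volatile cache, e.g. many PLP drives
  nvme set-feature /dev/nvme0n1 -f 6 -v 0
  # 4) set the I/O scheduler to none
  echo none > /sys/block/nvme0n1/queue/scheduler
  # 5) cap readahead at 64KB
  echo 64 > /sys/block/nvme0n1/queue/read_ahead_kb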
I would not recommend any drive caching. It can help bursty workloads,
but it typically gives worse results under consistent heavy load.
Ceph is best for consistent high load; if all you need is a single MySQL
db with a few threads that may at times have a bursty load but is mostly
quiet, then Ceph is probably not the best solution. If, however, you
have this MySQL db in one of your VMs among thousands of other VMs, then
Ceph will be ideal.
/Maged
On 07/06/2024 18:32, Mark Lehrer wrote:
I've been using MySQL on Ceph forever, and have been down this road
before but it's been a couple of years so I wanted to see if there is
anything new here.
So the TL;DR version of this email: is there a good way to improve
16K write IOPS with a small number of threads? The OSDs themselves
are idle, so is this just a weakness in the algorithms, or do Ceph
clients need some profiling? Or "other"?
Basically, this is one of the worst possible Ceph workloads, so it is
fun to try to push the limits. I also happen to have a MySQL instance
that is reaching its write IOPS limit, so this is also a last-ditch
effort to keep it on Ceph.
This cluster is as straightforward as it gets... 6 servers with 10
SSDs each, 100 Gb networking. I'm using size=3. During operations,
the OSDs are more or less idle so I don't suspect any hardware
limitations.
MySQL has no parallelism so the number of threads and effective queue
depth stay pretty low. Therefore, as a proxy for MySQL I use rados
bench with 16K writes and 8 threads. The RBD actually gets about 2x
this level - still not so great.
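A roughly equivalent test can also be run against the RBD path directly
with fio's rbd engine (a sketch; the pool, image, and client names are
placeholders, and the image must exist first):

  # rbd create volumes/bench-img --size 10G
  fio --name=rbdtest --ioengine=rbd --clientname=admin --pool=volumes \
      --rbdname=bench-img --rw=randwrite --bs=16k --iodepth=8 \
      --numjobs=1 --direct=1 --runtime=10 --time_based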
I get about 2000 IOPs with this test:
# rados bench -p volumes 10 write -t 8 -b 16K
hints = 1
Maintaining 8 concurrent writes of 16384 bytes to objects of size
16384 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_fstosinfra-5_3652583
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1       8      2050      2042   31.9004   31.9062   0.00247633  0.00390848
    2       8      4306      4298   33.5728     35.25   0.00278488  0.00371784
    3       8      6607      6599   34.3645   35.9531   0.00277546  0.00363139
    4       7      8951      8944   34.9323   36.6406   0.00414908  0.00357249
    5       8     11292     11284    35.257   36.5625   0.00291434  0.00353997
    6       8     13588     13580   35.3588    35.875   0.00306094  0.00353084
    7       7     15933     15926   35.5432   36.6562   0.00308388   0.0035123
    8       8     18361     18353   35.8399   37.9219   0.00314996  0.00348327
    9       8     20629     20621   35.7947   35.4375   0.00352998   0.0034877
   10       5     23010     23005   35.9397     37.25   0.00395566  0.00347376
Total time run: 10.003
Total writes made: 23010
Write size: 16384
Object size: 16384
Bandwidth (MB/sec): 35.9423
Stddev Bandwidth: 1.63433
Max bandwidth (MB/sec): 37.9219
Min bandwidth (MB/sec): 31.9062
Average IOPS: 2300
Stddev IOPS: 104.597
Max IOPS: 2427
Min IOPS: 2042
Average Latency(s): 0.0034737
Stddev Latency(s): 0.00163661
Max latency(s): 0.115932
Min latency(s): 0.00179735
Cleaning up (deleting benchmark objects)
Removed 23010 objects
Clean up completed and total clean up time :7.44664
Are there any good options to improve this? It seems like the client
side is the bottleneck since the OSD servers are at like 15%
utilization.
Thanks,
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx