Re: Ceph RBD, MySQL write IOPs - what is possible?

> On Jun 7, 2024, at 13:20, Mark Lehrer <lehrer@xxxxxxxxx> wrote:
> 
>> server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
> 
> Thanks for the reply.  The servers have dual Xeon Gold 6154 CPUs with
> 384 GB

So roughly 7 vcores / HT threads per OSD?  Are you running a recent Ceph release?

> The drives are older, first gen NVMe - WDC SN620.

Those appear to be a former SanDisk product and lower performers than more recent drives; how much of a factor that is I can't say.
Which specific SKU?  There appear to be low- and standard-endurance SKUs: 3.84 T or 1.92 T, and 3.2 T or 1.6 T respectively.

What does the lifetime used look like on them?  Less than 80%?  If you really want to eliminate uncertainties:

* Ensure they're updated to the latest firmware
* In rolling fashion, destroy each OSD, secure-erase the underlying drive, and redeploy it (a rough sketch follows below)
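
Something along these lines, as a hedged sketch only: it assumes smartctl and nvme-cli are installed and a cephadm-managed cluster, and the device name and OSD id are placeholders; adapt to your own deployment tooling.

  # check wear and firmware on each drive
  smartctl -a /dev/nvme0n1 | grep -iE 'percentage used|firmware'
  nvme smart-log /dev/nvme0 | grep -i percentage_used

  # rebuild one OSD at a time, waiting for recovery between drives
  ceph orch osd rm 12 --replace --zap     # drain and remove osd.12
  nvme format /dev/nvme0n1 --ses=1        # secure erase once the OSD is gone
  # cephadm then redeploys per your service spec; wait for HEALTH_OK
  # before moving on to the next drive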

> osd_memory_target is at the default.  Mellanox CX5 and SN2700
> hardware.  The test client is a similar machine with no drives.

This is via RBD?  Do you have the client RBD cache on or off?
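
If you're not sure, something like this should show what librbd clients are being handed (assuming a reasonably recent release; the client's local ceph.conf can still override it):

  ceph config get client rbd_cache
  ceph config get client rbd_cache_policy
  rbd config pool get volumes rbd_cache    # per-pool override, if any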

> The CPUs are 80% idle during the test.

Do you have the server BMC/BIOS profile set to performance?  Deep C-states disabled via TuneD or other means?
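
For reference, one way to check and pin that on the OSD hosts (the tuned profile name here is just a common choice, not the only option):

  tuned-adm active                        # current profile
  tuned-adm profile latency-performance   # limits deep C-states via a low latency target
  cpupower idle-info                      # verify which idle states remain enabled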

> The OSDs (according to iostat)

Careful, iostat's metrics are of limited utility on SSDs, especially NVMe.

> I did find it interesting that the wareq-sz option in iostat is around
> 5 during the test - I was expecting 16.  Is there a way to tweak this
> in bluestore?

Not my area of expertise, but when I once tried to make OSDs with a >4KB BlueStore block size, they crashed at startup.  4096 is hardcoded in various places.
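
If you want to see what your cluster would use, the relevant knob looks like this; note that the value actually in effect for an existing OSD was baked in when it was created, so the running config only tells you what a newly created OSD would get:

  ceph config get osd bluestore_min_alloc_size_ssd   # default applied to new OSDs on SSD
  ceph config get osd bluestore_min_alloc_size_hdd   # default applied to new OSDs on HDD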

Quality SSD firmware will coalesce writes to NAND.  If your firmware surfaces host vs. NAND writes, you might capture deltas over, say, a week of workload and calculate the WAF (write amplification factor).
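
As a rough illustration of the arithmetic, assuming the drive reports both host and physical (NAND) writes; many WDC/SanDisk parts expose the latter via the nvme-cli vendor plugin:

  nvme smart-log /dev/nvme0 | grep -i 'data units written'   # host writes, in 1000 * 512-byte units
  nvme wdc vs-smart-add-log /dev/nvme0                       # vendor log, includes physical media writes

  # over the sample interval:
  #   WAF = delta(NAND bytes written) / delta(host bytes written)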

> These drives are terrible at under 8K I/O.  Not that it really matters since we're not I/O bound at all.

"I/O bound" can be tricky; be careful with that assumption, as there are multiple facets.  I can't find anything specific on these drives, but the poor sub-8K performance makes me suspect that internally the IU (indirection unit) isn't the usual 4KB, perhaps to save a few bucks on DRAM.
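
One quick sanity check on that assumption is to watch per-OSD latencies while the benchmark runs, e.g.:

  ceph osd perf | sort -nk2 | tail    # OSDs with the highest commit latency (ms)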

> I can also increase threads from 8 to 32 and the iops are roughly
> quadruple so that's good at least.  Single thread writes are about 250
> iops and like 3.7MB/sec.  So sad.

Assuming that the pool you're writing to spans all 60 OSDs, what is your PG count on that pool?  Are there multiple pools in the cluster?  As reported by `ceph osd df`, on average how many PG replicas are on each OSD?
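
For reference, this is the sort of thing I mean (the pool name is taken from your bench run):

  ceph osd pool ls detail           # pg_num / pgp_num and replication size per pool
  ceph osd df                       # PGS column shows PG replicas per OSD; ~100 is a common target
  ceph osd pool autoscale-status    # if the autoscaler is enabled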

> The rados bench process is also under 50% CPU utilization of a single
> core.  This seems like a thread/semaphore kind of issue if I had to
> guess.  It's tricky to debug when there is no obvious bottleneck.

rados bench is a good smoke test, but fio may better represent the E2E experience.
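
As a hedged sketch of what I mean, assuming fio was built with the rbd ioengine and using a throwaway image (the image name here is made up):

  rbd create volumes/fio-test --size 20G

  fio --name=mysql-proxy --ioengine=rbd --clientname=admin --pool=volumes \
      --rbdname=fio-test --rw=randwrite --bs=16k --iodepth=8 --numjobs=1 \
      --time_based --runtime=60 --group_reporting

  rbd rm volumes/fio-test

That keeps the small block size and low queue depth that MySQL will actually generate.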

> 
> Thanks,
> Mark
> 
> 
> 
> 
> On Fri, Jun 7, 2024 at 9:47 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>> 
>> Please describe:
>> 
>> * server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
>> 
>>> On Jun 7, 2024, at 11:32, Mark Lehrer <lehrer@xxxxxxxxx> wrote:
>>> 
>>> I've been using MySQL on Ceph forever, and have been down this road
>>> before but it's been a couple of years so I wanted to see if there is
>>> anything new here.
>>> 
>>> So the TL:DR version of this email - is there a good way to improve
>>> 16K write IOPs with a small number of threads?  The OSDs themselves
>>> are idle so is this just a weakness in the algorithms or do ceph
>>> clients need some profiling?  Or "other"?
>>> 
>>> Basically, this is one of the worst possible Ceph workloads so it is
>>> fun to try to push the limits.  I also happen to have a MySQL instance
>>> that is reaching the write IOPs limit so this is also a last-ditch
>>> effort to keep it on Ceph.
>>> 
>>> This cluster is as straightforward as it gets... 6 servers with 10
>>> SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
>>> the OSDs are more or less idle so I don't suspect any hardware
>>> limitations.
>>> 
>>> MySQL has no parallelism so the number of threads and effective queue
>>> depth stay pretty low.  Therefore, as a proxy for MySQL I use rados
>>> bench with 16K writes and 8 threads.  The RBD actually gets about 2x
>>> this level - still not so great.
>>> 
>>> I get about 2000 IOPs with this test:
>>> 
>>> # rados bench -p volumes 10 write -t 8 -b 16K
>>> hints = 1
>>> Maintaining 8 concurrent writes of 16384 bytes to objects of size
>>> 16384 for up to 10 seconds or 0 objects
>>> Object prefix: benchmark_data_fstosinfra-5_3652583
>>> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>>   0       0         0         0         0         0           -           0
>>>   1       8      2050      2042   31.9004   31.9062  0.00247633  0.00390848
>>>   2       8      4306      4298   33.5728     35.25  0.00278488  0.00371784
>>>   3       8      6607      6599   34.3645   35.9531  0.00277546  0.00363139
>>>   4       7      8951      8944   34.9323   36.6406  0.00414908  0.00357249
>>>   5       8     11292     11284    35.257   36.5625  0.00291434  0.00353997
>>>   6       8     13588     13580   35.3588    35.875  0.00306094  0.00353084
>>>   7       7     15933     15926   35.5432   36.6562  0.00308388   0.0035123
>>>   8       8     18361     18353   35.8399   37.9219  0.00314996  0.00348327
>>>   9       8     20629     20621   35.7947   35.4375  0.00352998   0.0034877
>>>  10       5     23010     23005   35.9397     37.25  0.00395566  0.00347376
>>> Total time run:         10.003
>>> Total writes made:      23010
>>> Write size:             16384
>>> Object size:            16384
>>> Bandwidth (MB/sec):     35.9423
>>> Stddev Bandwidth:       1.63433
>>> Max bandwidth (MB/sec): 37.9219
>>> Min bandwidth (MB/sec): 31.9062
>>> Average IOPS:           2300
>>> Stddev IOPS:            104.597
>>> Max IOPS:               2427
>>> Min IOPS:               2042
>>> Average Latency(s):     0.0034737
>>> Stddev Latency(s):      0.00163661
>>> Max latency(s):         0.115932
>>> Min latency(s):         0.00179735
>>> Cleaning up (deleting benchmark objects)
>>> Removed 23010 objects
>>> Clean up completed and total clean up time :7.44664
>>> 
>>> 
>>> Are there any good options to improve this?  It seems like the client
>>> side is the bottleneck since the OSD servers are at like 15%
>>> utilization.
>>> 
>>> Thanks,
>>> Mark
>> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



