> On Jun 7, 2024, at 13:20, Mark Lehrer <lehrer@xxxxxxxxx> wrote:
>
>> * server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
>
> Thanks for the reply.  The servers have dual Xeon Gold 6154 CPUs with
> 384 GB.

So roughly 7 vcores / HTs per OSD?  Your Ceph is a recent release?

> The drives are older, first gen NVMe - WDC SN620.

Those appear to be a former SanDisk product and are lower performers than more
recent drives; how much of a factor that is I can't say.  Which specific SKU?
There appear to be low- and standard-endurance SKUs: 3.84 / 1.92 TB and
3.2 / 1.6 TB respectively.  What does the lifetime used look like on them?
Less than 80%?

If you really want to eliminate uncertainties:

* Ensure they're updated to the latest firmware
* In rolling fashion, destroy the OSDs, secure-erase each drive, and redeploy
  the OSDs

> osd_memory_target is at the default.  Mellanox CX5 and SN2700
> hardware.  The test client is a similar machine with no drives.

This is via RBD?  Do you have the client RBD cache on or off?

> The CPUs are 80% idle during the test.

Do you have the server BMC/BIOS profile set to performance?  Are deep C-states
disabled, via TuneD or other means?

> The OSDs (according to iostat)

Careful: iostat's metrics are of limited utility on SSDs, especially NVMe.

> I did find it interesting that the wareq-sz column in iostat is around
> 5 during the test - I was expecting 16.  Is there a way to tweak this
> in bluestore?

Not my area of expertise, but I once tried to build OSDs with a >4KB BlueStore
block size and they crashed at startup; 4096 is hardcoded in various places.
Quality SSD firmware will coalesce writes to NAND.  If your firmware surfaces
host vs NAND writes, you might capture deltas over, say, a week of workload
and calculate the WAF (a rough sketch of how to do that is at the end of this
reply).

> These drives are terrible at under 8K I/O.  Not that it really matters
> since we're not I/O bound at all.

Be careful with the assumption that you're not I/O bound; that can be tricky,
and there are multiple facets to it.

I can't find anything specific, but that makes me suspect that internally the
IU (indirection unit) isn't the usual 4KB, perhaps to save a few bucks on DRAM.

> I can also increase threads from 8 to 32 and the iops are roughly
> quadruple, so that's good at least.  Single thread writes are about 250
> iops and like 3.7 MB/sec.  So sad.

Assuming that the pool you're writing to spans all 60 OSDs, what is your PG
count on that pool?  Are there multiple pools in the cluster?  As reported by
`ceph osd df`, on average how many PG replicas are on each OSD?  (Commands to
gather all of that are sketched at the end of this reply.)

> The rados bench process is also under 50% CPU utilization of a single
> core.  This seems like a thread/semaphore kind of issue if I had to
> guess.

It's tricky to debug when there is no obvious bottleneck.  rados bench is a
good smoke test, but fio may better represent the E2E experience; something
like the first sketch below.
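
Roughly like this, untested off the top of my head.  "fio-test" is just a
placeholder image name, and I'm assuming the client.admin keyring and an fio
build that includes the rbd engine; adjust to taste:

  rbd create volumes/fio-test --size 100G

  fio --name=rbd-16k-qd8 \
      --ioengine=rbd --clientname=admin --pool=volumes --rbdname=fio-test \
      --rw=randwrite --bs=16k --iodepth=8 --numjobs=1 \
      --time_based --runtime=60

That's 16K random writes at QD8 with a single job, i.e. roughly the same shape
as your rados bench run, but end to end through librbd the way a client would
actually see it.  If your fio doesn't have the rbd engine, the same job against
a mapped krbd device with --ioengine=libaio --direct=1 is a reasonable stand-in.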
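
For the WAF idea above, a rough sketch with nvme-cli.  Field names vary by
model and firmware, so treat the vendor-plugin line as illustrative rather
than gospel:

  nvme smart-log /dev/nvme0              # host writes: "Data Units Written", in units of 512,000 bytes
  nvme wdc vs-smart-add-log /dev/nvme0   # NAND writes, if the plugin supports the model, e.g. "Physical media units written"

Take both readings now and again after, say, a week of normal workload, then
WAF = delta(NAND bytes written) / delta(host bytes written).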
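
And for the pool / PG questions, the usual places to look ("volumes" taken
from your rados bench command):

  ceph osd pool ls detail            # every pool, with pg_num, size, etc.
  ceph osd pool get volumes pg_num   # PG count on the benched pool
  ceph osd df                        # the PGS column shows PG replicas per OSD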

>
> Thanks,
> Mark
>
>
> On Fri, Jun 7, 2024 at 9:47 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>>
>> Please describe:
>>
>> * server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
>>
>>> On Jun 7, 2024, at 11:32, Mark Lehrer <lehrer@xxxxxxxxx> wrote:
>>>
>>> I've been using MySQL on Ceph forever, and have been down this road
>>> before, but it's been a couple of years so I wanted to see if there is
>>> anything new here.
>>>
>>> So the TL;DR version of this email: is there a good way to improve
>>> 16K write IOPS with a small number of threads?  The OSDs themselves
>>> are idle, so is this just a weakness in the algorithms, or do Ceph
>>> clients need some profiling?  Or "other"?
>>>
>>> Basically, this is one of the worst possible Ceph workloads, so it is
>>> fun to try to push the limits.  I also happen to have a MySQL instance
>>> that is reaching its write IOPS limit, so this is also a last-ditch
>>> effort to keep it on Ceph.
>>>
>>> This cluster is as straightforward as it gets: 6 servers with 10
>>> SSDs each and 100 Gb networking.  I'm using size=3.  During operations,
>>> the OSDs are more or less idle, so I don't suspect any hardware
>>> limitations.
>>>
>>> MySQL has no parallelism, so the number of threads and effective queue
>>> depth stay pretty low.  Therefore, as a proxy for MySQL I use rados
>>> bench with 16K writes and 8 threads.  The RBD actually gets about 2x
>>> this level - still not so great.
>>>
>>> I get about 2000 IOPS with this test:
>>>
>>> # rados bench -p volumes 10 write -t 8 -b 16K
>>> hints = 1
>>> Maintaining 8 concurrent writes of 16384 bytes to objects of size
>>> 16384 for up to 10 seconds or 0 objects
>>> Object prefix: benchmark_data_fstosinfra-5_3652583
>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>>>     0       0         0         0         0         0            -           0
>>>     1       8      2050      2042   31.9004   31.9062   0.00247633  0.00390848
>>>     2       8      4306      4298   33.5728     35.25   0.00278488  0.00371784
>>>     3       8      6607      6599   34.3645   35.9531   0.00277546  0.00363139
>>>     4       7      8951      8944   34.9323   36.6406   0.00414908  0.00357249
>>>     5       8     11292     11284    35.257   36.5625   0.00291434  0.00353997
>>>     6       8     13588     13580   35.3588    35.875   0.00306094  0.00353084
>>>     7       7     15933     15926   35.5432   36.6562   0.00308388   0.0035123
>>>     8       8     18361     18353   35.8399   37.9219   0.00314996  0.00348327
>>>     9       8     20629     20621   35.7947   35.4375   0.00352998   0.0034877
>>>    10       5     23010     23005   35.9397     37.25   0.00395566  0.00347376
>>> Total time run:         10.003
>>> Total writes made:      23010
>>> Write size:             16384
>>> Object size:            16384
>>> Bandwidth (MB/sec):     35.9423
>>> Stddev Bandwidth:       1.63433
>>> Max bandwidth (MB/sec): 37.9219
>>> Min bandwidth (MB/sec): 31.9062
>>> Average IOPS:           2300
>>> Stddev IOPS:            104.597
>>> Max IOPS:               2427
>>> Min IOPS:               2042
>>> Average Latency(s):     0.0034737
>>> Stddev Latency(s):      0.00163661
>>> Max latency(s):         0.115932
>>> Min latency(s):         0.00179735
>>> Cleaning up (deleting benchmark objects)
>>> Removed 23010 objects
>>> Clean up completed and total clean up time: 7.44664
>>>
>>> Are there any good options to improve this?  It seems like the client
>>> side is the bottleneck, since the OSD servers are at like 15%
>>> utilization.
>>>
>>> Thanks,
>>> Mark

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx