Re: ceph-osd performance on ram disk


 



On 9/10/20 11:03 AM, George Shuklin wrote:

I'm creating a benchmark suite for Ceph.

While benchmarking the benchmark itself, I checked how fast ceph-osd can go. I decided to skip the whole 'SSD mess' and use brd (the block RAM disk, modprobe brd) as the underlying storage. brd itself can yield up to 2.7M IOPS in fio. In single-threaded mode (iodepth=1) it can yield up to 750k IOPS. LVM over brd gives about 600k IOPS single-threaded with iodepth=1 (16us latency).
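For reference, the raw-device runs looked roughly like this (device name and sizes are just what I used, adjust to taste):

    # load a 4 GiB RAM disk; rd_size is in KiB, device shows up as /dev/ram0
    modprobe brd rd_nr=1 rd_size=4194304

    # brd-qd1.fio -- minimal single-threaded 4k random-write test on the raw device
    [global]
    ioengine=libaio
    direct=1
    bs=4k
    rw=randwrite
    time_based
    runtime=30

    [ram0-qd1]
    filename=/dev/ram0
    iodepth=1
    numjobs=1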

But as soon as I put a ceph-osd (bluestore) on it, I see something very odd. No matter how much parallel load I push onto this OSD, it never gives more than 30k IOPS, and I can't understand where the bottleneck is.

CPU utilization: ~300%. There are 8 cores on my setup, so CPU is not the bottleneck.

Network: I've moved the benchmark onto the same host as the OSD, so it's localhost. Even counting the network, it's still far from saturation: 30k IOPS at 4k is about 1 Gbit/s, and I have 10G links. In any case the tests run on localhost, so the network is irrelevant (I've checked; the traffic stays on localhost). The test itself consumes about 70% of one core, so there is plenty of CPU left.

Replication: I've killed it (size=1, single osd in the pool).
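For reference, the pool was set up roughly like this (pool name and PG count are mine; newer releases may also require --yes-i-really-mean-it when setting size=1):

    ceph osd pool create bench 128 128
    ceph osd pool set bench size 1
    ceph osd pool set bench min_size 1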

single-threaded latency: 200us, 4.8k IOPS.
iodepth=32: 2ms (15k IOPS).
iodepth=16, numjobs=8: 5ms (24k IOPS).

I'm running fio with the 'rados' ioengine, and adding more workers doesn't change much, so the bottleneck is not the rados ioengine.
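The job file looks roughly like this (pool and client names are from my setup):

    # rados-qd32.fio -- 4k random writes through librados
    [global]
    ioengine=rados
    clientname=admin
    pool=bench
    bs=4k
    rw=randwrite
    time_based
    runtime=60

    [qd32]
    iodepth=32
    numjobs=1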

As there is plenty of CPU and IO capacity left, there is only one possible place for the bottleneck: some time-consuming single-threaded code in ceph-osd.

Are there any knobs to tweak to get higher performance out of ceph-osd? I'm pretty sure it's not any kind of levelling, GC or other 'iops-related' issue (brd's performance is two orders of magnitude higher).


So as you've seen, Ceph does a lot more than just write a chunk of data out to a block on disk.  There's tons of encoding/decoding, crc checksums, crush calculations, onode lookups, write-ahead logging, and other work involved that all adds latency.  You can overcome some of that through parallelism, but 30K IOPS per OSD is probably about right for a Nautilus-era OSD.  For Octopus+ the cache refactor in bluestore should get you farther (40-50k+ for an OSD in isolation).  The maximum performance we've seen in-house is around 70-80K IOPS on a single OSD using very fast NVMe and highly tuned settings.


A couple of things you can try:


- upgrade to octopus+ for the cache refactor

- Make sure you are using the equivalent of the latency-performance or latency-network tuned profile.  The most important part is disabling CPU cstate transitions (rough example commands for these tunings follow the list).

- increase osd_memory_target if you have a larger dataset (onode cache misses in bluestore add a lot of latency)

- enable turbo if it's disabled (higher clock speed generally helps)
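Roughly, and with placeholder values (the memory target and the turbo check depend on your hardware and dataset):

    tuned-adm profile latency-performance              # or latency-network
    ceph config set osd osd_memory_target 8589934592   # e.g. 8 GiB per OSD, size to your dataset
    cat /sys/devices/system/cpu/intel_pstate/no_turbo  # 0 = turbo enabled (Intel pstate only)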


On the write path you are correct that there is a limitation around the single kv sync thread.  Over the years we've made this less of a bottleneck, but it's possible you are still hitting it.  In our test lab we've managed to utilize up to around 12-14 cores on a single OSD in isolation with 16 tp_osd_tp worker threads, and on a larger cluster about 6-7 cores per OSD.  There are probably multiple factors at play, including context switching, cache thrashing, memory throughput, object creation/destruction, etc.  If you decide to look into it further you may want to try wallclock profiling the OSD under load to see where it is spending its time.
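A very rough sketch, assuming a single ceph-osd process on the host (this isn't our exact tooling, just a generic gdb-based wall-clock sample you can repeat under load and aggregate):

    # dump backtraces of all ceph-osd threads once; repeat while the benchmark
    # is running and tally which frames show up most often
    gdb -p $(pidof ceph-osd) -batch \
        -ex "set pagination 0" \
        -ex "thread apply all bt"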


Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



