On 9/10/20 11:03 AM, George Shuklin wrote:
I'm creating a benchmark suite for Ceph.
While benchmarking the benchmark itself, I checked how fast ceph-osd
can go. I decided to skip all the 'SSD mess' and use brd (block RAM disk,
modprobe brd) as the underlying storage. Brd itself can yield up to
2.7M IOPS in fio. In single-threaded mode (iodepth=1) it can yield up to
750k IOPS. LVM over brd gives about 600k IOPS in single-threaded mode
with iodepth=1 (16us latency).
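For reference, the raw brd baseline can be reproduced with something
along these lines (device size and fio options are illustrative, not
necessarily the exact commands used here):

    # create a 4 GiB RAM-backed block device (rd_size is in KiB)
    modprobe brd rd_nr=1 rd_size=4194304

    # single-threaded 4k random-write baseline against the raw brd device
    fio --name=brd-baseline --filename=/dev/ram0 --ioengine=libaio \
        --direct=1 --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
        --time_based --runtime=30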
But as soon as I put a ceph-osd (bluestore) on it, I see something very
odd. No matter how much parallel load I push onto this OSD, it never
gives more than 30k IOPS, and I can't understand where the bottleneck is.
CPU utilization: ~300%. There are 8 cores on my setup, so CPU is not
the bottleneck.
Network: I've moved the benchmark onto the same host as the OSD, so it's
all localhost. Even counting the network, it's still far from saturation:
30k IOPS (4k) is about 1 Gbit/s, and I have 10G links. In any case the
tests run on localhost, so the network is irrelevant (I've checked that
the traffic stays on localhost). The test itself consumes about 70% of
one core, so there is plenty left.
Replication: I've disabled it (size=1, single OSD in the pool).
single-threaded latency: 200us, 4.8k IOPS.
iodepth=32: 2ms (15k IOPS).
iodepth=16, numjobs=8: 5ms (24k IOPS).
I'm running fio with the 'rados' ioengine, and adding more workers
doesn't change much, so the rados ioengine itself isn't the limit.
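For completeness, the single-replica pool and the rados fio job look
roughly like this (pool name, PG count and job sizes are illustrative):

    # single-replica pool on the lone OSD
    # (newer releases also require mon_allow_pool_size_one / --yes-i-really-mean-it)
    ceph osd pool create bench 128 128
    ceph osd pool set bench size 1
    ceph osd pool set bench min_size 1

    # 4k random writes through the rados ioengine
    fio --name=rados-4k --ioengine=rados --clientname=admin --pool=bench \
        --rw=randwrite --bs=4k --size=1G --iodepth=16 --numjobs=8 \
        --time_based --runtime=60 --group_reporting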
With plenty of CPU and IO headroom left, there is only one plausible
place for the bottleneck: some time-consuming single-threaded code in
ceph-osd.
Are there any knobs to tweak to get higher performance out of ceph-osd?
I'm pretty sure it's not any kind of levelling, GC or other
'iops-related' issue (brd's performance is two orders of magnitude
higher).
So as you've seen, Ceph does a lot more than just write a chunk of data
out to a block on disk. There's a ton of encoding/decoding, crc
checksums, crush calculations, onode lookups, write-ahead logging, and
other work involved that all adds latency. You can overcome some of that
through parallelism, but 30k IOPS per OSD is probably about right for a
Nautilus-era OSD. For Octopus+ the cache refactor in bluestore should
get you farther (40-50k+ for an OSD in isolation). The maximum
performance we've seen in-house is around 70-80k IOPS on a single OSD
using very fast NVMe and highly tuned settings.
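If you want to see where the per-op time actually goes, the OSD admin
socket is a good first stop (osd.0 is just an example id):

    # latency breakdown of the slowest recent ops
    ceph daemon osd.0 dump_historic_ops

    # bluestore/rocksdb counters, including onode cache hits/misses and commit latency
    ceph daemon osd.0 perf dump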
A couple of things you can try (rough example commands below):
- Upgrade to Octopus+ for the cache refactor.
- Make sure you are using the equivalent of the latency-performance or
latency-network tuned profile. The most important part is disabling CPU
C-state transitions.
- Increase osd_memory_target if you have a larger dataset (onode cache
misses in bluestore add a lot of latency).
- Enable turbo if it's disabled (higher clock speeds generally help).
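Roughly, those knobs map onto something like this (the memory target
value is just an example, and the turbo check assumes an Intel box
running intel_pstate):

    # low-latency tuned profile (disables deep C-state transitions, among other things)
    tuned-adm profile latency-performance

    # larger onode/buffer cache for the OSD (8 GiB here, size it to your dataset)
    ceph config set osd osd_memory_target 8589934592

    # verify turbo is on (0 = turbo allowed) and see which C-states are in use
    cat /sys/devices/system/cpu/intel_pstate/no_turbo
    cpupower idle-info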
On the write path you are correct that there is a limitation around the
single kv sync thread. Over the years we've made this less of a
bottleneck, but it's possible you are still hitting it. In our test lab
we've managed to utilize up to around 12-14 cores on a single OSD in
isolation with 16 tp_osd_tp worker threads, and on a larger cluster
about 6-7 cores per OSD. There are probably multiple factors at play,
including context switching, cache thrashing, memory throughput, object
creation/destruction, etc. If you decide to look into it further, you
may want to try wallclock profiling the OSD under load to see where it
spends its time.
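As a rough starting point, a poor-man's wallclock profile plus a look at
the sharded worker settings might look like this (osd.0 and the sample
count are arbitrary):

    # dump all thread stacks a handful of times while the fio load is running
    for i in $(seq 1 10); do
        gdb -p $(pidof ceph-osd) -batch -ex 'thread apply all bt' >> osd_stacks.txt
        sleep 1
    done

    # the tp_osd_tp worker count is shards * threads-per-shard
    ceph daemon osd.0 config get osd_op_num_shards
    ceph daemon osd.0 config get osd_op_num_threads_per_shard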
Mark