On 9/10/20 11:03 AM, George Shuklin wrote:
I'm creating a benchmark suite for Ceph.
While benchmarking the benchmark itself, I checked how fast ceph-osd
can go. I decided to skip all the 'SSD mess' and use brd (block RAM disk,
modprobe brd) as the underlying storage. Brd itself can yield up to
2.7M IOPS in fio. In single-threaded mode (iodepth=1) it can yield up to
750k IOPS. LVM over brd gives about 600k IOPS in single-threaded mode
with iodepth=1 (16us latency).
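For reference, the raw brd baseline can be reproduced with something
along these lines (device size and fio options are illustrative, not
necessarily the exact commands used here):

    # create a 4 GiB RAM-backed block device (rd_size is in KiB)
    modprobe brd rd_nr=1 rd_size=4194304

    # single-threaded 4k random-write baseline against the raw brd device
    fio --name=brd-baseline --filename=/dev/ram0 --ioengine=libaio \
        --direct=1 --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
        --time_based --runtime=30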
But as soon as I put a ceph-osd (bluestore) on it, I see something very
odd. No matter how much parallel load I push onto this OSD, it never
gives more than 30k IOPS, and I can't understand where the bottleneck is.
CPU utilization: ~300%. There are 8 cores on my setup, so CPU is not
the bottleneck.
Network: I've moved the benchmark onto the same host as the OSD, so it's
all localhost. Even counting the network, it's still far from saturation:
30k IOPS (4k) is about 1 Gbit/s, and I have 10G links. In any case the
tests run on localhost, so the network is irrelevant (I've checked that
the traffic stays on localhost). The test itself consumes about 70% of
one core, so there is plenty left.
Replication: I've disabled it (size=1, single OSD in the pool).
single-threaded latency: 200us, 4.8k IOPS.
iodepth=32: 2ms (15k IOPS).
iodepth=16, numjobs=8: 5ms (24k IOPS).
I'm running fio with the 'rados' ioengine, and adding more workers
doesn't change much, so the rados ioengine itself isn't the limit.
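For completeness, the single-replica pool and the rados fio job look
roughly like this (pool name, PG count and job sizes are illustrative):

    # single-replica pool on the lone OSD
    # (newer releases also require mon_allow_pool_size_one / --yes-i-really-mean-it)
    ceph osd pool create bench 128 128
    ceph osd pool set bench size 1
    ceph osd pool set bench min_size 1

    # 4k random writes through the rados ioengine
    fio --name=rados-4k --ioengine=rados --clientname=admin --pool=bench \
        --rw=randwrite --bs=4k --size=1G --iodepth=16 --numjobs=8 \
        --time_based --runtime=60 --group_reporting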
With plenty of CPU and IO headroom left, there is only one plausible
place for the bottleneck: some time-consuming single-threaded code in
ceph-osd.
Are there any knobs to tweak to get higher performance out of ceph-osd?
I'm pretty sure it's not any kind of levelling, GC or other
'iops-related' issue (brd's performance is two orders of magnitude
higher).
So as you've seen, Ceph does a lot more than just write a chunk of data
out to a block on disk. There's a ton of encoding/decoding, crc
checksums, crush calculations, onode lookups, write-ahead logging, and
other work involved that all adds latency. You can overcome some of that
through parallelism, but 30k IOPS per OSD is probably about right for a
Nautilus-era OSD. For Octopus+ the cache refactor in bluestore should
get you farther (40-50k+ for an OSD in isolation). The maximum
performance we've seen in-house is around 70-80k IOPS on a single OSD
using very fast NVMe and highly tuned settings.
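If you want to see where the per-op time actually goes, the OSD admin
socket is a good first stop (osd.0 is just an example id):

    # latency breakdown of the slowest recent ops
    ceph daemon osd.0 dump_historic_ops

    # bluestore/rocksdb counters, including onode cache hits/misses and commit latency
    ceph daemon osd.0 perf dump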
A couple of things you can try (rough example commands below):
- Upgrade to Octopus+ for the cache refactor.
- Make sure you are using the equivalent of the latency-performance or
latency-network tuned profile. The most important part is disabling CPU
C-state transitions.
- Increase osd_memory_target if you have a larger dataset (onode cache
misses in bluestore add a lot of latency).
- Enable turbo if it's disabled (higher clock speeds generally help).
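Roughly, those knobs map onto something like this (the memory target
value is just an example, and the turbo check assumes an Intel box
running intel_pstate):

    # low-latency tuned profile (disables deep C-state transitions, among other things)
    tuned-adm profile latency-performance

    # larger onode/buffer cache for the OSD (8 GiB here, size it to your dataset)
    ceph config set osd osd_memory_target 8589934592

    # verify turbo is on (0 = turbo allowed) and see which C-states are in use
    cat /sys/devices/system/cpu/intel_pstate/no_turbo
    cpupower idle-info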
On the write path you are correct that there is a limitation around the
single kv sync thread. Over the years we've made this less of a
bottleneck, but it's possible you are still hitting it. In our test lab
we've managed to utilize up to around 12-14 cores on a single OSD in
isolation with 16 tp_osd_tp worker threads, and on a larger cluster
about 6-7 cores per OSD. There are probably multiple factors at play,
including context switching, cache thrashing, memory throughput, object
creation/destruction, etc. If you decide to look into it further, you
may want to try wallclock profiling the OSD under load to see where it
spends its time.
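As a rough starting point, a poor-man's wallclock profile plus a look at
the sharded worker settings might look like this (osd.0 and the sample
count are arbitrary):

    # dump all thread stacks a handful of times while the fio load is running
    for i in $(seq 1 10); do
        gdb -p $(pidof ceph-osd) -batch -ex 'thread apply all bt' >> osd_stacks.txt
        sleep 1
    done

    # the tp_osd_tp worker count is shards * threads-per-shard
    ceph daemon osd.0 config get osd_op_num_shards
    ceph daemon osd.0 config get osd_op_num_threads_per_shard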
Mark