On 9/11/20 4:15 AM, George Shuklin wrote:
On 10/09/2020 19:37, Mark Nelson wrote:
On 9/10/20 11:03 AM, George Shuklin wrote:
...
Are there any knobs to tweak to see higher performance for ceph-osd?
I'm pretty sure it's not any kind of leveling, GC or other
'iops-related' issue (brd itself performs two orders of magnitude
better).
So as you've seen, Ceph does a lot more than just write a chunk of
data out to a block on disk. There's a ton of other work involved -
encoding/decoding, crc checksums, crush calculations, onode lookups,
write-ahead logging - and it all adds latency.
You can overcome some of that through parallelism, but 30K IOPS per
OSD is probably about right for a Nautilus-era OSD. For Octopus+ the
cache refactor in bluestore should get you farther (40-50K+ for an
OSD in isolation). The maximum performance we've seen in-house is
around 70-80K IOPS on a single OSD using very fast NVMe and highly
tuned settings.
A couple of things you can try (example commands are sketched after the list):
- upgrade to octopus+ for the cache refactor
- Make sure you are using the equivalent of the latency-performance
or latency-network tuned profile. The most important part is
disabling CPU cstate transitions.
- increase osd_memory_target if you have a larger dataset (onode
cache misses in bluestore add a lot of latency)
- enable turbo if it's disabled (higher clock speed generally helps)
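For what it's worth, here is a minimal sketch of applying the tuned
profile and memory target knobs from the list above. The 8 GiB value
is only an example; size it to your RAM and OSD count, and run it on
the OSD host with an admin keyring and tuned installed.

#!/usr/bin/env python3
# Rough sketch of the host/OSD tuning suggested above. The 8 GiB
# osd_memory_target is an example value only; size it to your RAM and
# OSD count. Requires tuned and the ceph CLI with an admin keyring.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Low-latency host profile; avoids deep C-state transitions.
run(["tuned-adm", "profile", "latency-performance"])

# Larger bluestore cache to reduce onode cache misses.
run(["ceph", "config", "set", "osd", "osd_memory_target", str(8 * 1024**3)])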
On the write path you are correct that there is a limitation
regarding a single kv sync thread. Over the years we've made this
less of a bottleneck but it's possible you still could be hitting
it. In our test lab we've managed to utilize up to around 12-14
cores on a single OSD in isolation with 16 tp_osd_tp worker threads
and on a larger cluster about 6-7 cores per OSD. There are probably
multiple factors at play, including context switching, cache
thrashing, memory throughput, object creation/destruction, etc. If
you decide to look into it further you may want to try wallclock
profiling the OSD under load and seeing where it is spending its time.
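If you want a quick first look before doing a full wallclock profile,
the kv-related latency counters in the OSD's perf dump give a rough
idea of how much of the write path is spent behind the kv sync thread.
A minimal sketch (osd.0 is a placeholder, and exact counter names vary
between releases, so this just filters for "kv"):

#!/usr/bin/env python3
# Minimal sketch: print the kv-related bluestore counters from an OSD's
# perf dump. "osd.0" is a placeholder; run this on the OSD's host so the
# admin socket is reachable. Counter names vary by release, so we just
# filter for "kv" rather than hard-coding them.
import json
import subprocess

dump = json.loads(subprocess.check_output(
    ["ceph", "daemon", "osd.0", "perf", "dump"]))

for name, value in dump.get("bluestore", {}).items():
    if "kv" not in name:
        continue
    if isinstance(value, dict) and value.get("avgcount"):
        # Latency counters carry a running sum (seconds) and an event count.
        avg_us = value["sum"] / value["avgcount"] * 1e6
        print(f"{name}: avg {avg_us:.1f} us over {value['avgcount']} events")
    else:
        print(f"{name}: {value}")

Comparing those averages while the benchmark is running vs. idle should
show whether the kv sync path is where the time is going.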
Thank you for the feedback.
I forgot to mention: it's Octopus, a fresh installation.
I've disabled C-states (governor=performance) and it makes no
difference - same IOPS, same CPU use by ceph-osd. I just can't force
Ceph to consume more than 330% of CPU. I can push reads up to 150k
IOPS (both network and local), hitting the CPU limit, but writes are
somehow restricted by Ceph itself.
OK, can I assume block/db/wal are all on the ramdisk? I'd start a
benchmark and attach gdbpmp to the OSD and see if you can get a
callgraph (1000 samples is nice if you don't mind waiting a bit). That
will tell us a lot more about where the code is spending its time. It
will slow the benchmark way down, fwiw.

Some other things you could try: tweak the number of OSD worker
threads to better match the number of cores in your system. Too many
and you end up with context switching; too few and you limit
parallelism.

You can also check rocksdb compaction stats in the OSD logs using this
tool:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
Given that you are on a ramdisk, the 1GB default WAL limit should be
plenty to let you avoid WAL throttling during compaction, but just
verifying that compactions are not taking a long time is good peace of
mind.
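In case it helps, a rough sketch of those three steps glued together.
The gdbpmp flags and the log-parser invocation are from memory of
those tools' READMEs, so double-check against their help output, and
the shard/thread counts are placeholders (changing them needs an OSD
restart):

#!/usr/bin/env python3
# Rough sketch of the profiling/tuning steps above. The gdbpmp flags and
# the rocksdb log parser invocation are assumptions from memory of
# gdbpmp's README and the cbt tool linked above; check their help output.
# Shard/thread values are placeholders, not recommendations, and only
# take effect after an OSD restart.
import subprocess

# 1. Wallclock-profile the running OSD while the benchmark is going
#    (this will slow the benchmark down a lot, as noted above).
#    pidof -s returns one pid; if the host runs several OSDs, pick the
#    right one by hand instead.
osd_pid = subprocess.check_output(["pidof", "-s", "ceph-osd"]).decode().strip()
subprocess.run(["./gdbpmp.py", "-p", osd_pid, "-n", "1000",
                "-o", "osd.gdbpmp"], check=True)
subprocess.run(["./gdbpmp.py", "-i", "osd.gdbpmp"], check=True)

# 2. Match tp_osd_tp worker threads to the core count; the thread count
#    is roughly osd_op_num_shards * osd_op_num_threads_per_shard.
subprocess.run(["ceph", "config", "set", "osd",
                "osd_op_num_shards", "8"], check=True)
subprocess.run(["ceph", "config", "set", "osd",
                "osd_op_num_threads_per_shard", "2"], check=True)

# 3. Summarize rocksdb compaction time from the OSD log.
subprocess.run(["./ceph_rocksdb_log_parser.py",
                "/var/log/ceph/ceph-osd.0.log"], check=True)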
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx