On 9/11/20 4:15 AM, George Shuklin wrote:
On 10/09/2020 19:37, Mark Nelson wrote:
On 9/10/20 11:03 AM, George Shuklin wrote:
...
Are there any knobs to tweak to see higher performance for ceph-osd?
I'm pretty sure it's not any kind of leveling, GC or other
'iops-related' issue (brd itself performs two orders of magnitude
better).
So as you've seen, Ceph does a lot more than just write a chunk of
data out to a block on disk. There's a ton of other work involved -
encoding/decoding, crc checksums, crush calculations, onode lookups,
write-ahead logging - and it all adds latency.
You can overcome some of that through parallelism, but 30K IOPS per
OSD is probably about right for a Nautilus-era OSD. For Octopus+ the
cache refactor in bluestore should get you farther (40-50K+ for an
OSD in isolation). The maximum performance we've seen in-house is
around 70-80K IOPS on a single OSD using very fast NVMe and highly
tuned settings.
A couple of things you can try (example commands are sketched after the list):
- upgrade to octopus+ for the cache refactor
- Make sure you are using the equivalent of the latency-performance
or latency-network tuned profile. The most important part is
disabling CPU cstate transitions.
- increase osd_memory_target if you have a larger dataset (onode
cache misses in bluestore add a lot of latency)
- enable turbo if it's disabled (higher clock speed generally helps)
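For what it's worth, here is a minimal sketch of applying the tuned
profile and memory target knobs from the list above. The 8 GiB value
is only an example; size it to your RAM and OSD count, and run it on
the OSD host with an admin keyring and tuned installed.

#!/usr/bin/env python3
# Rough sketch of the host/OSD tuning suggested above. The 8 GiB
# osd_memory_target is an example value only; size it to your RAM and
# OSD count. Requires tuned and the ceph CLI with an admin keyring.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Low-latency host profile; avoids deep C-state transitions.
run(["tuned-adm", "profile", "latency-performance"])

# Larger bluestore cache to reduce onode cache misses.
run(["ceph", "config", "set", "osd", "osd_memory_target", str(8 * 1024**3)])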
On the write path you are correct that there is a limitation
regarding a single kv sync thread. Over the years we've made this
less of a bottleneck but it's possible you still could be hitting
it. In our test lab we've managed to utilize up to around 12-14
cores on a single OSD in isolation with 16 tp_osd_tp worker threads
and on a larger cluster about 6-7 cores per OSD. There are probably
multiple factors at play, including context switching, cache
thrashing, memory throughput, object creation/destruction, etc. If
you decide to look into it further you may want to try wallclock
profiling the OSD under load and seeing where it is spending its time.
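If you want a quick first look before doing a full wallclock profile,
the kv-related latency counters in the OSD's perf dump give a rough
idea of how much of the write path is spent behind the kv sync thread.
A minimal sketch (osd.0 is a placeholder, and exact counter names vary
between releases, so this just filters for "kv"):

#!/usr/bin/env python3
# Minimal sketch: print the kv-related bluestore counters from an OSD's
# perf dump. "osd.0" is a placeholder; run this on the OSD's host so the
# admin socket is reachable. Counter names vary by release, so we just
# filter for "kv" rather than hard-coding them.
import json
import subprocess

dump = json.loads(subprocess.check_output(
    ["ceph", "daemon", "osd.0", "perf", "dump"]))

for name, value in dump.get("bluestore", {}).items():
    if "kv" not in name:
        continue
    if isinstance(value, dict) and value.get("avgcount"):
        # Latency counters carry a running sum (seconds) and an event count.
        avg_us = value["sum"] / value["avgcount"] * 1e6
        print(f"{name}: avg {avg_us:.1f} us over {value['avgcount']} events")
    else:
        print(f"{name}: {value}")

Comparing those averages while the benchmark is running vs. idle should
show whether the kv sync path is where the time is going.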
Thank you for the feedback.
I forgot to mention: it's Octopus, a fresh installation.
I've disabled C-states (governor=performance) and it makes no
difference - same IOPS, same CPU use by ceph-osd. I just can't force
Ceph to consume more than 330% of CPU. I can push reads up to 150k
IOPS (both network and local), hitting the CPU limit, but writes are
somehow restricted by Ceph itself.
OK, can I assume block/db/wal are all on the ramdisk? I'd start a
benchmark and attach gdbpmp to the OSD and see if you can get a
callgraph (1000 samples is nice if you don't mind waiting a bit). That
will tell us a lot more about where the code is spending its time. It
will slow the benchmark way down, fwiw.

Some other things you could try: tweak the number of OSD worker
threads to better match the number of cores in your system. Too many
and you end up with context switching; too few and you limit
parallelism.

You can also check rocksdb compaction stats in the OSD logs using this
tool:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
Given that you are on a ramdisk, the 1GB default WAL limit should be
plenty to let you avoid WAL throttling during compaction, but just
verifying that compactions are not taking a long time is good peace of
mind.
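In case it helps, a rough sketch of those three steps glued together.
The gdbpmp flags and the log-parser invocation are from memory of
those tools' READMEs, so double-check against their help output, and
the shard/thread counts are placeholders (changing them needs an OSD
restart):

#!/usr/bin/env python3
# Rough sketch of the profiling/tuning steps above. The gdbpmp flags and
# the rocksdb log parser invocation are assumptions from memory of
# gdbpmp's README and the cbt tool linked above; check their help output.
# Shard/thread values are placeholders, not recommendations, and only
# take effect after an OSD restart.
import subprocess

# 1. Wallclock-profile the running OSD while the benchmark is going
#    (this will slow the benchmark down a lot, as noted above).
#    pidof -s returns one pid; if the host runs several OSDs, pick the
#    right one by hand instead.
osd_pid = subprocess.check_output(["pidof", "-s", "ceph-osd"]).decode().strip()
subprocess.run(["./gdbpmp.py", "-p", osd_pid, "-n", "1000",
                "-o", "osd.gdbpmp"], check=True)
subprocess.run(["./gdbpmp.py", "-i", "osd.gdbpmp"], check=True)

# 2. Match tp_osd_tp worker threads to the core count; the thread count
#    is roughly osd_op_num_shards * osd_op_num_threads_per_shard.
subprocess.run(["ceph", "config", "set", "osd",
                "osd_op_num_shards", "8"], check=True)
subprocess.run(["ceph", "config", "set", "osd",
                "osd_op_num_threads_per_shard", "2"], check=True)

# 3. Summarize rocksdb compaction time from the OSD log.
subprocess.run(["./ceph_rocksdb_log_parser.py",
                "/var/log/ceph/ceph-osd.0.log"], check=True)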
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx