Re: ceph-osd performance on ram disk

On 9/11/20 4:15 AM, George Shuklin wrote:
On 10/09/2020 19:37, Mark Nelson wrote:
On 9/10/20 11:03 AM, George Shuklin wrote:

...
Are there any knobs to tweak to get higher performance out of ceph-osd? I'm pretty sure it's not any kind of leveling, GC or other 'iops-related' issue (brd's performance is two orders of magnitude higher).


So as you've seen, Ceph does a lot more than just write a chunk of data out to a block on disk.  There's tons of encoding/decoding happening, crc checksums, crush calculations, onode lookups, write-ahead logging, and other work involved that all adds latency.  You can overcome some of that through parallelism, but 30K IOPS per OSD is probably about right for a Nautilus-era OSD.  For Octopus+ the cache refactor in bluestore should get you farther (40-50K+ IOPS for an OSD in isolation).  The maximum performance we've seen in-house is around 70-80K IOPS on a single OSD using very fast NVMe and highly tuned settings.


A couple of things you can try (rough example commands follow the list):


- upgrade to octopus+ for the cache refactor

- Make sure you are using the equivalent of the latency-performance or latency-network tuned profile.  The most important part is disabling CPU C-state transitions.

- increase osd_memory_target if you have a larger dataset (onode cache misses in bluestore add a lot of latency)

- enable turbo if it's disabled (higher clock speed generally helps)
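
In case it saves a lookup, the knobs above translate to roughly the following; a sketch assuming systemd, the tuned daemon, an intel_pstate CPU (paths differ for other drivers), and an illustrative 8GiB memory target:

  # latency-oriented tuned profile (disables deep C-state transitions)
  $ tuned-adm profile latency-performance
  # per-OSD memory target, 8GiB here purely as an example
  $ ceph config set osd osd_memory_target 8589934592
  # verify turbo is enabled (0 means turbo is on for intel_pstate)
  $ cat /sys/devices/system/cpu/intel_pstate/no_turbo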


On the write path you are correct that there is a limitation regarding a single kv sync thread.  Over the years we've made this less of a bottleneck, but it's possible you could still be hitting it.  In our test lab we've managed to utilize up to around 12-14 cores on a single OSD in isolation with 16 tp_osd_tp worker threads, and on a larger cluster about 6-7 cores per OSD.  There are probably multiple factors at play, including context switching, cache thrashing, memory throughput, object creation/destruction, etc.  If you decide to look into it further, you may want to try wallclock profiling the OSD under load and seeing where it is spending its time.
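
For the wallclock profiling, a minimal gdbpmp session (https://github.com/markhpc/gdbpmp) looks roughly like this; flags are from the project README, so verify against --help, and this assumes a single ceph-osd process on the host:

  # collect 1000 wallclock samples from the running OSD (this slows it down)
  $ ./gdbpmp.py -p $(pidof ceph-osd) -n 1000 -o osd.gdbpmp
  # display the resulting callgraph
  $ ./gdbpmp.py -i osd.gdbpmp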

Thank you for the feedback.

I forgot to mention this: it's Octopus, a fresh installation.

I've disabled C-states (governor=performance), and it makes no difference: same IOPS, same CPU use by ceph-osd.  I just can't force Ceph to consume more than 330% of CPU.  I can push reads up to 150K IOPS (both network and local), hitting the CPU limit, but writes are somehow restricted by Ceph itself.


OK, can I assume block/db/wal are all on the ramdisk?  I'd start a benchmark, attach gdbpmp to the OSD, and see if you can get a callgraph (1000 samples is nice if you don't mind waiting a bit).  That will tell us a lot more about where the code is spending its time.  FWIW, it will slow the benchmark way down.

Some other things you could try:

- Tweak the number of OSD worker threads to better match the number of cores in your system.  Too many and you end up with context switching; too few and you limit parallelism.  (A config sketch follows the tool link below.)

- Check the rocksdb compaction stats in the OSD logs using this tool:


https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
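
Roughly speaking, you point it at the OSD log; the exact invocation may differ by version, so check the script itself, but something along these lines, assuming it takes the log path (with rocksdb logging enabled) as its argument:

  $ python3 ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log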
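
And for the worker thread tweak mentioned above: the tp_osd_tp pool size is the product of two options.  A sketch with illustrative values (not recommendations):

  # 8 shards x 2 threads per shard = 16 tp_osd_tp worker threads
  $ ceph config set osd osd_op_num_shards 8
  $ ceph config set osd osd_op_num_threads_per_shard 2
  # restart the OSD so the new thread counts take effect
  $ systemctl restart ceph-osd@0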


Given that you are on a ramdisk, the 1GB default WAL limit should be plenty to avoid WAL throttling during compaction, but verifying that compactions are not taking a long time is good for peace of mind.
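
If you want to eyeball the WAL limit your OSDs are actually running with, it lives in the rocksdb option string (the max_total_wal_size key, assuming default settings); one quick way:

  # dump the rocksdb tunables for osd.0 and pick out the WAL-related keys
  $ ceph config get osd.0 bluestore_rocksdb_options | tr ',' '\n' | grep -i wal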


Mark





_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



