We are running fio benchmarks against a 3-node Ceph cluster with a 4 KB object size, and we use the gdbpmp profiler (https://github.com/markhpc/gdbpmp) to analyze thread performance. Based on the profiling report, I have two questions:
- In our setting, the bstore_kv_sync thread submits asynchronous RocksDB transactions 98% of the time (the remaining 2% are synchronous transactions). How does this align with Ceph's durability guarantees? What happens if the OSD fails after returning a success indication but before the WAL memory buffer is flushed to disk? Are you assuming the WAL buffer is flushed while the value is being written to the memtable? While that is reasonable, it cannot guarantee 100% durability. Am I missing something in the write path? (The first sketch after these questions shows the sync/async distinction I mean.)
- Each OSD has a single bstore_kv_sync thread and 16 tp_osd_tp threads. The bstore_kv_sync thread is always busy, while the tp_osd_tp threads are idle most of the time. Given that 3 of the RocksDB column families are sharded, and the sharding is configurable, why not run multiple (3) bstore_kv_sync threads, assuming they will rarely conflict? This has the potential to remove the RocksDB bottleneck and increase IOPS. (The second sketch below shows the kind of per-column-family parallelism I have in mind.)
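
To make the first question concrete, here is a minimal RocksDB sketch (not Ceph code; the path and keys are made up) of the distinction I am asking about: a write with WriteOptions::sync == false returns once the record is in the OS-buffered WAL and the memtable, while sync == true (or an explicit FlushWAL(true)) forces the WAL to stable storage before returning.

    #include <cassert>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::DB* db = nullptr;
      rocksdb::Options options;
      options.create_if_missing = true;

      // Hypothetical path, for illustration only.
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv_sync_demo", &db);
      assert(s.ok());

      // Asynchronous transaction: the WAL record goes to the OS page cache
      // and the value into the memtable, but no fsync/fdatasync is issued.
      // A power loss at this point can lose the write even though the call
      // returned OK -- this is the 98% case I see in bstore_kv_sync.
      rocksdb::WriteOptions async_opts;
      async_opts.sync = false;
      s = db->Put(async_opts, "key-1", "value-1");
      assert(s.ok());

      // Synchronous transaction: the WAL is synced before Put() returns, so
      // the write survives an OSD/node crash -- the 2% case.
      rocksdb::WriteOptions sync_opts;
      sync_opts.sync = true;
      s = db->Put(sync_opts, "key-2", "value-2");
      assert(s.ok());

      // Alternatively, a batch of async writes can be made durable
      // afterwards by forcing a WAL flush and sync in one call.
      s = db->FlushWAL(/*sync=*/true);
      assert(s.ok());

      delete db;
      return 0;
    }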
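And for the second question, a minimal sketch of the parallelism I have in mind: plain RocksDB already accepts concurrent writers, so one writer thread per sharded column family is possible in principle. Again this is not Ceph code; the column family names and path are invented for illustration, and all CFs still share a single WAL internally.

    #include <cassert>
    #include <string>
    #include <thread>
    #include <vector>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.create_missing_column_families = true;

      // Hypothetical shard names, for illustration only.
      std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs = {
          {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
          {"shard-0", rocksdb::ColumnFamilyOptions()},
          {"shard-1", rocksdb::ColumnFamilyOptions()},
          {"shard-2", rocksdb::ColumnFamilyOptions()}};

      rocksdb::DB* db = nullptr;
      std::vector<rocksdb::ColumnFamilyHandle*> handles;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv_shard_demo",
                                            cf_descs, &handles, &db);
      assert(s.ok());

      // One writer per shard, mimicking what multiple bstore_kv_sync
      // threads could look like when the sharded CFs rarely touch the same
      // keys.  RocksDB allows concurrent Put() calls; the threads only
      // serialize on the shared WAL/write path inside the library.
      std::vector<std::thread> writers;
      for (int shard = 1; shard <= 3; ++shard) {
        writers.emplace_back([&, shard]() {
          rocksdb::WriteOptions wopts;
          wopts.sync = false;  // same async mode as the 98% case above
          for (int i = 0; i < 1000; ++i) {
            std::string key = "key-" + std::to_string(i);
            rocksdb::Status ws = db->Put(wopts, handles[shard], key, "value");
            assert(ws.ok());
          }
        });
      }
      for (auto& t : writers) t.join();

      for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
      delete db;
      return 0;
    }
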
Thank you,
Eshcar