We are running fio benchmarks against a 3-node Ceph cluster with a 4 KB object size, and we use the gdbpmp profiler (https://github.com/markhpc/gdbpmp) to analyze thread performance. Based on the profiling report, I have two questions:
- In our setting, the bstore_kv_sync thread submits asynchronous RocksDB transactions 98% of the time (the remaining 2% are synchronous transactions). How does this align with Ceph's durability guarantees? What happens if the OSD fails after returning a success indication but before the WAL memory buffer is flushed to disk? Are you assuming the WAL buffer is flushed while the value is being written to the memtable? While that is reasonable, it cannot guarantee 100% durability. Am I missing something in the write path? (The first sketch after these questions shows the sync/async distinction I mean.)
- Each OSD has a single bstore_kv_sync thread and 16 tp_osd_tp threads. The bstore_kv_sync thread is always busy, while the tp_osd_tp threads are idle most of the time. Given that 3 of the RocksDB column families are sharded, and the sharding is configurable, why not run multiple (3) bstore_kv_sync threads, assuming they will rarely conflict? This has the potential to remove the RocksDB bottleneck and increase IOPS. (The second sketch below shows the kind of per-column-family parallelism I have in mind.)
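
To make the first question concrete, here is a minimal RocksDB sketch (not Ceph code; the path and keys are made up) of the distinction I am asking about: a write with WriteOptions::sync == false returns once the record is in the OS-buffered WAL and the memtable, while sync == true (or an explicit FlushWAL(true)) forces the WAL to stable storage before returning.

    #include <cassert>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::DB* db = nullptr;
      rocksdb::Options options;
      options.create_if_missing = true;

      // Hypothetical path, for illustration only.
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv_sync_demo", &db);
      assert(s.ok());

      // Asynchronous transaction: the WAL record goes to the OS page cache
      // and the value into the memtable, but no fsync/fdatasync is issued.
      // A power loss at this point can lose the write even though the call
      // returned OK -- this is the 98% case I see in bstore_kv_sync.
      rocksdb::WriteOptions async_opts;
      async_opts.sync = false;
      s = db->Put(async_opts, "key-1", "value-1");
      assert(s.ok());

      // Synchronous transaction: the WAL is synced before Put() returns, so
      // the write survives an OSD/node crash -- the 2% case.
      rocksdb::WriteOptions sync_opts;
      sync_opts.sync = true;
      s = db->Put(sync_opts, "key-2", "value-2");
      assert(s.ok());

      // Alternatively, a batch of async writes can be made durable
      // afterwards by forcing a WAL flush and sync in one call.
      s = db->FlushWAL(/*sync=*/true);
      assert(s.ok());

      delete db;
      return 0;
    }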
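And for the second question, a minimal sketch of the parallelism I have in mind: plain RocksDB already accepts concurrent writers, so one writer thread per sharded column family is possible in principle. Again this is not Ceph code; the column family names and path are invented for illustration, and all CFs still share a single WAL internally.

    #include <cassert>
    #include <string>
    #include <thread>
    #include <vector>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.create_missing_column_families = true;

      // Hypothetical shard names, for illustration only.
      std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs = {
          {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
          {"shard-0", rocksdb::ColumnFamilyOptions()},
          {"shard-1", rocksdb::ColumnFamilyOptions()},
          {"shard-2", rocksdb::ColumnFamilyOptions()}};

      rocksdb::DB* db = nullptr;
      std::vector<rocksdb::ColumnFamilyHandle*> handles;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv_shard_demo",
                                            cf_descs, &handles, &db);
      assert(s.ok());

      // One writer per shard, mimicking what multiple bstore_kv_sync
      // threads could look like when the sharded CFs rarely touch the same
      // keys.  RocksDB allows concurrent Put() calls; the threads only
      // serialize on the shared WAL/write path inside the library.
      std::vector<std::thread> writers;
      for (int shard = 1; shard <= 3; ++shard) {
        writers.emplace_back([&, shard]() {
          rocksdb::WriteOptions wopts;
          wopts.sync = false;  // same async mode as the 98% case above
          for (int i = 0; i < 1000; ++i) {
            std::string key = "key-" + std::to_string(i);
            rocksdb::Status ws = db->Put(wopts, handles[shard], key, "value");
            assert(ws.ok());
          }
        });
      }
      for (auto& t : writers) t.join();

      for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
      delete db;
      return 0;
    }
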
Thank you,
Eshcar