Another big CPU drain is debug ms = 1. We recently decided to disable
it by default in master since the overhead is so high. You can see that
PR here:
https://github.com/ceph/ceph/pull/26936
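A minimal sketch of switching it off explicitly in ceph.conf. Putting it under [global] is an assumption here, chosen so that it applies to clients such as librbd as well as to the daemons:

```ini
[global]
# log level / in-memory level for the messenger subsystem
debug ms = 0/0
```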
Okay, thanks. After disabling it, the latency difference between krbd and
librbd dropped slightly: it's now about 0.57ms (krbd) vs 0.63ms (librbd)
in my setup. Overall that's getting pretty good, since I'm approaching 0.5ms
latency... :)
I also tried patching librados so that it doesn't recalculate the PG's
OSDs for every operation. That helps too, but only slightly: it reduces
latency by 0.015ms :) (and it's probably only usable in small clusters with
a small number of PGs).
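To make the caching idea concrete, here is a rough sketch of what such a cache could look like. It is not the actual patch and does not use the real librados/Objecter API: PgId, OsdSet, PgMappingCache and the compute callback are all hypothetical names, and the callback merely stands in for the CRUSH calculation. The cache is dropped whenever the OSD map epoch changes, and it holds one entry per PG, which is why it would only be practical with a small number of PGs.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in types; none of this is the real librados API.
using PgId = std::pair<int64_t, uint32_t>;  // (pool id, placement seed)
using OsdSet = std::vector<int>;            // OSDs serving a PG

class PgMappingCache {
public:
    // `compute` stands in for the expensive CRUSH calculation.
    explicit PgMappingCache(std::function<OsdSet(const PgId&)> compute)
        : compute_(std::move(compute)) {}

    // Return the cached OSD set, recomputing only on a cache miss or
    // after the map epoch has changed.
    const OsdSet& lookup(const PgId& pg, uint32_t osdmap_epoch) {
        if (osdmap_epoch != epoch_) {
            cache_.clear();  // any map change may have moved PGs
            epoch_ = osdmap_epoch;
        }
        auto it = cache_.find(pg);
        if (it == cache_.end()) {
            it = cache_.emplace(pg, compute_(pg)).first;
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::function<OsdSet(const PgId&)> compute_;
    std::map<PgId, OsdSet> cache_;  // one entry per PG seen so far
    uint32_t epoch_ = 0;
};
```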
I still can't really understand what's making librados so slow... Is it
just the C++ code?.. :)
I can only see two things in the valgrind profiles: the "self" instruction
count for buffer::list::append and friends is 24%, and tcmalloc's is 15%.
The CRUSH calculation, which I had removed by caching it in my test, was
taking 5.7% in that same profile, so... if 5.7% corresponds to 0.015ms,
could 24% correspond to roughly 0.06ms? :)
It seems buffer::list::append is called many times, basically once for each
field of the output structure. Would it be better to allocate several
fields at once and fill them by simple assignment? Or am I just digging
in the wrong direction, and most of the overhead comes from copying the
original buffer (which is invisible in the profile)?
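To illustrate the two approaches being compared, here is a toy sketch. ToyBuffer is a deliberately simplified stand-in and not Ceph's buffer::list, the Header struct is hypothetical, and the fields are copied in host byte order, whereas a real wire encoding would fix the endianness. Both variants produce the same bytes; the only difference is how many append calls they make.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy stand-in for an append-only buffer; NOT Ceph's buffer::list.
struct ToyBuffer {
    std::vector<char> data;
    std::size_t append_calls = 0;

    void append(const void* src, std::size_t len) {
        const char* p = static_cast<const char*>(src);
        data.insert(data.end(), p, p + len);
        ++append_calls;
    }
};

// Hypothetical message header, used only for this illustration.
struct Header {
    uint64_t tid;
    uint32_t epoch;
    uint32_t flags;
    uint16_t op;
};

// Variant A: one append() call per field.
void encode_per_field(const Header& h, ToyBuffer& out) {
    out.append(&h.tid, sizeof h.tid);
    out.append(&h.epoch, sizeof h.epoch);
    out.append(&h.flags, sizeof h.flags);
    out.append(&h.op, sizeof h.op);
}

// Variant B: fill a fixed-size scratch area with plain copies,
// then hand it to the buffer with a single append() call.
void encode_batched(const Header& h, ToyBuffer& out) {
    char scratch[sizeof h.tid + sizeof h.epoch + sizeof h.flags + sizeof h.op];
    char* p = scratch;
    std::memcpy(p, &h.tid, sizeof h.tid);     p += sizeof h.tid;
    std::memcpy(p, &h.epoch, sizeof h.epoch); p += sizeof h.epoch;
    std::memcpy(p, &h.flags, sizeof h.flags); p += sizeof h.flags;
    std::memcpy(p, &h.op, sizeof h.op);
    out.append(scratch, sizeof scratch);
}
```

Whether the batched form is actually faster in librados depends on what each real append call costs, which this toy does not model.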
and the associated performance data:
https://docs.google.com/spreadsheets/d/1Zi3MFtvwLzCFfObL6evQKYtINQVQIjZ0SXczG78AnJM/edit?usp=sharing
Mark
--
With best regards,
Vitaliy Filippov