Hi,

>>I can only see two things in the valgrind profiles: the "self"
>>instruction count for buffer::list::append and friends is 24%, and
>>tcmalloc's is 15%. CRUSH calculation, which I had removed by caching it
>>in my test, was taking 5.7% in that same profile, so... maybe if 5.7%
>>stands for 0.015ms, could 24% stand for 0.075ms? :)

It could be interesting to test fio with jemalloc instead of tcmalloc:

# export LD_PRELOAD=${JEMALLOC_PATH}/lib/libjemalloc.so.1
# fio ....

In the past I had better results with it (in Proxmox we still use jemalloc
in qemu):
http://lists.ceph.com/pipermail/cbt-ceph.com/2015-May/000019.html


----- Original Message -----
From: "Vitaliy Filippov" <vitalif@xxxxxxxxxx>
To: "dillaman" <dillaman@xxxxxxxxxx>, "Mark Nelson" <mark.a.nelson@xxxxxxxxx>
Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>, ceph-devel-owner@xxxxxxxxxxxxxxx
Sent: Friday, 5 April 2019 12:45:16
Subject: Re: librados (librbd) slower than krbd

> Another big CPU drain is debug ms = 1. We recently decided to disable
> it by default in master since the overhead is so high. You can see that
> PR here:
>
> https://github.com/ceph/ceph/pull/26936

Okaaay, thanks. After disabling it, the latency difference between krbd
and librbd dropped slightly: it is now about 0.57ms (krbd) vs 0.63ms
(librbd) in my setup. Overall it's becoming not bad, since I'm approaching
0.5ms latency... :)

I also tried to make a patch for librados that stops it from recalculating
PG OSDs for every operation. It also helps, but only slightly, reducing
latency by about 0.015ms :) (and it's probably only usable in small
clusters with a small number of PGs).

I still can't really understand what's making librados so slow... Is it
just the C++ code?.. :)

I can only see two things in the valgrind profiles: the "self"
instruction count for buffer::list::append and friends is 24%, and
tcmalloc's is 15%. CRUSH calculation, which I had removed by caching it
in my test, was taking 5.7% in that same profile, so... maybe if 5.7%
stands for 0.015ms, could 24% stand for 0.075ms? :)

It seems buffer::list::append is called a lot of times, basically once for
each field of the output structure. Would it be better to allocate space
for several fields at once and fill them with simple assignments, or was I
just digging in the wrong direction, with most of the overhead coming from
the copying of the original buffer (which is invisible in the profile)?

> and the associated performance data:
>
> https://docs.google.com/spreadsheets/d/1Zi3MFtvwLzCFfObL6evQKYtINQVQIjZ0SXczG78AnJM/edit?usp=sharing
>
> Mark

--
With best regards,
  Vitaliy Filippov
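
A side note on the "debug ms" overhead discussed in the quoted mail: until
the linked PR changes the default, it has to be turned off explicitly on
the client running librbd/fio. A minimal ceph.conf sketch of the usual way
to do that (section placement and log levels may need adjusting for your
setup; the client has to be restarted to pick up the change):

[global]
        debug ms = 0/0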
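
On the buffer::list::append question at the end of the quoted mail, below
is a minimal, self-contained C++ sketch of the idea being asked about:
appending every field separately versus sizing the output once and filling
it with plain assignments/memcpy. It deliberately does not use Ceph's
bufferlist API; the Record struct, its field layout and the encode_*
helpers are hypothetical and ignore endianness, they only contrast the two
allocation patterns.

// Minimal C++ sketch: per-field append vs. one preallocated fill.
// NOT Ceph's bufferlist; Record and its layout are hypothetical.
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Record {              // hypothetical fixed-size header fields
    uint64_t object_id;
    uint32_t flags;
    uint32_t data_len;
};

// Variant 1: one append() per field. Every call re-checks capacity and
// may reallocate and copy, similar to many small bufferlist appends.
void encode_per_field(const Record &r, const std::vector<char> &data,
                      std::string &out) {
    out.append(reinterpret_cast<const char *>(&r.object_id), sizeof(r.object_id));
    out.append(reinterpret_cast<const char *>(&r.flags), sizeof(r.flags));
    out.append(reinterpret_cast<const char *>(&r.data_len), sizeof(r.data_len));
    out.append(data.data(), data.size());
}

// Variant 2: compute the total size up front, grow the buffer once,
// then fill the fields with direct memcpy into the reserved space.
void encode_preallocated(const Record &r, const std::vector<char> &data,
                         std::string &out) {
    const std::size_t header =
        sizeof(r.object_id) + sizeof(r.flags) + sizeof(r.data_len);
    const std::size_t old_size = out.size();
    out.resize(old_size + header + data.size());        // single growth
    char *p = &out[old_size];
    std::memcpy(p, &r.object_id, sizeof(r.object_id));  p += sizeof(r.object_id);
    std::memcpy(p, &r.flags,     sizeof(r.flags));      p += sizeof(r.flags);
    std::memcpy(p, &r.data_len,  sizeof(r.data_len));   p += sizeof(r.data_len);
    std::memcpy(p, data.data(),  data.size());
}

int main() {
    Record r{42, 0, 4096};
    std::vector<char> payload(r.data_len, 'x');
    std::string a, b;
    encode_per_field(r, payload, a);
    encode_preallocated(r, payload, b);
    return a == b ? 0 : 1;   // both variants produce identical bytes
}

The second variant touches the allocator once per record instead of once
per field, which is the kind of coalescing the question hints at; whether
that would actually move the 24% figure in the profile is something only a
measurement on the real librados encode path could tell.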