Hi,

>>I can only see two things in the valgrind profiles: the "self"
>>instruction count for buffer::list::append and friends is 24%, and
>>tcmalloc's is 15%. CRUSH calculation, which I had removed by caching it
>>in my test, was taking 5.7% in that same profile, so... maybe if 5.7%
>>stands for 0.015ms, could 24% stand for 0.075ms? :)

It could be interesting to test fio with jemalloc instead of tcmalloc:

# export LD_PRELOAD=${JEMALLOC_PATH}/lib/libjemalloc.so.1
# fio ....

In the past I had better results with it (in Proxmox we still use jemalloc
in qemu):
http://lists.ceph.com/pipermail/cbt-ceph.com/2015-May/000019.html


----- Original Message -----
From: "Vitaliy Filippov" <vitalif@xxxxxxxxxx>
To: "dillaman" <dillaman@xxxxxxxxxx>, "Mark Nelson" <mark.a.nelson@xxxxxxxxx>
Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>, ceph-devel-owner@xxxxxxxxxxxxxxx
Sent: Friday, 5 April 2019 12:45:16
Subject: Re: librados (librbd) slower than krbd

> Another big CPU drain is debug ms = 1. We recently decided to disable
> it by default in master since the overhead is so high. You can see that
> PR here:
>
> https://github.com/ceph/ceph/pull/26936

Okaaay, thanks. After disabling it, the latency difference between krbd
and librbd dropped slightly: it is now about 0.57ms (krbd) vs 0.63ms
(librbd) in my setup. Overall it's becoming not bad, since I'm approaching
0.5ms latency... :)

I also tried to make a patch for librados that stops it from recalculating
PG OSDs for every operation. It also helps, but only slightly, reducing
latency by about 0.015ms :) (and it's probably only usable in small
clusters with a small number of PGs).

I still can't really understand what's making librados so slow... Is it
just the C++ code?.. :)

I can only see two things in the valgrind profiles: the "self"
instruction count for buffer::list::append and friends is 24%, and
tcmalloc's is 15%. CRUSH calculation, which I had removed by caching it
in my test, was taking 5.7% in that same profile, so... maybe if 5.7%
stands for 0.015ms, could 24% stand for 0.075ms? :)

It seems buffer::list::append is called a lot of times, basically once for
each field of the output structure. Would it be better to allocate space
for several fields at once and fill them with simple assignments, or was I
just digging in the wrong direction, with most of the overhead coming from
the copying of the original buffer (which is invisible in the profile)?

> and the associated performance data:
>
> https://docs.google.com/spreadsheets/d/1Zi3MFtvwLzCFfObL6evQKYtINQVQIjZ0SXczG78AnJM/edit?usp=sharing
>
> Mark

--
With best regards,
  Vitaliy Filippov
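
A side note on the "debug ms" overhead discussed in the quoted mail: until
the linked PR changes the default, it has to be turned off explicitly on
the client running librbd/fio. A minimal ceph.conf sketch of the usual way
to do that (section placement and log levels may need adjusting for your
setup; the client has to be restarted to pick up the change):

[global]
        debug ms = 0/0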
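
On the buffer::list::append question at the end of the quoted mail, below
is a minimal, self-contained C++ sketch of the idea being asked about:
appending every field separately versus sizing the output once and filling
it with plain assignments/memcpy. It deliberately does not use Ceph's
bufferlist API; the Record struct, its field layout and the encode_*
helpers are hypothetical and ignore endianness, they only contrast the two
allocation patterns.

// Minimal C++ sketch: per-field append vs. one preallocated fill.
// NOT Ceph's bufferlist; Record and its layout are hypothetical.
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Record {              // hypothetical fixed-size header fields
    uint64_t object_id;
    uint32_t flags;
    uint32_t data_len;
};

// Variant 1: one append() per field. Every call re-checks capacity and
// may reallocate and copy, similar to many small bufferlist appends.
void encode_per_field(const Record &r, const std::vector<char> &data,
                      std::string &out) {
    out.append(reinterpret_cast<const char *>(&r.object_id), sizeof(r.object_id));
    out.append(reinterpret_cast<const char *>(&r.flags), sizeof(r.flags));
    out.append(reinterpret_cast<const char *>(&r.data_len), sizeof(r.data_len));
    out.append(data.data(), data.size());
}

// Variant 2: compute the total size up front, grow the buffer once,
// then fill the fields with direct memcpy into the reserved space.
void encode_preallocated(const Record &r, const std::vector<char> &data,
                         std::string &out) {
    const std::size_t header =
        sizeof(r.object_id) + sizeof(r.flags) + sizeof(r.data_len);
    const std::size_t old_size = out.size();
    out.resize(old_size + header + data.size());        // single growth
    char *p = &out[old_size];
    std::memcpy(p, &r.object_id, sizeof(r.object_id));  p += sizeof(r.object_id);
    std::memcpy(p, &r.flags,     sizeof(r.flags));      p += sizeof(r.flags);
    std::memcpy(p, &r.data_len,  sizeof(r.data_len));   p += sizeof(r.data_len);
    std::memcpy(p, data.data(),  data.size());
}

int main() {
    Record r{42, 0, 4096};
    std::vector<char> payload(r.data_len, 'x');
    std::string a, b;
    encode_per_field(r, payload, a);
    encode_preallocated(r, payload, b);
    return a == b ? 0 : 1;   // both variants produce identical bytes
}

The second variant touches the allocator once per record instead of once
per field, which is the kind of coalescing the question hints at; whether
that would actually move the 24% figure in the profile is something only a
measurement on the real librados encode path could tell.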