Just to be clear, did you run with "direct=1" on your fio tests? I
would also recommend disabling the librbd in-memory cache for random
IO tests against fast storage ("rbd cache = false"). Semi-recent
performance tests of librbd vs krbd against a single NVMe-backed OSD
showed 4K random write performance at around 75K IOPS for krbd and
90K IOPS for librbd -- but at the expense of nearly 4x more
client-side CPU for librbd. krbd pre-allocates a lot of its memory
up-front in slab allocation pools and zero-copies as much as
possible. librbd/librados are heavily hit with numerous C++
small-object heap allocations and the related initialize/copy
operations.

On Mon, Apr 1, 2019 at 3:06 PM Виталий Филиппов <vitalif@xxxxxxxxxx> wrote:
>
> Interesting, thanks... but how does the krbd driver handle it?
>
> Also, it doesn't seem to be a big bottleneck with small writes; at
> least I don't see ceph::buffer::copy in the valgrind and perf
> profiles...
>
> On April 1, 2019 21:25:42 GMT+03:00, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>
>> The C librbd/librados APIs currently need to copy the provided user
>> buffers. There is a goal to remove this unnecessary copy once the
>> underlying issue that necessitates it is addressed, but in the
>> meantime, CPU flame graphs will highlight that copy as a major
>> consumer of CPU for larger IOs [1]. There is also a lot of
>> additional memory allocation and lock contention occurring in the
>> user-space libraries that will also impact CPU and wall-clock time
>> usage.
>>
>> On Mon, Apr 1, 2019 at 2:08 PM <vitalif@xxxxxxxxxx> wrote:
>>>
>>> Hello,
>>>
>>> I've recently benchmarked random writes into the same RBD image in an
>>> all-flash cluster using `fio -ioengine=librbd` and `fio
>>> -ioengine=libaio` against the krbd-mapped /dev/rbd0.
>>>
>>> The result was that with iodepth=1 librbd gave ~0.86 ms latency and
>>> krbd gave ~0.74 ms; with iodepth=128 librbd gave ~9900 iops and krbd
>>> gave ~17000 iops. That's a huge difference; it basically means a lot
>>> of performance is wasted on the client side.
>>>
>>> It also seems the performance impact does not come from librbd
>>> itself but directly from librados, because our ceph-bench /
>>> ceph-gobench tools give almost identical write IOPS to librbd.
>>>
>>> My question is: could anyone make a guess about what is consuming
>>> so much CPU time in librados compared to the kernel RADOS client?
>>>
>>> I tried to profile it with valgrind; from the valgrind profiles it
>>> seems it's mostly ceph::buffer::list::append and friends. Could
>>> that be the culprit?
>>>
>>> --
>>> With best regards,
>>> Vitaliy Filippov
>>
>>
>> [1] https://github.com/ceph/ceph/pull/25689#issuecomment-472271162
>
>
> --
> With best regards,
> Vitaliy Filippov

--
Jason
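
For reference, the kind of test being discussed above (4K random
writes with direct IO against both the userspace and kernel clients,
with "rbd cache = false" set in the [client] section of the client's
ceph.conf) could be run roughly as follows. This is only a sketch: the
pool, image, and client names ("rbd", "bench", "admin") are
placeholders, and note that fio's userspace RBD engine is registered
as "rbd":

# userspace path: fio drives the image through librbd/librados
fio -name=librbd-randwrite -ioengine=rbd -clientname=admin \
    -pool=rbd -rbdname=bench \
    -rw=randwrite -bs=4k -iodepth=128 -runtime=60 -time_based

# kernel path: map the image and drive the block device with libaio;
# direct=1 bypasses the page cache so the two runs are comparable
rbd map rbd/bench
fio -name=krbd-randwrite -ioengine=libaio -filename=/dev/rbd0 -direct=1 \
    -rw=randwrite -bs=4k -iodepth=128 -runtime=60 -time_based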
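
To make the buffer-copy point concrete, here is a minimal sketch (not
code from this thread) of a 4K write through the C librbd API; the
pool and image names ("rbd", "bench") are again placeholders. The
caller hands librbd a plain char buffer and, per the discussion above,
the library currently copies it into an internal bufferlist before
dispatching the write, which is part of the extra client-side CPU cost
relative to krbd:

/* Minimal 4K write through the C librbd API.
 * Build: cc rbd_write.c -lrbd -lrados */
#include <rados/librados.h>
#include <rbd/librbd.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    rbd_image_t image;
    rbd_completion_t comp;
    char buf[4096];

    memset(buf, 0xab, sizeof(buf));

    /* connect using the default ceph.conf and client.admin keyring */
    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "failed to connect to cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "rbd", &ioctx) < 0 ||
        rbd_open(ioctx, "bench", &image, NULL) < 0) {
        fprintf(stderr, "failed to open image\n");
        rados_shutdown(cluster);
        return 1;
    }

    /* The caller owns 'buf'; librbd currently copies it into an
     * internal bufferlist before sending the write to the OSDs. */
    rbd_aio_create_completion(NULL, NULL, &comp);
    rbd_aio_write(image, 0, sizeof(buf), buf, comp);
    rbd_aio_wait_for_complete(comp);
    printf("write returned %zd\n", rbd_aio_get_return_value(comp));
    rbd_aio_release(comp);

    rbd_close(image);
    rados_ioctx_destroy(ioctx);
    rados_shutdown(cluster);
    return 0;
}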