Re: [RFC] Performance experiments with Ceph messenger

Hi Piotr,

What I've noticed in the 'perf report' output is a huge number of reference
count increments/decrements for buffer::ptr objects.  So it seems allocations
and copies are not the only issue; there is also a huge number of atomic
operations on hot paths.

This is something you can't really avoid, because that's an artifact
of buffer interface abuse. Most commonly this happens when moving
bufferptrs between bufferlists or sharing them with other bufferlists.
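
To make the cost concrete, here is a minimal sketch of the pattern (a
simplified analogue, not the real buffer::ptr/bufferlist code): every time
a refcounted buffer is copied into another list, the shared refcount is
bumped atomically and dropped atomically again later.

    #include <atomic>
    #include <vector>

    struct raw {                        // shared, refcounted storage
        std::atomic<int> nref{0};
        char data[4096];
    };

    struct ptr {                        // simplified analogue of buffer::ptr
        raw* r;
        explicit ptr(raw* rp) : r(rp) { r->nref.fetch_add(1); }  // atomic inc
        ptr(const ptr& o) : r(o.r)     { r->nref.fetch_add(1); }  // atomic inc
        ~ptr() { if (r->nref.fetch_sub(1) == 1) delete r; }       // atomic dec
    };

    int main() {
        std::vector<ptr> list_a, list_b;  // stand-ins for two bufferlists
        ptr p(new raw);
        list_a.push_back(p);              // atomic inc
        list_b.push_back(p);              // another inc, on the hot path
    }                                     // matching decs on destruction

At a high message rate, every such hop of a buffer through the messenger
shows up as refcount traffic in perf.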

At least we can try to minimize it for the send/recv path in the messenger,
step by step. If you take a look at ProtocolV1::write_event() you will see a
temporary bufferlist, to which message buffers are then appended three times
in prepare_send_message(). Then inside write_message() the same temporary
buffer is appended to outcoming_bl, collecting all the buffers together.
At high IOPS rates this kills performance.  I targeted exactly this issue
in https://github.com/rouming/ceph/commit/90831f241bc by collecting all the
buffers in an iovec instead of appending them to a bufferlist. This is a
cheap and doable optimization which brings some profit.
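
Roughly, the direction is the following (a simplified sketch, not the actual
commit; the helper name and parameters below are made up for illustration):
gather pointers to the already existing buffers into an iovec and hand them
to the kernel in a single sendmsg() call, instead of first appending them to
yet another bufferlist.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    // hypothetical helper: send header, payload and footer without first
    // copying/appending them into an intermediate bufferlist
    ssize_t send_message(int fd,
                         const void* hdr,  size_t hdr_len,
                         const void* data, size_t data_len,
                         const void* ftr,  size_t ftr_len)
    {
        struct iovec iov[3];
        iov[0].iov_base = const_cast<void*>(hdr);  iov[0].iov_len = hdr_len;
        iov[1].iov_base = const_cast<void*>(data); iov[1].iov_len = data_len;
        iov[2].iov_base = const_cast<void*>(ftr);  iov[2].iov_len = ftr_len;

        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov    = iov;
        msg.msg_iovlen = 3;

        // one syscall, no extra buffer appends or copies in userspace
        return sendmsg(fd, &msg, MSG_NOSIGNAL);
    }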

Over a year ago Red Hat started to rewrite large parts of the Ceph code to
utilize the Seastar framework, which gave hope for the above to change. But
it's unclear whether that will be the case, what performance gains users
should expect, and whether it'll require users to redesign their clusters
or do other costly labor. On the other hand, such a large rewrite puts a
huge question mark over any performance improvement work, as nobody can
tell you with 100% certainty that your work won't be dropped during the
transition to Seastar.

Did anyone try to run approximate tests, putting Seastar in place of
msg/async?  It should not take a huge amount of work before the first
results can shed some light and answer the question: is it worth doing or
not?  Seems like a couple of weeks, no?

Yes, they did a prototype: https://github.com/cohortfsllc/crimson
But I don't remember seeing any perf numbers (my memory might be failing me, though).

Hm, is this the correct repo?  It seems nearly empty (24 commits) and has
not been touched for a year (other parts for 3 years).  The implementation
also looks like just stubs.

But even if Seastar replaces all the messenger internals, it won't replace
bufferlist and the whole message allocation strategy without deep
refactoring.  Or am I mistaken here?  What I mean is that this can be
a bottleneck even with an imaginary zero-latency IO library.

That's true, and if I sound skeptical, it's because of this.
But I'm still looking forward to it, as by far the greatest CPU and
memory bandwidth consumer is copying data out of kernel land and back into
kernel land in replicated object write scenarios (read from socket,
process, write to two other sockets, write to device). Seastar + SPDK
+ DPDK is meant to reduce these memory copies, as everything would
happen in userspace and no transition to kernelspace would be necessary.
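
In syscall terms that replicated write path looks roughly like this (a
simplified illustration, not actual OSD code); each call below moves the
same payload across the user/kernel boundary once more, which is exactly
what a full userspace stack is supposed to avoid:

    #include <sys/socket.h>
    #include <unistd.h>

    void replicate_write(int client_fd, int replica1_fd, int replica2_fd,
                         int device_fd, char* buf, size_t len)
    {
        recv(client_fd, buf, len, MSG_WAITALL);     // copy 1: kernel -> user
        // ... process/checksum the payload in userspace ...
        send(replica1_fd, buf, len, MSG_NOSIGNAL);  // copy 2: user -> kernel
        send(replica2_fd, buf, len, MSG_NOSIGNAL);  // copy 3: user -> kernel
        write(device_fd, buf, len);                 // copy 4: user -> kernel
    }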

My current concern is the Ceph RDMA implementation, which claims exactly the
same: direct access to the hardware.  But because of all the things the
messenger does along the way from request submission to issuing it to the
kernel or hardware, sadly RDMA does not show the desired numbers.

Besides, careful, optimal buffer handling is difficult, *especially* in
complex multi-threaded software like Ceph, and it's way less fun than, for
example, using tons of language features to get rid of integer divisions.

I just want to minimize obviously costly things like allocations, atomic
ops and sendmsg() syscalls; no rocket science here.  That especially makes
sense for the RDMA transport layer, which in my fio tests does not show any
performance gain, because for each IO the CPU is busy doing a lot of other
things before the request reaches the actual hardware.

As Gregory already pointed out, Ceph is far past the point where
everything was nice and simple: bufferptrs received from the messenger
may later be referenced by tons of other subsystems and threads in
seemingly unrelated bufferlists. Of course you're free to have a stab
at this, but if it were so obvious and simple, someone would have done
it a long time ago.

Putting all the buffers in an iovec seems obvious and simple to me. Please
correct me if I am wrong or missing something.

--
Roman




