Re: [RFC] Performance experiments with Ceph messenger

On 2018-11-13 11:50 a.m., Roman Penyaev wrote:
Hi Piotr,

What I've noticed in the 'perf report' output is a huge amount of reference
increases/decreases on buffer::ptr objects.  So it seems allocations and
copies are not the only issue; there is also a huge number of atomic
operations on hot paths.

This is something you can't really avoid, because that's an artifact
of buffer interface abuse. Most commonly this happens when moving
bufferptrs between bufferlists or sharing them with other bufferlists.
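A minimal stand-in sketch of the refcount traffic being described (this is
not the real ceph::buffer API - std::shared_ptr is only used here as an
analogue for the refcounted buffer::ptr, and the list types are made up):

#include <memory>
#include <vector>

struct raw_buf {                              // stand-in for buffer::raw
  std::vector<char> data;
  explicit raw_buf(size_t len) : data(len) {}
};

using buf_ptr  = std::shared_ptr<raw_buf>;    // stand-in for buffer::ptr
using buf_list = std::vector<buf_ptr>;        // stand-in for buffer::list

int main() {
  buf_ptr payload = std::make_shared<raw_buf>(4096);

  buf_list front, middle, data;               // per-message segment lists
  front.push_back(payload);                   // atomic ref++
  middle.push_back(payload);                  // atomic ref++
  data.push_back(payload);                    // atomic ref++

  buf_list outgoing;                          // messenger's outgoing queue
  for (const buf_list *bl : {&front, &middle, &data})
    for (const auto &p : *bl)
      outgoing.push_back(p);                  // another atomic ref++ per copy

  // Every list going out of scope pays the matching atomic ref-- as well,
  // which is the inc/dec churn that shows up in perf.
}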

At least we can try to minimize it for the send/recv path in the messenger.
Step by step.  If you take a look at ProtocolV1::write_event() you will see
a temporary bufferlist, to which the message buffers are then appended three
times in prepare_send_message().  Then inside write_message() the same
temporary buffer is appended to outcoming_bl, collecting all the buffers
together.  At high IOPS rates this kills performance.  I targeted exactly this
issue in https://github.com/rouming/ceph/commit/90831f241bc by collecting all
buffers in an iovec instead of appending them to a bufferlist.  This is a
fairly cheap and doable optimization which brings some profit.
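A rough sketch of that idea in plain POSIX terms (this is not Roman's actual
patch and not the real Ceph types; segment and send_segments() are invented
here for illustration):

#include <sys/socket.h>
#include <sys/uio.h>
#include <cstring>
#include <vector>

struct segment {        // stand-in for a message's header/front/middle/data pieces
  const char *base;
  size_t      len;
};

// Record (pointer, length) pairs instead of appending every piece to one big
// outgoing bufferlist, then hand them all to the kernel in a single syscall.
ssize_t send_segments(int fd, const std::vector<segment> &segs) {
  std::vector<struct iovec> iov(segs.size());
  for (size_t i = 0; i < segs.size(); ++i) {
    iov[i].iov_base = const_cast<char *>(segs[i].base);  // no copy, no append
    iov[i].iov_len  = segs[i].len;
  }

  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));
  msg.msg_iov    = iov.data();
  msg.msg_iovlen = iov.size();

  return ::sendmsg(fd, &msg, MSG_NOSIGNAL);   // one syscall for all pieces
}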

That's why I called it "abuse" :)
In any case, you're free to make an actual pull request.

Over a year ago Red Hat started to rewrite large parts of the Ceph code to
utilize the Seastar framework, which gave hope for the above to change. But
it's unclear whether that'll be the case, what performance gains users should
expect, and whether it'll require users to redesign their clusters or do
other costly labor. On the other hand, such a large rewrite puts a huge
question mark on any performance improvement work, as nobody can tell you
with 100% certainty that your work won't be dropped during the transition
to Seastar.

Did anyone try to run approximate tests with Seastar in place of msg/async?
It should not be a huge amount of work before the first results can shed
light on and answer the question: is it worth doing or not.  Seems like a
couple of weeks, no?

Yes, they did a prototype: https://github.com/cohortfsllc/crimson
But I don't remember seeing any perf numbers (I might have amnesia, though).

Hm, is this the correct repo?  It looks nearly empty (24 commits) and has not
been touched for a year (other parts for 3 years).  The implementation also
looks like just stubs.

Because this was the prototype they were working on. It has several branches; have you seen them?
Besides, careful, optimal buffer handling is difficult, *especially* in
complex multi-threaded software like Ceph, and it's way less fun than, for
example, using tons of language features to get rid of integer divisions.

I just want to minimize obviously costly things like allocations, atomic
ops and sendmsg() syscalls; no rocket science here.  That especially makes
sense for the RDMA transport layer, which in my fio tests does not show any
performance gain, because for each IO the CPU is busy doing a lot of other
things before reaching the actual hardware.

As Gregory already pointed out, Ceph is far past the point where
everything was nice and simple - bufferptrs received from the messenger
may later be referenced by tons of other subsystems and threads in
seemingly unrelated bufferlists. Of course you're free to have a stab
at this, but if it were so obvious and simple, someone would have done
it a long time ago.

Putting all buffers into an iovec seems obvious and simple.  Please correct
me if I am wrong or missing something.

You have to ensure that all data referenced in the iovec remains unchanged for the entire duration of the sendmsg() call - which may return before all data is written, or be preempted during the write. If another thread does something to the data being processed by sendmsg(), you'll end up with crashes or data corruption.
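A hypothetical sketch of that lifetime issue (not Ceph code): sendmsg() can
send only part of the iovec, so the caller has to keep every referenced
buffer alive and untouched until the loop below has written everything out.

#include <sys/socket.h>
#include <sys/uio.h>
#include <cstring>
#include <vector>

// Keeps calling sendmsg() until every byte of every iovec entry has gone out.
// The iovec is adjusted in place; the memory it points at must not be freed
// or modified by another thread for the whole duration of this call.
ssize_t send_all(int fd, std::vector<struct iovec> iov) {
  ssize_t total = 0;
  size_t idx = 0;
  while (idx < iov.size()) {
    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov    = iov.data() + idx;
    msg.msg_iovlen = iov.size() - idx;

    ssize_t n = ::sendmsg(fd, &msg, MSG_NOSIGNAL);
    if (n < 0)
      return -1;                       // real code would handle EAGAIN etc.
    total += n;

    // Skip fully-sent entries, then trim the partially-sent one.
    while (idx < iov.size() && n >= (ssize_t)iov[idx].iov_len) {
      n -= iov[idx].iov_len;
      ++idx;
    }
    if (idx < iov.size() && n > 0) {
      iov[idx].iov_base = static_cast<char *>(iov[idx].iov_base) + n;
      iov[idx].iov_len -= n;
    }
  }
  return total;
}

In the messenger the equivalent would be holding a reference to every
underlying buffer (e.g. the bufferptrs themselves) until such a loop finishes.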

--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovhcloud.com


