Re: [RFC] Performance experiments with Ceph messenger

On 2018-11-11 10:17 p.m., Roman Penyaev wrote:
Hi Piotr,

On 2018-11-09 11:07, Piotr Dalek wrote:
> [..]
2. Perf counters (or rather, the amount of them) are a real issue; so
far nobody has decided to do any kind of sanity audit, and thus it's
unclear if all of them are really necessary/useful. Usually when
someone decides to add them, they just issue a PR and soon they're
included without any further discussion. For sure at least some of the
counters shouldn't be enabled in production-class binaries.
3 and 4. You're absolutely on point here: preallocating and reusing
data structures is the way to go. Back in 2015 everyone agreed that
Ceph has really bad memory management, with lots of allocations,
deallocations and memory block copies across different paths. The
only "big" fixes that people came up with were increasing the TCMalloc
cache to 128MB (up from the default 32MB) and replacing TCMalloc with
Jemalloc, which with recent releases is impossible
(http://tracker.ceph.com/issues/20557). Fixing the code to use
proper memory management strategies is difficult in Ceph's case
because of the Bufferlist class, which makes it easy to do complex
things, but not necessarily in a high-performance way.
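
(To make the "preallocate and reuse" point concrete, here is a minimal
sketch of a free-list style buffer pool; the names are made up for
illustration and this is not actual Ceph code:)

#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Minimal buffer pool sketch: buffers are allocated once up front and
// recycled instead of being heap-allocated and freed for every message.
class BufferPool {
public:
  BufferPool(size_t buf_size, size_t count) : buf_size_(buf_size) {
    for (size_t i = 0; i < count; ++i)
      free_.push_back(std::make_unique<char[]>(buf_size_));
  }

  // Take a preallocated buffer; fall back to a fresh allocation only
  // when the pool is exhausted.
  std::unique_ptr<char[]> get() {
    std::lock_guard<std::mutex> lock(mtx_);
    if (free_.empty())
      return std::make_unique<char[]>(buf_size_);
    auto buf = std::move(free_.back());
    free_.pop_back();
    return buf;
  }

  // Return a buffer for reuse instead of freeing it.
  void put(std::unique_ptr<char[]> buf) {
    std::lock_guard<std::mutex> lock(mtx_);
    free_.push_back(std::move(buf));
  }

private:
  const size_t buf_size_;
  std::mutex mtx_;
  std::vector<std::unique_ptr<char[]>> free_;
};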

What I've noticed in 'perf report' output is a huge number of reference
count increments/decrements on buffer::ptr objects.  So it seems allocations
and copies are not the only issue; there is also a huge number of atomic
operations on hot paths.

This is something you can't really avoid, because that's an artifact of buffer interface abuse. Most commonly this happens when moving bufferptrs between bufferlists or sharing them with other bufferlists.
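
(For the record, the cost Roman is seeing is the same one you pay with any
refcounted handle; a rough std::shared_ptr analogy rather than buffer::ptr
itself:)

#include <memory>
#include <utility>
#include <vector>

struct Msg { char data[4096]; };

// Rough analogy for refcounted buffer handles on hot paths
// (std::shared_ptr here, buffer::ptr in Ceph).
void fan_out(std::shared_ptr<Msg> m,
             std::vector<std::shared_ptr<Msg>>& queue_a,
             std::vector<std::shared_ptr<Msg>>& queue_b) {
  // Sharing copies the handle, which bumps the atomic refcount.
  queue_a.push_back(m);              // atomic increment
  // Moving transfers ownership without touching the refcount.
  queue_b.push_back(std::move(m));   // no atomic op
}
// Every copy above (and the later destruction of every copy) is an
// atomic operation; spread across a hot path, that is exactly what
// shows up in 'perf report'.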

Over a year ago Red Hat started to rewrite large parts of the Ceph code to
utilize the Seastar framework, which gave hope for the above to change. But
it's unclear whether that'll be the case, what performance gains users
should expect, and whether it'll require users to redesign their
clusters or do other costly work. On the other hand, such a large rewrite
puts a huge question mark over any performance improvement work, as
nobody can tell you with 100% certainty that your work won't be
dropped during the transition to Seastar.

Did anyone try to do approximate tests, putting Seastar in place of msg/async?
It shouldn't take a huge amount of work before the first results shed
some light and answer the question: is it worth doing or not?  Seems
like a couple of weeks, no?

Yes, they did a prototype: https://github.com/cohortfsllc/crimson
But I don't remember seeing any perf numbers (my memory might be failing me, though).

But even if Seastar replaces all the messenger internals, it won't replace
bufferlist and the whole message allocation strategy without deep
refactoring.  Or am I mistaken here?  What I mean is that this can
remain a bottleneck even with an imaginary zero-latency IO library.

That's true, and if I sound skeptical, it's because of this.
But I'm still looking forward to it, as by far the greatest CPU and memory bandwidth consumer is copying data out of the kernel and back into the kernel in replicated object write scenarios (read from socket, process, write to two other sockets, write to device). Seastar + SPDK + DPDK is meant to reduce these memory copies, as everything would happen in userspace and no transition to kernelspace would be necessary.
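
(To spell out where those copies are with today's kernel sockets, here is a
stripped-down sketch of the replicated-write path using plain POSIX calls;
every recv()/send()/write() below crosses the user/kernel boundary and
copies the payload:)

#include <sys/socket.h>
#include <unistd.h>
#include <vector>

// Stripped-down replicated-write data path over kernel sockets.
// Each syscall copies the payload between user and kernel space,
// which is the overhead a userspace stack tries to eliminate.
ssize_t replicate_write(int client_fd, int replica1_fd, int replica2_fd,
                        int disk_fd, size_t len) {
  std::vector<char> buf(len);
  ssize_t n = recv(client_fd, buf.data(), buf.size(), MSG_WAITALL); // copy #1: kernel -> user
  if (n <= 0)
    return n;
  // (process / checksum the object here)
  send(replica1_fd, buf.data(), n, 0);   // copy #2: user -> kernel
  send(replica2_fd, buf.data(), n, 0);   // copy #3: user -> kernel
  return write(disk_fd, buf.data(), n);  // copy #4: user -> kernel (page cache)
}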

Besides, careful, optimal buffer handling is *especially* difficult in
complex multi-threaded software like Ceph, and it's way less fun than,
for example, using tons of language features to get rid of integer
divisions.

I just want to minimize obviously costly things like allocations, atomic
ops and sendmsg() syscalls; no rocket science here.  That especially
makes sense for the RDMA transport layer, which in my fio tests does not
show any performance gain, because for each IO the CPU is busy doing a
lot of other things before the request ever reaches the actual hardware.

As Gregory already pointed out, Ceph is far past the point where everything was nice and simple: bufferptrs received from the messenger may later be referenced by tons of other subsystems and threads in seemingly unrelated bufferlists. Of course you're free to have a stab at this, but if it were so obvious and simple, someone would have done it a long time ago.
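
(To be fair, the syscall part of that is simple in isolation; on Linux you
can batch several queued messages into one sendmmsg() call, as in the sketch
below. The hard part is the buffer ownership and lifetimes around it, not
the syscall itself. The OutMsg struct and flush_batch() are made-up names
for illustration:)

#include <sys/socket.h>   // sendmmsg() is Linux-specific (_GNU_SOURCE)
#include <vector>

// One already-serialized outgoing message.
struct OutMsg { void* data; size_t len; };

// Flush a whole batch of queued messages with a single syscall
// instead of issuing one sendmsg() per message.
int flush_batch(int fd, std::vector<OutMsg>& pending) {
  std::vector<mmsghdr> hdrs(pending.size());   // value-initialized (zeroed)
  std::vector<iovec> iovs(pending.size());
  for (size_t i = 0; i < pending.size(); ++i) {
    iovs[i].iov_base = pending[i].data;
    iovs[i].iov_len  = pending[i].len;
    hdrs[i].msg_hdr.msg_iov    = &iovs[i];
    hdrs[i].msg_hdr.msg_iovlen = 1;
  }
  // One user/kernel transition for the whole batch.
  return sendmmsg(fd, hdrs.data(), static_cast<unsigned int>(hdrs.size()), 0);
}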

--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovhcloud.com


