Re: [RFC] Performance experiments with Ceph messenger

On Mon, Nov 12, 2018 at 3:28 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>
> Hi Gregory,
>
> On 2018-11-09 22:23, Gregory Farnum wrote:
> > On Thu, Nov 8, 2018 at 7:55 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
> >
> >> Hi all,
> >>
> >> I would like to share some performance experiments based on a new fio
> >> engine [1], which tests the bare Ceph messenger without any disk IO or
> >> libraries involved, only the messenger and subsidiary classes from
> >> src/msg/async/*.
> >>
> >> Firstly I would like to say that I am completely new to Ceph and
> >> started investigating performance numbers without any clue or deep
> >> understanding of the messenger framework; the idea was to use a
> >> profiler and then apply hacky quick fixes in order to squeeze
> >> everything out of bandwidth and latency and send messages as fast as
> >> possible.
> >>
> >> Without any changes applied, based on the latest master, with only one
> >> line in ceph.conf (ms_type=async+posix) and with the following fio
> >> config:
> >>
> >>    bs=4k
> >>    size=3g
> >>    iodepth=128
> >>
> >
> > The FIO messenger test is interesting, but I think the few tests we've
> > seen out of it have been pretty out of whack with other existing
> > tests; Ricardo might know more.
> >
> >
> >
> >>
> >> I got reference numbers on a loopback connection with the async+posix
> >> messenger:
> >>
> >>     IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
> >>
> >> I have to mention that at this point of the measurements the fio
> >> engine allocates and deallocates MOSDOp and MOSDOpReply for each IO,
> >> repeating the behavior of other Ceph components.  A bit later I will
> >> provide fio numbers when all messages (requests and replies) are
> >> cached, i.e. the whole queue of fixed size is preallocated.
> >>
> >> 1. First profiling showed that a lot of time is spent in crc
> >> calculation for data and header.  The following lines are added to
> >> ceph.conf:
> >>
> >>    ms_crc_data=false
> >>    ms_crc_header=false
> >>
> >> And the following are the fio numbers after the new config is applied:
> >>
> >>    before: IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
> >>     after: IOPS=110k,  BW=429MiB/s, Lat=1.164ms
> >>
> >> My first question: what is the reason to calculate and then check crc
> >> for messages which are sent over a reliable network?  ~100MiB/s and
> >> ~30k IOPS is quite a high price for making an already reliable
> >> connection "more reliable".
> >>
> >
> > We have seen network errors pretty consistently when we disable our
> > own internal CRCs. I'm not sure if we collect numbers on the failures
> > we see in practical use, but without data this is not a configuration
> > option we can change for real users. :(
>
> Can it be that crc32 is disabled at the ethernet data layer or on the
> NICs in the networks where you saw these errors?  Otherwise this is
> weird.

I'm pretty sure that's not the case. Ethernet/TCP crc checksums are
just not as strong as people want to believe they are in comparison to
the amount of data we transmit. :(


> >> 2. The other thing which can be improved is throttling: by default,
> >> on each IO Throttle::get_or_fail() is called, which invokes
> >> pthread_mutex_lock() and PerfCounters::inc().  Since there is no
> >> contention on the mutex, I suspect CPU cache misses.  When the
> >> following line is applied to ceph.conf:
> >>
> >>    ms_dispatch_throttle_bytes=0
> >>
> >> Fio shows these numbers:
> >>
> >>    before: IOPS=110k, BW=429MiB/s, Lat=1.164ms
> >>     after: IOPS=114k, BW=444MiB/s, Lat=1.125ms
> >>
> >> And the following is the output of `perf stat`:
> >>
> >> Before:
> >>
> >>      13057.583388      task-clock:u (msec)       #    1.609 CPUs utilized
> >>    26,042,061,515      cycles:u                  #    1.994 GHz                      (57.77%)
> >>    40,643,744,150      instructions:u            #    1.56  insn per cycle           (71.79%)
> >>       815,662,912      cache-references:u        #   62.467 M/sec                    (71.64%)
> >>        12,926,237      cache-misses:u            #    1.585 % of all cache refs      (70.93%)
> >>    12,695,706,281      L1-dcache-loads:u         #  972.286 M/sec                    (70.87%)
> >>       455,625,889      L1-dcache-load-misses:u   #    3.59% of all L1-dcache hits    (71.30%)
> >>     8,263,315,484      L1-dcache-stores:u        #  632.837 M/sec                    (57.49%)
> >>
> >> After:
> >>
> >>      12516.889311      task-clock:u (msec)       #    1.631 CPUs utilized
> >>    24,987,047,978      cycles:u                  #    1.996 GHz                      (57.01%)
> >>    40,072,709,633      instructions:u            #    1.60  insn per cycle           (71.49%)
> >>       792,468,416      cache-references:u        #   63.312 M/sec                    (70.94%)
> >>         8,494,440      cache-misses:u            #    1.072 % of all cache refs      (71.60%)
> >>    12,424,744,615      L1-dcache-loads:u         #  992.638 M/sec                    (71.82%)
> >>       438,946,415      L1-dcache-load-misses:u   #    3.53% of all L1-dcache hits    (71.39%)
> >>     8,199,282,875      L1-dcache-stores:u        #  655.058 M/sec                    (57.24%)
> >>
> >>
> >> Overall cache-misses are slightly reduced along with cache-references,
> >> thus the rate of dcache-loads is increased: 992.638 M/sec against
> >> 972.286 M/sec.
> >>
> >> This performance drop on throttling does not depend on the actual
> >> value set for the ms_dispatch_throttle_bytes config option, because
> >> the fio_ceph_messenger engine uses only the fast dispatch path, so
> >> even if ms_dispatch_throttle_bytes=9999999999 is set this does not
> >> bring any visible effect.  But when the option is set to 0, execution
> >> follows another path and we immediately return from
> >> Throttle::get_or_fail() without any attempt to take locks or
> >> atomically increase counter values.  A future fix which could
> >> increase overall Ceph performance is to keep perf counters in
> >> thread-local storage, thus avoiding atomic ops.
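> >>
> >> Just to illustrate the idea, a rough sketch (all names here are made
> >> up, this is not the existing PerfCounters interface): the hot path
> >> does a plain non-atomic increment on a per-thread slot, and readers
> >> take a slow path which sums over all registered slots:
> >>
> >>    #include <cstdint>
> >>    #include <mutex>
> >>    #include <vector>
> >>
> >>    // One counter, e.g. "bytes dispatched".  inc() touches only a
> >>    // thread-local slot: no lock, no atomic RMW, no shared cache line.
> >>    namespace tls_counter {
> >>
> >>    struct Slot { alignas(64) uint64_t v = 0; };  // padded: no false sharing
> >>
> >>    inline std::mutex reg_lock;
> >>    inline std::vector<Slot*> slots;   // one slot per thread, never freed
> >>
> >>    inline Slot& my_slot() {
> >>      thread_local Slot* slot = [] {
> >>        auto* s = new Slot();          // leaked for brevity of the sketch
> >>        std::lock_guard<std::mutex> l(reg_lock);
> >>        slots.push_back(s);
> >>        return s;
> >>      }();
> >>      return *slot;
> >>    }
> >>
> >>    // fast path, called on each IO
> >>    inline void inc(uint64_t n = 1) { my_slot().v += n; }
> >>
> >>    // slow path, called rarely (perf dump and friends)
> >>    inline uint64_t read() {
> >>      std::lock_guard<std::mutex> l(reg_lock);
> >>      uint64_t sum = 0;
> >>      for (auto* s : slots) sum += s->v;
> >>      return sum;
> >>    }
> >>
> >>    }  // namespace tls_counter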
> >>
> >
> > Improving the perfcounters when they cause issues is definitely
> > something worth exploring!
> >
> >
> >>
> >> 3. As I mentioned earlier, it is worth reducing the
> >> allocation/deallocation rate of messages by preallocating them.  I did
> >> a small patch [2] and preallocate the whole queue on the client and
> >> server sides of the fio engine.  The other thing worth mentioning is
> >> that on message completion I do not call put(), but call a new
> >> completion callback (the completion hook does not fit well, since it
> >> is called in the destructor, i.e. again extra allocations/deletions
> >> for each IO).  It is always a good thing to reduce atomic incs/decs on
> >> a fast path, and since I fully control a message and never free it I
> >> do not need a quite expensive inc/get on each IO.  This is what I got:
> >>
> >>    before: IOPS=114k, BW=444MiB/s, Lat=1.125ms
> >>     after: IOPS=132k, BW=514MiB/s, Lat=0.973ms
> >>
> >> In the current ceph master branch sizeof(MOSDOp) is 824 bytes and
> >> sizeof(MOSDOpReply) is 848 bytes, which of course makes them worth
> >> keeping in a fixed size queue.
> >
> >
> > I am skeptical that this is actually a useful avenue. These messages
> > will always be accompanied by, you know, actual data IO which has a
> > good chance of dominating, and the relative numbers we need for
> > different message "front" buffer types are going to vary.
>
> I did not get that.  I am talking about the regular allocation of
> messages which are sent and then immediately freed on the first put().
> My question was: would it be efficient if I simply preallocate the
> whole queue?  Yes, it can bring ~70MB/s.  And by the queue I mean not
> data IO; I mean objects, messages, like MOSDOp or MOSDOpReply or any
> other containers which hold decoded/encoded data/payload and other
> buffers.  Of course on each IO you have to pick up a free message from
> the queue and init it with a data pointer, which comes from the layer
> above, i.e. from a user of the messenger.  For example the following
> is a chunk of code from the fio engine:
>
>    req_msg = static_cast<MOSDOp *>(io->msg);
>
>    /* No handy method to clear ops before reusage? Ok */
>    req_msg->ops.clear();
>    req_msg->write(0, io_u->buflen, buflist);
>    /* Keep message alive */
>    req_msg->get();
>    req_msg->get_connection()->send_message(req_msg);
>
> where req_msg is never destroyed; it always exists in the queue.
>
> In this example I use only one type of message, MOSDOp, but it can
> easily be changed to cover all types of messages; then you just
> construct/destruct any message on preallocated memory from the queue
> and save a new/delete cycle of ~1k for each IO.
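>
> To make the idea concrete, here is a rough sketch of such a fixed-size
> pool (all names are made up, this is not code from the patch): get()
> blocks when the pool is empty, which also gives the back pressure I
> mentioned, and put() returns a message for reuse instead of deleting
> it:
>
>    #include <condition_variable>
>    #include <cstddef>
>    #include <mutex>
>    #include <vector>
>
>    // Fixed-size pool of preallocated request objects.
>    template <typename Msg>
>    class MsgPool {
>      std::mutex lock;
>      std::condition_variable cond;
>      std::vector<Msg*> free_msgs;
>
>    public:
>      explicit MsgPool(size_t n) {
>        for (size_t i = 0; i < n; i++)
>          free_msgs.push_back(new Msg());  // allocated once, reused forever
>      }
>
>      Msg* get() {
>        std::unique_lock<std::mutex> l(lock);
>        cond.wait(l, [this] { return !free_msgs.empty(); });
>        Msg* m = free_msgs.back();
>        free_msgs.pop_back();
>        return m;                          // caller re-inits and sends it
>      }
>
>      void put(Msg* m) {
>        std::lock_guard<std::mutex> l(lock);
>        free_msgs.push_back(m);
>        cond.notify_one();
>      }
>    };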

Okay, so there's a very specific pattern in the fio messenger testing
engine that doesn't apply elsewhere in Ceph: namely, constant
transmission of the same message type with the same data contents, so
that message allocation is a major driver of performance. This type of
microbenchmarking is a really good way to identify bottlenecks within a
particular part of your system, but it has a major systemic weakness:
you can't tell if the part of the system you are microbenchmarking is
actually a driver of overall system performance.

Now, it's true: we could save the allocation and deallocation of an
MOSDOp structure if we maintain a slab cache or similar. But:
1) we now have another slab cache floating around, which adds complexity,
2) we have no memory locality at all (including which processor owns
the memory!) between the MOSDOp and the actual data, and
3) we have to deal with the slab cache being under- or over-sized.

I don't think we have any evidence the actual Message struct
allocation/deallocation (as opposed to the data or other bufferlists)
is an important issue driving Ceph performance, and the costs of
removing it are nontrivial.

>
> >> One of the possible ways is to keep a queue of a union of all
> >> messages inside the messenger itself, and each user of the messenger
> >> has to ask for a free request.  A fixed size queue also implies back
> >> pressure: if no free requests exist in the queue the user has to
> >> wait, thus throttling can also be easily implemented simply by
> >> changing the queue size at run-time.
> >
> >
> > Nope nope nope. This is not a realistic option given the
> > interdependencies messages can have.
>
> Sorry, I don't understand what exactly is not realistic, could you
> please be a bit more specific?

It is quite possible for OSDs to have chains of message dependencies
that would cause deadlocks if the system has a fixed message count
available to it. For instance, osd.a might get a client request that
requires a pull from a peer osd.b; if osd.a uses its last free message
to send that request, osd.b doesn't have an empty message slot
available and is blocked on a third osd.c, which is itself out of
messages and waiting on a reply from osd.a. Deadlock!

You'll notice that the OSDs apply a Policy throttler in the messaging
layer to requests from clients but allow unlimited messages from
peers, precisely because of this.
-Greg


