Hi Gregory,

On 2018-11-09 22:23, Gregory Farnum wrote:
> On Thu, Nov 8, 2018 at 7:55 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>> Hi all,
>>
>> I would like to share some performance experiments based on the new fio
>> engine [1], which tests the bare Ceph messenger without any disk IO or
>> libraries involved, only the messenger and subsidiary classes from
>> src/msg/async/*.
>>
>> Firstly I would like to say that I am completely new to Ceph and started
>> investigating performance numbers without any clue or deep understanding
>> of the messenger framework; the idea was to use a profiler and then apply
>> hacky quick fixes in order to squeeze everything from bandwidth or
>> latency and send messages as fast as possible.
>>
>> Without any changes applied, based on latest master, with only one line
>> in ceph.conf:
>>
>>   ms_type=async+posix
>>
>> and with the following fio config:
>>
>>   bs=4k
>>   size=3g
>>   iodepth=128
>
> The FIO messenger test is interesting, but I think the few tests we've
> seen out of it have been pretty out of whack with other existing tests;
> Ricardo might know more.
>
>> I got reference numbers on a loopback connection with the async+posix
>> messenger:
>>
>>   IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
>>
>> I have to mention that at this point of the measurements the fio engine
>> allocates and deallocates MOSDOp and MOSDOpReply for each IO, repeating
>> the behavior of other Ceph components. A bit later I will provide fio
>> numbers when all messages (requests and replies) are cached, i.e. the
>> whole queue of fixed size is preallocated.
>>
>> 1. First profiling showed that a lot of time is spent in crc calculation
>> for data and header. The following lines are added to ceph.conf:
>>
>>   ms_crc_data=false
>>   ms_crc_header=false
>>
>> And these are the fio numbers after the new config is applied:
>>
>>   before: IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
>>   after:  IOPS=110k,  BW=429MiB/s, Lat=1.164ms
>>
>> My first question: what is the reason to calculate and then check crc
>> for messages which are sent over a reliable network? ~100MB/s and ~30k
>> IOPS is quite a high price for making an already reliable connection
>> "more reliable".
>
> We have seen network errors pretty consistently when we disable our own
> internal CRCs. I'm not sure if we collect numbers on the failures we see
> in practical use, but without data this is not a configuration option we
> can change for real users. :(
Could it be that crc32 is disabled at the Ethernet link layer or on the NICs in the networks where you saw these errors? Otherwise this is weird.
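(As an aside, for anyone who wants to reproduce the numbers above: the settings quoted so far boil down to roughly the following. This is only a sketch; the ioengine line is a placeholder for the engine from [1], and a real job file may need a few more options.)

  # ceph.conf
  [global]
      ms_type = async+posix
      # experiment 1: disable messenger checksums
      ms_crc_data = false
      ms_crc_header = false

  # fio job file (sketch)
  [messenger-test]
      ioengine = <engine from [1]>   # placeholder
      bs = 4k
      size = 3g
      iodepth = 128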
>> 2. The other thing which can be improved is throttling: by default, on
>> each IO Throttle::get_or_fail() is called, which invokes
>> pthread_mutex_lock() and PerfCounters::inc(). Since there is no
>> contention on the mutex, I suspect CPU cache misses. When the following
>> line is applied to ceph.conf:
>>
>>   ms_dispatch_throttle_bytes=0
>>
>> fio shows these numbers:
>>
>>   before: IOPS=110k, BW=429MiB/s, Lat=1.164ms
>>   after:  IOPS=114k, BW=444MiB/s, Lat=1.125ms
>>
>> And the following is the output of `perf stat`:
>>
>> Before:
>>
>>    13057.583388  task-clock:u (msec)      #   1.609 CPUs utilized
>>  26,042,061,515  cycles:u                 #   1.994 GHz                    (57.77%)
>>  40,643,744,150  instructions:u           #   1.56  insn per cycle         (71.79%)
>>     815,662,912  cache-references:u       #  62.467 M/sec                  (71.64%)
>>      12,926,237  cache-misses:u           #   1.585 % of all cache refs    (70.93%)
>>  12,695,706,281  L1-dcache-loads:u        # 972.286 M/sec                  (70.87%)
>>     455,625,889  L1-dcache-load-misses:u  #   3.59% of all L1-dcache hits  (71.30%)
>>   8,263,315,484  L1-dcache-stores:u       # 632.837 M/sec                  (57.49%)
>>
>> After:
>>
>>    12516.889311  task-clock:u (msec)      #   1.631 CPUs utilized
>>  24,987,047,978  cycles:u                 #   1.996 GHz                    (57.01%)
>>  40,072,709,633  instructions:u           #   1.60  insn per cycle         (71.49%)
>>     792,468,416  cache-references:u       #  63.312 M/sec                  (70.94%)
>>       8,494,440  cache-misses:u           #   1.072 % of all cache refs    (71.60%)
>>  12,424,744,615  L1-dcache-loads:u        # 992.638 M/sec                  (71.82%)
>>     438,946,415  L1-dcache-load-misses:u  #   3.53% of all L1-dcache hits  (71.39%)
>>   8,199,282,875  L1-dcache-stores:u       # 655.058 M/sec                  (57.24%)
>>
>> Overall cache-misses are slightly reduced along with cache-references,
>> thus the rate of dcache-loads is increased: 992.638 M/sec against
>> 972.286 M/sec.
>>
>> This performance drop on throttling does not depend on the actual value
>> set for the ms_dispatch_throttle_bytes config option, because the
>> fio_ceph_messenger engine uses only the fast dispatch path, so even if
>> ms_dispatch_throttle_bytes=9999999999 is set this does not bring any
>> visible effect. But when the option is set to 0, execution follows
>> another path and we immediately return from Throttle::get_or_fail()
>> without any attempt to take locks or atomically increase counter values.
>> A future fix which could increase overall Ceph performance is to keep
>> perf counters in thread-local storage, thus avoiding atomic ops.
>
> Improving the perfcounters when they cause issues is definitely something
> worth exploring!
>
>> 3. As I mentioned earlier, it is worth reducing the allocation/
>> deallocation rate of messages by preallocating them. I did a small patch
>> [2] and preallocate the whole queue on the client and server sides of
>> the fio engine. The other thing worth mentioning is that on message
>> completion I do not call put(), but call a new completion callback (the
>> completion hook does not fit well, since it is called in the destructor,
>> i.e. again extra allocations/deletions for each IO). It is always a good
>> thing to reduce atomic incs/decs on a fast path, and since I fully
>> control a message and never free it I do not need a quite expensive
>> get()/put() on each IO. This is what I got:
>>
>>   before: IOPS=114k, BW=444MiB/s, Lat=1.125ms
>>   after:  IOPS=132k, BW=514MiB/s, Lat=0.973ms
>>
>> In the current ceph master branch sizeof(MOSDOp) is 824 bytes and
>> sizeof(MOSDOpReply) is 848 bytes, which of course is worth keeping in a
>> fixed-size queue.
>
> I am skeptical that this is actually a useful avenue. These messages will
> always be accompanied by, you know, actual data IO which has a good chance
> of dominating, and the relative numbers we need for different message
> "front" buffer types are going to vary.
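On the thread-local perfcounters idea quoted above, since you agree it is worth exploring: below is a minimal sketch of the direction I mean. The names are made up and this is not Ceph's PerfCounters API; the point is only that the hot path touches a plain thread-local integer and the shared atomic is updated once per batch.

  #include <atomic>
  #include <cstdint>

  // Hypothetical sketch, not Ceph's PerfCounters API: the fast path does a
  // plain store to a thread-local slot, and the value is folded into the
  // shared atomic only every FLUSH_EVERY increments, so the per-IO cost is
  // an uncontended, cache-friendly add with no locks and no atomic RMW.
  namespace tls_counter_sketch {

  inline std::atomic<uint64_t> g_msgs_dispatched{0};  // shared, read by stats

  inline void count_dispatch(uint64_t v = 1) {
    thread_local uint64_t pending = 0;    // private to the calling thread
    constexpr uint64_t FLUSH_EVERY = 1024;

    pending += v;                         // no lock, no atomic op
    if (pending >= FLUSH_EVERY) {
      g_msgs_dispatched.fetch_add(pending, std::memory_order_relaxed);
      pending = 0;
    }
    // A real implementation would also flush on thread exit and on demand,
    // otherwise up to FLUSH_EVERY - 1 increments stay invisible to readers.
  }

  // Readers see values as of the last flush, which is fine for statistics.
  inline uint64_t read_dispatch_count() {
    return g_msgs_dispatched.load(std::memory_order_relaxed);
  }

  }  // namespace tls_counter_sketch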
Regarding your skepticism about preallocation: I did not get it. I am talking about the regular allocation of messages which are sent and then immediately freed on the first put(). My question was: will it be efficient if I simply preallocate the whole queue? Yes, it can bring ~70MB/s. And by "queue" I do not mean data IO, I mean objects, messages, like MOSDOp or MOSDOpReply or any other containers which hold decoded/encoded data/payload and other buffers.

Of course on each IO you have to pick up a free message from the queue and init it with a data pointer, which comes from a layer above, i.e. from a user of the messenger. For example, the following is a chunk of code from the fio engine:

  req_msg = static_cast<MOSDOp *>(io->msg);

  /* No handy method to clear ops before reusage? Ok */
  req_msg->ops.clear();
  req_msg->write(0, io_u->buflen, buflist);

  /* Keep message alive */
  req_msg->get();
  req_msg->get_connection()->send_message(req_msg);

where req_msg is never destroyed; it always exists in a queue. In this example I use only one type of message, MOSDOp, but it can easily be changed to cover all types of messages: just construct/destruct any message on preallocated memory from the queue and you save a new/delete cycle of ~1k bytes per IO.
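To make the queue idea more concrete (including the back-pressure behaviour I describe below), here is a minimal sketch of such a fixed-size pool. The names and structure are hypothetical and simplified; this is not the actual code from [2], and real messages would also need ref-counting and proper re-initialization:

  #include <condition_variable>
  #include <cstddef>
  #include <memory>
  #include <mutex>
  #include <vector>

  // T stands in for MOSDOp/MOSDOpReply (or a variant covering all message
  // types). All objects are allocated once, up front.
  template <typename T>
  class MsgPool {
    std::vector<std::unique_ptr<T>> storage;  // owns the preallocated objects
    std::vector<T*> free_list;                // messages not currently in flight
    std::mutex lock;
    std::condition_variable not_empty;

  public:
    explicit MsgPool(size_t depth) {
      storage.reserve(depth);
      free_list.reserve(depth);
      for (size_t i = 0; i < depth; i++) {
        storage.push_back(std::make_unique<T>());
        free_list.push_back(storage.back().get());
      }
    }

    // Back pressure: if every message is in flight, the caller blocks here
    // until one is returned, so the pool depth caps the queue depth.
    T* acquire() {
      std::unique_lock<std::mutex> l(lock);
      not_empty.wait(l, [this] { return !free_list.empty(); });
      T* msg = free_list.back();
      free_list.pop_back();
      return msg;
    }

    // Called from the completion callback instead of dropping the last
    // reference: the object is recycled, never destroyed.
    void release(T* msg) {
      std::lock_guard<std::mutex> l(lock);
      free_list.push_back(msg);
      not_empty.notify_one();
    }
  };

On each IO one would acquire() a message, reset its ops and set the data pointer (as in the req_msg chunk above), send it, and release() it from the completion callback.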
>> One of the possible ways is to keep a queue of a union of all messages
>> inside the messenger itself, and each user of the messenger has to ask
>> for a free request. A fixed-size queue also implies back pressure: if no
>> free requests exist in the queue the user has to wait, thus throttling
>> can also be easily implemented simply by changing the queue size at
>> run-time.
>
> Nope nope nope. This is not a realistic option given the interdependencies
> messages can have.
Sorry, I don't understand what exactly is not realistic here. Could you please be a bit more specific?

--
Roman