Re: [RFC] Performance experiments with Ceph messenger

[ Whoops, resend in plain text. ]

On Thu, Nov 8, 2018 at 7:55 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>
> Hi all,
>
> I would like to share some performance experiments based on a new fio
> engine [1], which tests the bare Ceph messenger without any disk IO or
> other libraries involved, only the messenger and subsidiary classes
> from src/msg/async/*.
>
> Firstly I would like to say that I am completely new to Ceph and started
> investigating performance numbers without any clue or deep understanding
> of the messenger framework. The idea was to use a profiler and then
> apply hacky quick fixes in order to squeeze everything from bandwidth
> and latency and send messages as fast as possible.
>
> Without any changes applied, based on the latest master, with only one
> line in ceph.conf: ms_type=async+posix, and with the following fio
> config:
>
>    bs=4k
>    size=3g
>    iodepth=128
>
> I got reference numbers on a loopback connection with the async+posix
> messenger:
>
>     IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
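>
> For completeness, the full job file looks roughly like this (the
> engine-specific names below are my guess at the spelling used in [1];
> the rest is stock fio syntax):
>
>    [global]
>    ioengine=ceph-messenger   ; engine name as assumed from [1]
>    bs=4k
>    size=3g
>    iodepth=128
>
>    [client]
>    rw=write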

The FIO messenger test is interesting but I think the few tests we've
seen out of it have been pretty out of whack with other existing
tests; Ricardo might know more.

>
> I have to mention that at this point of the measurements the fio engine
> allocates and deallocates an MOSDOp and an MOSDOpReply for each IO,
> repeating the behavior of other Ceph components.  A bit later I will
> provide fio numbers when all messages (requests and replies) are
> cached, i.e. the whole queue of fixed size is preallocated.
>
> 1. First profiling showed that a lot of time is spent in CRC
> calculation for data and headers.  The following lines were added to
> ceph.conf:
>
>    ms_crc_data=false
>    ms_crc_header=false
>
> And the following are the fio numbers after the new config is applied:
>
>    before: IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
>     after: IOPS=110k,  BW=429MiB/s, Lat=1.164ms
>
> My first question: what is the reason to calculate and then check CRCs
> for messages which are sent over a reliable network?  ~100MiB/s and
> ~30k IOPS is quite a high price for making an already reliable
> connection "more reliable".

We have seen network errors pretty consistently when we disable our
own internal CRCs. I'm not sure if we collect numbers on the failures
we see in practical use, but without data this is not a configuration
option we can change for real users. :(

>
> 2. The other thing which can be improved is throttling: by default, on
> each IO Throttle::get_or_fail() is called, which invokes
> pthread_mutex_lock() and PerfCounters::inc().  Since there is no
> contention on the mutex, I suspect CPU cache misses.  When the
> following line is applied to ceph.conf:
>
>    ms_dispatch_throttle_bytes=0
>
> Fio shows these numbers:
>
>    before: IOPS=110k, BW=429MiB/s, Lat=1.164ms
>     after: IOPS=114k, BW=444MiB/s, Lat=1.125ms
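>
> To illustrate what the option changes, the logic is roughly this (a
> simplified sketch, not the exact Ceph source):
>
>    #include <cstdint>
>    #include <mutex>
>
>    // ms_dispatch_throttle_bytes becomes 'max' here: a zero limit
>    // disables throttling and returns before touching the mutex or
>    // the perf counters.
>    struct ThrottleSketch {
>      int64_t max;                  // 0 == throttle disabled
>      int64_t current = 0;
>      std::mutex lock;
>
>      bool get_or_fail(int64_t c) {
>        if (max == 0)               // fast path: no lock, no inc()
>          return true;
>        std::lock_guard<std::mutex> l(lock);
>        if (current + c > max)      // over the limit: caller backs off
>          return false;
>        current += c;               // accounting under the mutex
>        return true;                // (plus PerfCounters::inc() in Ceph)
>      }
>    };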
>
> And the following is the output of `perf stat`:
>
> Before:
>
>      13057.583388      task-clock:u (msec)      #   1.609 CPUs utilized
>    26,042,061,515      cycles:u                 #   1.994 GHz                    (57.77%)
>    40,643,744,150      instructions:u           #   1.56  insn per cycle         (71.79%)
>       815,662,912      cache-references:u       #  62.467 M/sec                  (71.64%)
>        12,926,237      cache-misses:u           #   1.585 % of all cache refs    (70.93%)
>    12,695,706,281      L1-dcache-loads:u        # 972.286 M/sec                  (70.87%)
>       455,625,889      L1-dcache-load-misses:u  #   3.59% of all L1-dcache hits  (71.30%)
>     8,263,315,484      L1-dcache-stores:u       # 632.837 M/sec                  (57.49%)
>
> After:
>
>      12516.889311      task-clock:u (msec)      #   1.631 CPUs utilized
>    24,987,047,978      cycles:u                 #   1.996 GHz                    (57.01%)
>    40,072,709,633      instructions:u           #   1.60  insn per cycle         (71.49%)
>       792,468,416      cache-references:u       #  63.312 M/sec                  (70.94%)
>         8,494,440      cache-misses:u           #   1.072 % of all cache refs    (71.60%)
>    12,424,744,615      L1-dcache-loads:u        # 992.638 M/sec                  (71.82%)
>       438,946,415      L1-dcache-load-misses:u  #   3.53% of all L1-dcache hits  (71.39%)
>     8,199,282,875      L1-dcache-stores:u       # 655.058 M/sec                  (57.24%)
>
>
> Overall cache-misses are slightly reduced along with cache-references;
> thus the rate of dcache-loads increases: 992.638 M/sec against
> 972.286 M/sec.
>
> This performance drop on throttling does not depend on the actual value
> set for the ms_dispatch_throttle_bytes config option, because the
> fio_ceph_messenger engine uses only the fast dispatch path, so even if
> ms_dispatch_throttle_bytes=9999999999 is set it does not bring any
> visible effect.  But when the option is set to 0, execution follows
> another path and we immediately return from Throttle::get_or_fail()
> without any attempt to take locks or atomically increase counter
> values.  A future fix which could increase overall Ceph performance is
> to keep perf counters in thread-local storage, thus avoiding atomic
> ops.
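>
> For illustration, the thread-local idea could look roughly like this
> (a hypothetical sketch, nothing like it exists in Ceph today):
>
>    #include <atomic>
>    #include <cstdint>
>
>    constexpr int MAX_THREADS = 64;   // fixed slot table for the sketch
>
>    struct Slot {
>      // one cache line per thread so writers never false-share
>      alignas(64) std::atomic<uint64_t> value{0};
>    };
>
>    Slot g_slots[MAX_THREADS];
>    std::atomic<int> g_next_slot{0};
>    // slots wrap past MAX_THREADS; a real version would manage this
>    thread_local int t_slot = g_next_slot.fetch_add(1) % MAX_THREADS;
>
>    inline void counter_inc(uint64_t v = 1) {
>      auto &c = g_slots[t_slot].value;
>      // single writer per slot: plain load+store, no locked RMW
>      c.store(c.load(std::memory_order_relaxed) + v,
>              std::memory_order_relaxed);
>    }
>
>    uint64_t counter_read() {         // slow path: aggregate on read
>      uint64_t sum = 0;
>      for (auto &s : g_slots)
>        sum += s.value.load(std::memory_order_relaxed);
>      return sum;
>    }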

Improving the perfcounters when they cause issues is definitely
something worth exploring!

>
> 3. As I mentioned earlier, it is worth reducing the
> allocation/deallocation rate of messages by preallocating them.  I did
> a small patch [2] and preallocate the whole queue on the client and
> server sides of the fio engine.  The other thing worth mentioning is
> that on message completion I do not call put(), but call a new
> completion callback (the completion hook does not fit well, since it
> is called in the destructor, i.e. again extra allocations/deletions
> for each IO).  It is always a good thing to reduce atomic incs/decs on
> a fast path, and since I fully control a message and never free it I
> do not need an expensive atomic get() on each IO.  This is what I got:
>
>    before: IOPS=114k, BW=444MiB/s, Lat=1.125ms
>     after: IOPS=132k, BW=514MiB/s, Lat=0.973ms
>
> In the current ceph master branch sizeof(MOSDOp) is 824 bytes and
> sizeof(MOSDOpReply) is 848 bytes, which of course makes them worth
> keeping in a fixed-size queue.
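>
> The scheme in [2] boils down to something like this (a simplified,
> single-threaded sketch with hypothetical names; see the patch for the
> real code):
>
>    #include <vector>
>
>    // All messages are allocated once up front; on completion a
>    // callback hands the message back to the pool instead of put()
>    // dropping the last ref and freeing it.
>    template <typename Msg>
>    class MsgPool {
>      std::vector<Msg *> free_;       // preallocated, fixed size
>    public:
>      explicit MsgPool(size_t n) {
>        free_.reserve(n);
>        for (size_t i = 0; i < n; i++)
>          free_.push_back(new Msg()); // one-time allocation
>      }
>      Msg *get() {                    // nullptr == queue exhausted
>        if (free_.empty())
>          return nullptr;
>        Msg *m = free_.back();
>        free_.pop_back();
>        return m;
>      }
>      void complete(Msg *m) {         // completion callback target
>        free_.push_back(m);           // recycle, never free
>      }
>    };
>
> A real version of course needs a lock around the free list.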

I am skeptical that this is actually a useful avenue. These messages
will always be accompanied by, you know, actual data IO which has a
good chance of dominating, and the relative numbers we need for
different message "front" buffer types are going to vary.

> One of the possible ways is to keep a queue of a union of all messages
> inside the messenger itself, and each user of the messenger has to ask
> for a free request.  A fixed-size queue also implies back pressure: if
> no free requests exist in the queue the user has to wait, so throttling
> can also be easily implemented simply by changing the queue size at
> run-time.
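>
> Back pressure then falls out naturally (again a hypothetical sketch,
> extending the pool above): get() blocks until a free request exists,
> so shrinking the pool at run-time (not shown) throttles every user.
>
>    #include <condition_variable>
>    #include <mutex>
>    #include <vector>
>
>    template <typename Msg>
>    class BoundedMsgPool {
>      std::mutex mtx;
>      std::condition_variable cv;
>      std::vector<Msg *> free_;
>    public:
>      explicit BoundedMsgPool(size_t n) {
>        for (size_t i = 0; i < n; i++)
>          free_.push_back(new Msg());
>      }
>      Msg *get() {                       // blocks: the back pressure
>        std::unique_lock<std::mutex> l(mtx);
>        cv.wait(l, [this] { return !free_.empty(); });
>        Msg *m = free_.back();
>        free_.pop_back();
>        return m;
>      }
>      void complete(Msg *m) {
>        { std::lock_guard<std::mutex> l(mtx); free_.push_back(m); }
>        cv.notify_one();
>      }
>    };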

Nope nope nope. This is not a realistic option given the
interdependencies messages can have.
-Greg

>
> 4. The next hacky patch [3] reduces the number of
> buffer::ptr::release() calls and sendmsg() syscalls by preparing one
> msghdr for all queued messages: instead of appending each part of a
> message to temporary buffers, it simply fills in msghdr.msg_iov
> directly and sends it to the kernel in one go.  The queueing is also
> slightly modified: instead of putting a message into a vector and then
> erasing it on the dequeue side, I simply chain messages in a single
> linked list, which does not require any memory allocations, and
> enqueue/dequeue locks a mutex for a very short period of time.
>
>    before: IOPS=132k, BW=514MiB/s, Lat=0.973ms
>     after: IOPS=166k, BW=650MiB/s, Lat=0.769ms
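>
> In code the idea is roughly this (a stripped-down sketch with
> hypothetical names; error handling and short writes are ignored here,
> exactly as in the patch):
>
>    #include <cstddef>
>    #include <sys/socket.h>
>    #include <sys/uio.h>
>
>    enum { MAX_IOV = 1024 };     // IOV_MAX on Linux
>
>    struct QueuedMsg {
>      QueuedMsg  *next;          // intrusive chain, no allocations
>      const void *data;
>      size_t      len;
>    };
>
>    ssize_t send_chain(int sock, QueuedMsg *head)
>    {
>      struct iovec iov[MAX_IOV];
>      size_t n = 0;
>      // point one iovec entry at each queued fragment
>      for (QueuedMsg *m = head; m && n < MAX_IOV; m = m->next) {
>        iov[n].iov_base = const_cast<void *>(m->data);
>        iov[n].iov_len  = m->len;
>        n++;
>      }
>      struct msghdr hdr = {};
>      hdr.msg_iov    = iov;
>      hdr.msg_iovlen = n;
>      // one syscall for the whole chain instead of one per message
>      return ::sendmsg(sock, &hdr, MSG_NOSIGNAL);
>    }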
>
> I have to apologize to the most curious who will still open and look
> at the patches: I did not consider error handling or priority queues,
> and I also did not care about sendmsg(), which can return fewer bytes,
> in which case the iovec has to be advanced.  As I mentioned, I wanted
> to get quick numbers on localhost, and for the sake of simplicity I
> put in asserts and prints to catch abnormal behavior (of course on
> localhost nothing terrible happens, as usual :)
>
> Finally, here is the summary table:
>
>    IOPS=80.1k, BW=313MiB/s, Lat=1.598ms  -- empty config, no changes applied
>    IOPS=110k,  BW=429MiB/s, Lat=1.164ms  -- disable crc for tcp/ip transport
>    IOPS=114k,  BW=444MiB/s, Lat=1.125ms  -- disable throttle code paths
>    IOPS=132k,  BW=514MiB/s, Lat=0.973ms  -- preallocate messages in a queue
>    IOPS=166k,  BW=650MiB/s, Lat=0.769ms  -- reduce number of temporary buffer
>                                             allocations, reduce number of
>                                             sendmsg() syscalls, chain messages
>
> I would like to discuss the necessity of such messenger improvements
> and possible steps forward.
>
> Thanks.
>
> [1] https://github.com/ceph/ceph/pull/24678
> [2] https://github.com/rouming/ceph/commit/3e34d9271ae
> [3] https://github.com/rouming/ceph/commit/90831f241cb
>
> --
> Roman
>
>


