RE: [RFC] Performance experiments with Ceph messenger

1. As for CRC, your results are suspiciously skewed - you might want to check (with gdb) whether you're hitting the software-only path or the hardware-accelerated one. Besides, in real workloads it's less of an issue, because the payload CRC is calculated by different code segments and a once-calculated payload CRC may be reused later by the messenger (that's why there's a CRC cache in the bufferlist code). That also explains why disabling CRC for the messenger doesn't give as much improvement on bluestore as on filestore -- assuming bluestore is configured to calculate CRC32c checksums for on-disk data.
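To illustrate, a rough sketch of that caching idea (hypothetical names, not the actual bufferlist code):

   // Rough sketch of the payload-CRC caching idea -- hypothetical names,
   // not the real bufferlist interface.  The checksum is computed once and
   // later consumers (e.g. the messenger) reuse the cached value.
   #include <cstddef>
   #include <cstdint>
   #include <optional>
   #include <string>

   // Simple bitwise software CRC32C stand-in; real code would dispatch to
   // an SSE4.2/ARMv8-CRC accelerated version when the CPU supports it.
   static uint32_t crc32c_sw(uint32_t crc, const char *buf, size_t len) {
     crc = ~crc;
     for (size_t i = 0; i < len; ++i) {
       crc ^= static_cast<unsigned char>(buf[i]);
       for (int k = 0; k < 8; ++k)
         crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
     }
     return ~crc;
   }

   struct CachedPayload {
     std::string data;
     std::optional<uint32_t> cached_crc;  // filled on first request

     uint32_t crc() {
       if (!cached_crc)                   // compute once...
         cached_crc = crc32c_sw(0, data.data(), data.size());
       return *cached_crc;                // ...reuse on every later call
     }
   };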
2. Perf counters (or rather, the sheer number of them) are a real issue; so far nobody has done any kind of sanity audit, so it's unclear whether all of them are really necessary/useful. Usually when someone decides to add them, they just issue a PR and soon they're included without any further discussion. At the very least, some of the counters shouldn't be enabled in production-class binaries.
3 and 4. You're absolutely on point here - preallocating and reusing data structures is the way to go. Back in 2015 everyone agreed that Ceph has really bad memory management - lots of allocations, deallocations and memory block copies across different paths. The only "big" fixes people came up with were increasing the TCMalloc thread cache to 128MB (up from the default 32MB) and replacing TCMalloc with Jemalloc, which is impossible with recent releases (http://tracker.ceph.com/issues/20557). Fixing the code to use proper memory management strategies is difficult in Ceph's case because of the bufferlist class, which makes it easy to do complex things, but not necessarily in a high-performance way.
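For reference, the 128MB thread cache bump is typically applied through the gperftools environment variable, e.g. in /etc/sysconfig/ceph or /etc/default/ceph depending on the distribution:

   # raise the TCMalloc thread cache from the 32MB default to 128MB
   TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728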

Over a year ago Red Hat started rewriting large parts of the Ceph code to use the Seastar framework, which gave hope that the above would change. But it's unclear whether that will be the case, what performance gains users should expect, and whether it'll require users to redesign their clusters or do other costly work. On the other hand, such a large rewrite puts a huge question mark over any performance improvement work, as nobody can tell you with 100% certainty that your work won't be dropped during the transition to Seastar. Besides, careful, optimal buffer handling is difficult *especially* in complex multi-threaded software like Ceph, and it's way less fun than, for example, using tons of language features to get rid of integer divisions.

-- 
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://ovhcloud.com/

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx <ceph-devel-owner@xxxxxxxxxxxxxxx> On Behalf Of Roman Penyaev
Sent: Thursday, November 8, 2018 4:54 PM
To: ceph-devel@xxxxxxxxxxxxxxx
Subject: [RFC] Performance experiments with Ceph messenger

Hi all,

I would like to share some performance experiments based on a new fio engine [1], which tests the bare Ceph messenger without any disk IO or extra libraries involved - only the messenger and subsidiary classes from src/msg/async/*.

Firstly I would like to say that I am completely new to Ceph and started investigating performance numbers without any deep understanding of the messenger framework. The idea was to use a profiler and then apply quick, hacky fixes in order to squeeze everything out of bandwidth and latency and send messages as fast as possible.

Without any changes applied, based on latest master, with only one line added to ceph.conf (ms_type=async+posix) and with the following fio config:

   bs=4k
   size=3g
   iodepth=128

I got the following reference numbers on a loopback connection with the async+posix messenger:

    IOPS=80.1k, BW=313MiB/s, Lat=1.598ms

I have to mention that at this point of the measurements the fio engine allocates and deallocates MOSDOp and MOSDOpReply for each IO, repeating the behavior of other Ceph components.  A bit later I will provide fio numbers with all messages (requests and replies) cached, i.e. with the whole fixed-size queue preallocated.

1. First profiling showed that a lot of time is spent in CRC calculation for data and header.  The following lines are added to ceph.conf:

   ms_crc_data=false
   ms_crc_header=false

And these are the fio numbers after the new config is applied:

   before: IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
    after: IOPS=110k,  BW=429MiB/s, Lat=1.164ms

My first question: what is the reason to calculate and then check CRC for messages which are sent over a reliable network?  ~100MiB/s and ~30k IOPS is quite a high price for making an already reliable connection "more reliable".

2. The other thing which can be improved is throttling: by default, on each IO Throttle::get_or_fail() is called, which invokes pthread_mutex_lock() and PerfCounters::inc().  Since there is no contention on the mutex, I suspect CPU cache misses.  When the following line is applied to ceph.conf:

   ms_dispatch_throttle_bytes=0

Fio shows these numbers:

   before: IOPS=110k, BW=429MiB/s, Lat=1.164ms
    after: IOPS=114k, BW=444MiB/s, Lat=1.125ms

And the following is the output of `perf stat`:

Before:

     13057.583388      task-clock:u (msec)       #    1.609 CPUs utilized
   26,042,061,515      cycles:u                  #    1.994 GHz                      (57.77%)
   40,643,744,150      instructions:u            #    1.56  insn per cycle           (71.79%)
      815,662,912      cache-references:u        #   62.467 M/sec                    (71.64%)
       12,926,237      cache-misses:u            #    1.585 % of all cache refs      (70.93%)
   12,695,706,281      L1-dcache-loads:u         #  972.286 M/sec                    (70.87%)
      455,625,889      L1-dcache-load-misses:u   #    3.59% of all L1-dcache hits    (71.30%)
    8,263,315,484      L1-dcache-stores:u        #  632.837 M/sec                    (57.49%)

After:

     12516.889311      task-clock:u (msec)       #    1.631 CPUs utilized
   24,987,047,978      cycles:u                  #    1.996 GHz                      (57.01%)
   40,072,709,633      instructions:u            #    1.60  insn per cycle           (71.49%)
      792,468,416      cache-references:u        #   63.312 M/sec                    (70.94%)
        8,494,440      cache-misses:u            #    1.072 % of all cache refs      (71.60%)
   12,424,744,615      L1-dcache-loads:u         #  992.638 M/sec                    (71.82%)
      438,946,415      L1-dcache-load-misses:u   #    3.53% of all L1-dcache hits    (71.39%)
    8,199,282,875      L1-dcache-stores:u        #  655.058 M/sec                    (57.24%)


Overall, cache misses are slightly reduced along with cache references, and thus the rate of dcache loads is increased: 992.638 M/sec against 972.286 M/sec.

This performance drop from throttling does not depend on the actual value of the ms_dispatch_throttle_bytes config option, because the fio_ceph_messenger engine uses only the fast dispatch path, so even setting ms_dispatch_throttle_bytes=9999999999 does not have any visible effect.  But when the option is set to 0, execution follows another path and we immediately return from Throttle::get_or_fail() without any attempt to take locks or atomically increase counter values.  A future fix which could increase overall Ceph performance is to keep perf counters in thread-local storage, thus avoiding the atomic ops.
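Just to sketch what I mean (a hypothetical class, not the existing PerfCounters API): each thread bumps its own padded slot without any locked read-modify-write instruction, and the per-thread slots are only summed up when the counter is actually read:

   // Sketch of thread-local perf counters -- a hypothetical class, not the
   // current PerfCounters implementation.  Slot lifetime management is
   // ignored here for brevity.
   #include <atomic>
   #include <cstdint>
   #include <memory>
   #include <mutex>
   #include <utility>
   #include <vector>

   class TlsCounter {
     struct alignas(64) Slot {                  // padded against false sharing
       std::atomic<uint64_t> v{0};
     };

     std::mutex lock;
     std::vector<std::unique_ptr<Slot>> slots;  // one slot per writer thread

     Slot &my_slot() {
       // Per-thread cache of "which slot belongs to this counter instance".
       thread_local std::vector<std::pair<TlsCounter*, Slot*>> cache;
       for (auto &p : cache)
         if (p.first == this)
           return *p.second;
       std::lock_guard<std::mutex> g(lock);     // slow path: first use only
       slots.push_back(std::make_unique<Slot>());
       Slot *s = slots.back().get();
       cache.emplace_back(this, s);
       return *s;
     }

   public:
     // Fast path: only the owning thread writes its slot, so a relaxed
     // load+store pair is enough -- no locked RMW instruction per inc().
     void inc(uint64_t n = 1) {
       Slot &s = my_slot();
       s.v.store(s.v.load(std::memory_order_relaxed) + n,
                 std::memory_order_relaxed);
     }

     // Called rarely, e.g. when perf counters are dumped.
     uint64_t read() {
       std::lock_guard<std::mutex> g(lock);
       uint64_t sum = 0;
       for (auto &s : slots)
         sum += s->v.load(std::memory_order_relaxed);
       return sum;
     }
   };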

3. As I mentioned earlier, it is worth reducing the allocation/deallocation rate of messages by preallocating them.  I did a small patch [2] which preallocates the whole queue on the client and server sides of the fio engine.  The other thing worth mentioning is that on message completion I do not call put(), but instead call a new completion callback (the completion hook does not fit well, since it is called in the destructor, i.e. again extra allocations/deletions for each IO).
It is always a good thing to reduce atomic incs/decs on a fast path, and since I fully control a message and never free it, I do not need the quite expensive reference count inc/dec on each IO.  This is what I got:

   before: IOPS=114k, BW=444MiB/s, Lat=1.125ms
    after: IOPS=132k, BW=514MiB/s, Lat=0.973ms

In the current Ceph master branch, sizeof(MOSDOp) is 824 bytes and sizeof(MOSDOpReply) is 848 bytes, which of course makes them worth keeping in a fixed-size queue.  One possible way is to keep a queue of a union of all message types inside the messenger itself, with each user of the messenger asking it for a free request.  A fixed-size queue also implies back pressure: if no free requests exist in the queue, the user has to wait, so throttling can also be implemented easily by simply changing the queue size at run-time.
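For illustration, such a preallocated pool with back pressure could look roughly like this (hypothetical types, only a sketch of the idea, not a proposal for the actual messenger interface):

   // Sketch of a fixed-size, preallocated message pool with back pressure --
   // hypothetical types, not the real MOSDOp/MOSDOpReply handling.  get()
   // blocks while all slots are in use, put() (called from the completion
   // callback) returns a slot and wakes a waiter, so the pool size alone
   // acts as a throttle.
   #include <condition_variable>
   #include <cstddef>
   #include <mutex>
   #include <vector>

   struct Message {                       // stand-in for a union of messages
     char raw[1024];
   };

   class MessagePool {
     std::mutex lock;
     std::condition_variable cond;
     std::vector<Message> storage;        // allocated once, up front
     std::vector<Message*> free_list;

   public:
     explicit MessagePool(size_t n) : storage(n) {
       free_list.reserve(n);
       for (auto &m : storage)
         free_list.push_back(&m);
     }

     Message *get() {                     // back pressure: wait for a free slot
       std::unique_lock<std::mutex> l(lock);
       cond.wait(l, [this] { return !free_list.empty(); });
       Message *m = free_list.back();
       free_list.pop_back();
       return m;
     }

     void put(Message *m) {               // completion returns the slot
       {
         std::lock_guard<std::mutex> l(lock);
         free_list.push_back(m);
       }
       cond.notify_one();
     }
   };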

4. The next hacky patch [3] reduces the number of buffer::ptr::release() calls and sendmsg() syscalls by preparing one msghdr for all queued messages: instead of appending each part of a message to temporary buffers, it simply fills in msghdr.msg_iov directly and sends everything to the kernel in one go.  The queueing is also slightly modified: instead of putting a message into a vector and then erasing it on the dequeue side, I simply chain messages in a singly linked list, which does not require any memory allocations, and enqueue/dequeue locks a mutex only for a very short period of time.
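Roughly, the idea looks like this (simplified sketch with made-up types; error handling and short writes are omitted here, just like in the patch):

   // Sketch of the single-sendmsg() idea from [3] -- made-up types, no error
   // or short-write handling.  Queued messages are chained through an
   // intrusive next pointer (no allocation on enqueue), their segments are
   // dropped directly into msghdr.msg_iov, and the whole chain goes to the
   // kernel in one syscall.
   #include <sys/socket.h>
   #include <sys/types.h>
   #include <sys/uio.h>
   #include <cstddef>

   struct OutMessage {
     struct iovec segs[2];                // e.g. header + payload
     int nsegs;
     OutMessage *next;                    // intrusive singly linked chain
   };

   static ssize_t send_chain(int fd, OutMessage *head) {
     constexpr size_t MAX_IOV = 1024;     // conservative stand-in for IOV_MAX
     struct iovec iov[MAX_IOV];
     size_t niov = 0;

     // Fill msg_iov directly from the chained messages: no temporary
     // buffers, no copies of the payload itself.
     for (OutMessage *m = head;
          m && niov + static_cast<size_t>(m->nsegs) <= MAX_IOV;
          m = m->next)
       for (int i = 0; i < m->nsegs; ++i)
         iov[niov++] = m->segs[i];

     struct msghdr msg = {};
     msg.msg_iov = iov;
     msg.msg_iovlen = niov;

     // One syscall for everything queued so far; a real implementation must
     // handle partial writes by advancing the iovec and retrying.
     return sendmsg(fd, &msg, MSG_NOSIGNAL);
   }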

   before: IOPS=132k, BW=514MiB/s, Lat=0.973ms
    after: IOPS=166k, BW=650MiB/s, Lat=0.769ms

I have to apologize to the most curious who will still open and look at the patches: I did not consider error handling or priority queues, and I did not care about sendmsg() returning fewer bytes than requested, in which case the iovec has to be advanced.
As I mentioned, I wanted to get quick numbers on localhost, and for the sake of simplicity I put in asserts and prints to catch abnormal behavior (of course on localhost nothing terrible happens, as usual :)

At the end here is the summary table:

   IOPS=80.1k, BW=313MiB/s, Lat=1.598ms  -- empty config, no changes applied
   IOPS=110k,  BW=429MiB/s, Lat=1.164ms  -- disable crc for tcp/ip transport
   IOPS=114k,  BW=444MiB/s, Lat=1.125ms  -- disable throttle code paths
   IOPS=132k,  BW=514MiB/s, Lat=0.973ms  -- preallocate messages in a queue
   IOPS=166k,  BW=650MiB/s, Lat=0.769ms  -- reduce number of temporary buffers
                                            allocations, reduce number of
                                            sendmsg() syscalls, chain messages

I would like to discuss the necessity of such messenger improvements and possible steps forward.

Thanks.

[1] https://github.com/ceph/ceph/pull/24678
[2] https://github.com/rouming/ceph/commit/3e34d9271ae
[3] https://github.com/rouming/ceph/commit/90831f241cb

--
Roman





