Hi Mark,
On 2018-11-08 19:41, Mark Nelson wrote:
Hi Roman!
responses in-line
On 11/8/18 9:54 AM, Roman Penyaev wrote:
Hi all,
I would like to share some performance experiments based on a new fio engine [1], which tests the bare Ceph messenger without any disk IO or libraries involved, only the messenger and subsidiary classes from src/msg/async/*.
Firstly I would like to say that I am completely new to Ceph and started investigating performance numbers without any clue or deep understanding of the messenger framework; the idea was to use a profiler and then apply quick hacky fixes in order to squeeze everything from bandwidth or latency and send messages as fast as possible.
Without any changes applied, based on the latest master, with only one line in ceph.conf: ms_type=async+posix, and with the following fio config:
bs=4k
size=3g
iodepth=128
I got these reference numbers on a loopback connection with the async+posix messenger:
IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
This sounds about like what I'd expect given what I've seen in the past. Usually we're bottle-necked elsewhere, but 80k IOPs seems like it's in the right ballpark.
I have to mention that at this point of the measurements the fio engine allocates and deallocates MOSDOp and MOSDOpReply for each IO, repeating the behavior of other Ceph components. A bit later I will provide fio numbers where all messages (requests and replies) are cached, i.e. the whole queue of fixed size is preallocated.
1. The first profiling run showed that a lot of time is spent in crc calculation for data and header. The following lines are added to ceph.conf:
ms_crc_data=false
ms_crc_header=false
And these are the fio numbers after the new config is applied:
before: IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
after:  IOPS=110k, BW=429MiB/s, Lat=1.164ms
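To illustrate what these two options turn off: for every message a crc32c is computed over the header and the data payload on the send side and re-checked on the receive side, conceptually like this sketch (a plain software crc32c for illustration only; Ceph has its own implementation, hardware-accelerated where possible):

// Rough illustration of the per-message work that ms_crc_data / ms_crc_header
// disable. The buffer sizes and the bitwise crc loop are illustrative, not
// the actual messenger code path.
#include <cstdint>
#include <cstddef>
#include <cstdio>
#include <vector>

static uint32_t crc32c(uint32_t crc, const uint8_t *buf, size_t len) {
  crc = ~crc;
  while (len--) {
    crc ^= *buf++;
    for (int k = 0; k < 8; k++)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));  // Castagnoli poly
  }
  return ~crc;
}

int main() {
  std::vector<uint8_t> header(64, 0xab);   // stand-in for the message header
  std::vector<uint8_t> data(4096, 0xcd);   // 4k payload, as in the fio job

  // send side: computed once per message when the crc options are enabled
  uint32_t header_crc = crc32c(0, header.data(), header.size());
  uint32_t data_crc   = crc32c(0, data.data(), data.size());

  // receive side: recomputed and compared against the values from the wire
  bool ok = header_crc == crc32c(0, header.data(), header.size()) &&
            data_crc   == crc32c(0, data.data(), data.size());
  std::printf("crc ok: %d (header=%08x data=%08x)\n", ok, header_crc, data_crc);
  return ok ? 0 : 1;
}

Even with a hardware-accelerated crc32c this is extra per-IO work on both ends of the connection, which is where the difference above comes from.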
My first question: what is the reason to calculate and then check crc for messages which are sent over a reliable network? ~100MiB/s and ~30k IOPS is quite a high price for making an already reliable connection "more reliable".
There was some discussion about this a while back as part of an overall discussion about compression/checksum in bluestore. I think there are probably valid points on both sides of this topic:
https://www.spinics.net/lists/ceph-devel/msg29637.html
I see. Referring to this http://noahdavids.org/self_published/CRC_and_checksum.html, I can only assume that the frame check sequence (crc32) for the data link layer (ethernet in this case) is disabled; otherwise the data is being corrupted in kernel buffers just after the crc has been checked, which is another issue.
2. The other thing which can be improved is throttling: by default on each IO Throttle::get_or_fail() is called, which invokes pthread_mutex_lock() and PerfCounters::inc(). Since there is no contention on the mutex, I suspect CPU cache misses. When the following line is applied to ceph.conf:
It would be interesting to verify that there is no lock contention with gdbpmp or adam's profiler.
I did, with mutrace. Regardless of what any profiler outputs, fio submits IO from a single thread, so per connection there is one throttler with a mutex object, which is locked and unlocked by only one thread.
ms_dispatch_throttle_bytes=0
Fio shows these numbers:
before: IOPS=110k, BW=429MiB/s, Lat=1.164ms
after: IOPS=114k, BW=444MiB/s, Lat=1.125ms
And the following is the output of `perf stat`:
Before:
      13057.583388      task-clock:u (msec)       #    1.609 CPUs utilized
    26,042,061,515      cycles:u                  #    1.994 GHz                     (57.77%)
    40,643,744,150      instructions:u            #    1.56  insn per cycle          (71.79%)
       815,662,912      cache-references:u        #   62.467 M/sec                   (71.64%)
        12,926,237      cache-misses:u            #    1.585 % of all cache refs     (70.93%)
    12,695,706,281      L1-dcache-loads:u         #  972.286 M/sec                   (70.87%)
       455,625,889      L1-dcache-load-misses:u   #    3.59% of all L1-dcache hits   (71.30%)
     8,263,315,484      L1-dcache-stores:u        #  632.837 M/sec                   (57.49%)

After:
      12516.889311      task-clock:u (msec)       #    1.631 CPUs utilized
    24,987,047,978      cycles:u                  #    1.996 GHz                     (57.01%)
    40,072,709,633      instructions:u            #    1.60  insn per cycle          (71.49%)
       792,468,416      cache-references:u        #   63.312 M/sec                   (70.94%)
         8,494,440      cache-misses:u            #    1.072 % of all cache refs     (71.60%)
    12,424,744,615      L1-dcache-loads:u         #  992.638 M/sec                   (71.82%)
       438,946,415      L1-dcache-load-misses:u   #    3.53% of all L1-dcache hits   (71.39%)
     8,199,282,875      L1-dcache-stores:u        #  655.058 M/sec                   (57.24%)
Overall, cache-misses are slightly reduced along with cache-references, and thus the rate of dcache-loads is increased: 992.638 M/sec against 972.286 M/sec.
There's probably a fair amount of noise here, but if we take the numbers at face value, it's a ~3.6% IOPS gain for a 0.5% decrease in cache misses and a very slight reduction in dcache-loads, right? I'm still curious about the locking behavior.
Indeed, it looks like noise, but the pattern is always the same: a ~15MB/s increase, and the cache-misses counter becomes smaller. Probably for such a minor thing hw counters won't show us more. I also did some experiments making the get_or_fail() path lockless - the result improved, but only partially. So it looks like the atomic ops for the mutex and perf counters on this path cause CPU cacheline refetches.
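Conceptually, the lockless variant is something like the following sketch (illustrative only, not the actual Ceph Throttle code, which also has to handle waiters, wakeups and perf counters):

// Sketch of a lock-free get_or_fail(): admit `c` bytes if they fit under
// `max`, otherwise fail, using a single CAS loop instead of a mutex.
#include <atomic>
#include <cstdint>

class SimpleThrottle {
  std::atomic<int64_t> count{0};
  const int64_t max;
public:
  explicit SimpleThrottle(int64_t m) : max(m) {}

  bool get_or_fail(int64_t c) {
    if (max == 0)              // throttle disabled: the cheap path taken when
      return true;             // ms_dispatch_throttle_bytes=0
    int64_t cur = count.load(std::memory_order_relaxed);
    do {
      if (cur + c > max)
        return false;          // would exceed the limit
    } while (!count.compare_exchange_weak(cur, cur + c,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed));
    return true;
  }

  void put(int64_t c) {
    count.fetch_sub(c, std::memory_order_release);
  }
};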
This performance drop on throttling does not depend on the actual value set for the ms_dispatch_throttle_bytes config option, because the fio_ceph_messenger engine uses only the fast dispatch path, so even if ms_dispatch_throttle_bytes=9999999999 is set, there is no visible effect. But when the option is set to 0, execution follows another path and we immediately return from Throttle::get_or_fail() without any attempt to take locks or atomically increase counter values. A future fix which could increase overall Ceph performance is to keep perf counters in thread-local storage, thus avoiding atomic ops.
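The thread-local counters idea could look roughly like the following sketch (made-up names, not the existing PerfCounters interface; for simplicity it assumes a single counter instance, since the thread_local below is per function, not per object):

// Hot path does a plain per-thread increment (no mutex, no lock-prefixed
// atomic RMW); a rare reader sums all per-thread slots.
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

class TlsCounter {
  struct alignas(64) Slot {            // own cache line per thread
    std::atomic<uint64_t> v{0};
  };

  std::mutex reg_lock;                 // taken only on first use by a thread
  std::vector<Slot*> slots;

  Slot *my_slot() {
    thread_local Slot *slot = nullptr;
    if (!slot) {
      slot = new Slot();               // never freed in this sketch
      std::lock_guard<std::mutex> l(reg_lock);
      slots.push_back(slot);
    }
    return slot;
  }

public:
  // hot path: relaxed load + store compile down to plain moves, and only
  // the owning thread ever writes to its slot
  void inc(uint64_t d = 1) {
    auto &v = my_slot()->v;
    v.store(v.load(std::memory_order_relaxed) + d, std::memory_order_relaxed);
  }

  // slow path: e.g. when a perf dump is requested
  uint64_t read() {
    std::lock_guard<std::mutex> l(reg_lock);
    uint64_t sum = 0;
    for (auto *s : slots)
      sum += s->v.load(std::memory_order_relaxed);
    return sum;
  }
};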
3. As I mentioned earlier, it is worth reducing the allocation/deallocation rate of messages by preallocating them. I did a small patch [2] and preallocate the whole queue on the client and server sides of the fio engine. The other thing worth mentioning is that on message completion I do not call put(), but instead call a new completion callback (the completion hook does not fit well, since it is called in the destructor, i.e. again extra allocations/deletions for each IO). It is always a good thing to reduce atomic incs/decs on a fast path, and since I fully control a message and never free it, I do not need a rather expensive inc/get on each IO. This is what I got:
before: IOPS=114k, BW=444MiB/s, Lat=1.125ms
after: IOPS=132k, BW=514MiB/s, Lat=0.973ms
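To make the pattern concrete, the completion path looks roughly like this sketch (made-up names, not the actual fio engine code):

// Instead of the message being destroyed via put() (atomic ref drop,
// destructor, then a fresh allocation for the next IO), the engine owns the
// message forever and a completion callback just marks it reusable.
#include <functional>

struct FioRequest {
  // ... encoded request/reply payload lives here, allocated once ...
  std::function<void(FioRequest*)> on_complete;  // set once at preallocation
  bool in_flight = false;
};

// called by the receiving side when the reply for `req` has arrived
inline void complete_request(FioRequest *req) {
  req->in_flight = false;
  if (req->on_complete)
    req->on_complete(req);   // no put(), no destructor, no allocation
}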
Neat! Preallocating messages seemed like a good idea but I wasn't sure how it would pan out in practice. You did it and saw an improvement. Good job! :) Beyond speed in this test, it would be interesting to see the effect on an active OSD.
This fio engine shows only the ideal bandwidth/latency results which we can get by loading the messenger without any other delays. So it is an upper bound: if you submit messages and get completions immediately (no locks, allocations, etc.) you get this number of IOPS. But for a real scenario I can't say whether it is really doable or not. That was exactly the intention of this rfc: to raise a discussion. Seeing the other replies in this thread, it seems preallocation is something not desirable (or possible, or doable).
In the current ceph master branch sizeof(MOSDOp) is 824 bytes and sizeof(MOSDOpReply) is 848 bytes, which of course is worth keeping in a fixed-size queue. One of the possible ways is to keep a queue of a union of all messages inside the messenger itself, and each user of the messenger has to ask for a free request. A fixed-size queue also implies back pressure: if no free requests exist in the queue, the user has to wait, so throttling can also be implemented easily, simply by changing the queue size at run-time.
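For illustration only, such a fixed-size pool with back pressure could look roughly like this sketch (made-up names, ignoring message types, lifetimes and error handling):

// Fixed-size, preallocated request pool: get() blocks while nothing is free
// or the in-flight limit is reached, put() returns a request for reuse, and
// changing `limit` at run-time acts as a crude throttle.
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

template <typename Msg>
class FixedPool {
  std::mutex lock;
  std::condition_variable cond;
  std::vector<std::unique_ptr<Msg>> storage;  // owns all preallocated messages
  std::vector<Msg*> free_list;
  size_t limit;                               // how many may be in flight

public:
  explicit FixedPool(size_t n) : limit(n) {
    storage.reserve(n);
    free_list.reserve(n);
    for (size_t i = 0; i < n; i++) {
      storage.push_back(std::make_unique<Msg>()); // allocated once, up front
      free_list.push_back(storage.back().get());
    }
  }

  // back pressure: the caller sleeps until a preallocated message is free
  Msg *get() {
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] {
      return !free_list.empty() &&
             storage.size() - free_list.size() < limit;
    });
    Msg *m = free_list.back();
    free_list.pop_back();
    return m;
  }

  // called from the completion path instead of delete / put()
  void put(Msg *m) {
    std::lock_guard<std::mutex> l(lock);
    free_list.push_back(m);
    cond.notify_one();
  }

  // run-time throttling: a lower limit lets fewer messages be in flight
  void set_limit(size_t n) {
    std::lock_guard<std::mutex> l(lock);
    limit = n;
    cond.notify_all();
  }
};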
4. The next hacky patch [3] reduces the number of buffer::ptr::release() calls and sendmsg() syscalls by preparing one msghdr for all queued messages: instead of appending each part of a message to a temporary buffer, msghdr.msg_iov is filled in directly and sent to the kernel in one go. The queueing is also slightly modified: instead of putting a message into a vector and then erasing it on the dequeue side, I simply chain messages in a singly linked list, which does not require any memory allocations, and enqueue/dequeue lock a mutex only for a very short period of time.
before: IOPS=132k, BW=514MiB/s, Lat=0.973ms
after: IOPS=166k, BW=650MiB/s, Lat=0.769ms
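The sendmsg() part boils down to something like the following sketch (simplified; as noted below, error handling and short writes, where the iovec has to be advanced, are ignored):

// Batch all queued messages into one sendmsg() call by filling
// msghdr::msg_iov directly instead of copying parts into a temporary buffer
// and issuing a syscall per message.
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

struct OutMessage {
  struct iovec segments[3];    // e.g. header, payload, footer of one message
  int nr_segments = 0;
  OutMessage *next = nullptr;  // intrusive singly linked list, no allocations
};

// Send as much of the queued chain as fits into one msghdr with one syscall.
ssize_t send_queue(int sockfd, const OutMessage *head) {
  constexpr int kMaxIov = 64;  // a real implementation would respect IOV_MAX
  struct iovec iov[kMaxIov];
  int iovcnt = 0;

  for (const OutMessage *m = head;
       m && iovcnt + m->nr_segments <= kMaxIov; m = m->next) {
    for (int i = 0; i < m->nr_segments; i++)
      iov[iovcnt++] = m->segments[i];   // no data copy, only iovec setup
  }

  struct msghdr msg = {};
  msg.msg_iov = iov;
  msg.msg_iovlen = iovcnt;

  // one syscall for everything collected above; a short write would require
  // looping and advancing the iovec, which this sketch deliberately skips
  return sendmsg(sockfd, &msg, MSG_NOSIGNAL);
}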
I have to apologize to the most curious who will still open and look at the patches: I did not consider error handling or priority queues, and I also did not take care of sendmsg() returning fewer bytes than requested, in which case the iovec has to be advanced. As I mentioned, I wanted to get quick numbers on localhost, and for the sake of simplicity I put in asserts and prints to catch abnormal behavior (of course on localhost nothing terrible happens, as usual :)
I think you are definitely on the right path here.
At the end here is the summary table:
IOPS=80.1k, BW=313MiB/s, Lat=1.598ms -- empty config, no changes applied
IOPS=110k,  BW=429MiB/s, Lat=1.164ms -- disable crc for tcp/ip transport
IOPS=114k,  BW=444MiB/s, Lat=1.125ms -- disable throttle code paths
IOPS=132k,  BW=514MiB/s, Lat=0.973ms -- preallocate messages in a queue
IOPS=166k,  BW=650MiB/s, Lat=0.769ms -- reduce number of temporary buffer
                                        allocations, reduce number of
                                        sendmsg() syscalls, chain messages
I would like to discuss the necessity of such messenger improvements and possible steps forward.
Radoslaw will almost certainly be interested in all of this especially
as it relates to his work. I imagine Sage/Greg/Josh may have input
regarding correctness. We just had this week's community performance
meeting today, but perhaps you'd like to present your work in the
coming weeks there? Once it's ready, a formal review and run through
our QA suite would be additional steps we'd want to take before any
potential PRs are merged.
Yes, it would be nice to participate. Is this the link: https://pad.ceph.com/p/performance_weekly ?
--
Roman