Hi all,
I would like to share some performance experiments based on the new fio
engine [1], which tests the bare Ceph messenger without any disk IO or
libraries involved: only the messenger and subsidiary classes from
src/msg/async/*.
First, I should say that I am completely new to Ceph and started
investigating the performance numbers without any deep understanding of
the messenger framework. The idea was to use a profiler and then apply
quick, hacky fixes in order to squeeze everything out of bandwidth and
latency and send messages as fast as possible.
Without any changes applied, on the latest master, with only one line in
ceph.conf (ms_type=async+posix) and with the following fio config:
bs=4k
size=3g
iodepth=128
I got reference numbers on a loopback connection with the async+posix
messenger:
IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
I have to mention that at this point of the measurements the fio engine
allocates and deallocates an MOSDOp and an MOSDOpReply for each IO,
repeating the behavior of other Ceph components. A bit later I will
provide fio numbers for the case when all messages (requests and
replies) are cached, i.e. the whole queue of fixed size is preallocated.
1. The first profiling run showed that a lot of time is spent in crc
calculation for data and header. The following lines were added to
ceph.conf:
ms_crc_data=false
ms_crc_header=false
And these are the fio numbers after the new config is applied:
before: IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
after: IOPS=110k, BW=429MiB/s, Lat=1.164ms
My first question: what is the reason to calculate and then check crc
for messages which are sent over a reliable network? More than 100MiB/s
and ~30k IOPS is quite a high price for making an already reliable
connection "more reliable".
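
To make the per-message cost concrete, here is a minimal sketch of a
checksum pass, assuming the checksum in question is CRC-32C (Castagnoli).
It is the textbook bitwise variant, not Ceph's implementation, which I
would expect to be table-driven or hardware-accelerated; the function
name and signature are mine. The point is only that every byte of the
header and the data is touched once on send and once more on receive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Textbook bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Illustration only: the work is linear in the number of bytes checked.
// Passing a previous result as `crc` continues the checksum over more data.
inline uint32_t crc32c(const void* data, std::size_t len, uint32_t crc = 0) {
    assert(data != nullptr || len == 0);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    crc = ~crc;                      // standard initial inversion
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; xor in the polynomial if the dropped bit was 1.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;                     // standard final inversion
}
```

With bs=4k, a pass like this runs over at least 4096 payload bytes per
message on each side of the connection.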
2. The other thing which can be improved is throttling: by default,
Throttle::get_or_fail() is called on each IO, which invokes
pthread_mutex_lock() and PerfCounters::inc(). Since there is no
contention on the mutex, I suspect CPU cache misses are the cost. When
the following line is added to ceph.conf:
ms_dispatch_throttle_bytes=0
Fio shows these numbers:
before: IOPS=110k, BW=429MiB/s, Lat=1.164ms
after: IOPS=114k, BW=444MiB/s, Lat=1.125ms
And the following is the output of `perf stat`:
Before:
      13057.583388  task-clock:u (msec)      #    1.609 CPUs utilized
    26,042,061,515  cycles:u                 #    1.994 GHz                      (57.77%)
    40,643,744,150  instructions:u           #    1.56  insn per cycle           (71.79%)
       815,662,912  cache-references:u       #   62.467 M/sec                    (71.64%)
        12,926,237  cache-misses:u           #    1.585 % of all cache refs      (70.93%)
    12,695,706,281  L1-dcache-loads:u        #  972.286 M/sec                    (70.87%)
       455,625,889  L1-dcache-load-misses:u  #    3.59% of all L1-dcache hits    (71.30%)
     8,263,315,484  L1-dcache-stores:u       #  632.837 M/sec                    (57.49%)
After:
      12516.889311  task-clock:u (msec)      #    1.631 CPUs utilized
    24,987,047,978  cycles:u                 #    1.996 GHz                      (57.01%)
    40,072,709,633  instructions:u           #    1.60  insn per cycle           (71.49%)
       792,468,416  cache-references:u       #   63.312 M/sec                    (70.94%)
         8,494,440  cache-misses:u           #    1.072 % of all cache refs      (71.60%)
    12,424,744,615  L1-dcache-loads:u        #  992.638 M/sec                    (71.82%)
       438,946,415  L1-dcache-load-misses:u  #    3.53% of all L1-dcache hits    (71.39%)
     8,199,282,875  L1-dcache-stores:u       #  655.058 M/sec                    (57.24%)
Cache misses drop (12.9M to 8.5M, i.e. 1.585% to 1.072% of all cache
references), cache references drop slightly as well, and the rate of
L1-dcache loads increases: 992.638 M/sec against 972.286 M/sec.
This performance drop caused by throttling does not depend on the actual
value of the ms_dispatch_throttle_bytes config option, because the
fio_ceph_messenger engine uses only the fast dispatch path, so even
setting ms_dispatch_throttle_bytes=9999999999 has no visible effect. But
when the option is set to 0, execution follows another path and we
return immediately from Throttle::get_or_fail() without any attempt to
take locks or atomically increase counter values. A possible future fix
which could increase overall Ceph performance is to keep perf counters
in thread-local storage and thus avoid atomic ops.
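
The thread-local counter idea can be sketched roughly as follows. This
is not PerfCounters code; all names here (the sketch namespace, Slot,
inc, total, count_in_parallel) are hypothetical. Each thread increments
its own cache-line-aligned slot with a relaxed load and store, so the
fast path has no lock and no atomic read-modify-write; a reader sums all
slots and may see a slightly stale value while writers are running.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

namespace sketch {

// One cache-line-aligned slot per thread, so writers never share a line.
struct alignas(64) Slot {
    std::atomic<uint64_t> value{0};
};

static std::mutex g_slots_mu;                       // guards g_slots only
static std::vector<std::unique_ptr<Slot>> g_slots;  // slots outlive threads

// Registers the calling thread's slot on first use (the only locked step).
inline Slot& local_slot() {
    thread_local Slot* slot = [] {
        auto owned = std::make_unique<Slot>();
        Slot* raw = owned.get();
        std::lock_guard<std::mutex> guard(g_slots_mu);
        g_slots.push_back(std::move(owned));
        return raw;
    }();
    return *slot;
}

// Fast path: only the owning thread writes its slot, so a relaxed
// load + store is enough; no atomic read-modify-write is issued.
inline void inc(uint64_t n = 1) {
    std::atomic<uint64_t>& v = local_slot().value;
    v.store(v.load(std::memory_order_relaxed) + n, std::memory_order_relaxed);
}

// Slow path: sum every thread's slot under the registration lock.
inline uint64_t total() {
    std::lock_guard<std::mutex> guard(g_slots_mu);
    uint64_t sum = 0;
    for (const auto& s : g_slots) sum += s->value.load(std::memory_order_relaxed);
    return sum;
}

// Small driver: run `threads` workers and return how much was counted.
inline uint64_t count_in_parallel(unsigned threads, uint64_t incs_per_thread) {
    assert(threads > 0);
    const uint64_t before = total();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t)
        workers.emplace_back([incs_per_thread] {
            for (uint64_t i = 0; i < incs_per_thread; ++i) inc();
        });
    for (auto& w : workers) w.join();
    return total() - before;
}

}  // namespace sketch
```

The trade-off is that reads become more expensive and only approximate
under concurrency, which seems acceptable for statistics counters.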
3. As I mentioned earlier, it is worth reducing the allocation/
deallocation rate of messages by preallocating them. I did a small patch
[2] that preallocates the whole queue on the client and server sides of
the fio engine. Another thing worth mentioning is that on message
completion I do not call put(), but call a new completion callback (the
completion hook does not fit well, since it is called in the destructor,
i.e. again extra allocations/deletions for each IO). It is always a good
thing to reduce atomic incs/decs on a fast path, and since I fully
control a message and never free it, I do not need a rather expensive
atomic inc/dec on each IO. This is what I got:
before: IOPS=114k, BW=444MiB/s, Lat=1.125ms
after: IOPS=132k, BW=514MiB/s, Lat=0.973ms
In the current ceph master branch sizeof(MOSDOp) is 824 bytes and
sizeof(MOSDOpReply) is 848 bytes, so these objects are certainly worth
keeping in a fixed-size queue. One possible way is to keep a queue of a
union of all messages inside the messenger itself, so that each user of
the messenger has to ask for a free request. A fixed-size queue also
implies back pressure: if no free requests exist in the queue, the user
has to wait. Throttling can then be implemented easily by changing the
queue size at run-time.
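
A minimal sketch of such a fixed-size queue, assuming a single slot type
large enough for the biggest message. Request, RequestPool and their
methods are hypothetical names, not part of the messenger or of patch
[2]. Callers block in acquire() when nothing is free (back pressure),
and set_limit() changes the number of requests allowed in flight at
run-time (throttling).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical slot big enough for the largest message type
// (848 bytes is the sizeof(MOSDOpReply) quoted above).
struct Request {
    alignas(std::max_align_t) unsigned char storage[848];
    Request* next_free = nullptr;
};

class RequestPool {
public:
    explicit RequestPool(std::size_t capacity)
        : slots_(capacity), limit_(capacity) {
        // All memory is allocated once, here; chain every slot as free.
        for (Request& r : slots_) { r.next_free = free_; free_ = &r; }
    }

    // Blocks until a request is free and the in-flight limit allows it.
    Request* acquire() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return available(); });
        return pop();
    }

    // Non-blocking variant: returns nullptr instead of waiting.
    Request* try_acquire() {
        std::lock_guard<std::mutex> lock(mu_);
        return available() ? pop() : nullptr;
    }

    // Returns a request to the pool and wakes one waiter.
    void release(Request* r) {
        assert(r != nullptr);
        {
            std::lock_guard<std::mutex> lock(mu_);
            r->next_free = free_;
            free_ = r;
            --in_flight_;
        }
        cv_.notify_one();
    }

    // Run-time throttle: at most `limit` requests in flight (<= capacity).
    void set_limit(std::size_t limit) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            limit_ = limit < slots_.size() ? limit : slots_.size();
        }
        cv_.notify_all();
    }

private:
    bool available() const { return free_ != nullptr && in_flight_ < limit_; }
    Request* pop() {
        Request* r = free_;
        free_ = r->next_free;
        ++in_flight_;
        return r;
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<Request> slots_;   // preallocated, never resized
    Request* free_ = nullptr;      // LIFO free list threaded through slots
    std::size_t limit_;
    std::size_t in_flight_ = 0;
};
```

A LIFO free list is used so that the most recently released request,
which is the most likely to still be in cache, is handed out first.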
4. The next hacky patch [3] reduces the number of
buffer::ptr::release() calls and sendmsg() syscalls by preparing one
msghdr for all queued messages. Instead of appending each part of a
message to temporary buffers, it fills in msghdr.msg_iov directly and
sends everything to the kernel in one go. The queueing is also slightly
modified: instead of putting a message into a vector and then erasing it
on the dequeue side, I simply chain messages in a singly linked list,
which does not require any memory allocations, so enqueue/dequeue holds
the mutex for a very short period of time.
before: IOPS=132k, BW=514MiB/s, Lat=0.973ms
after: IOPS=166k, BW=650MiB/s, Lat=0.769ms
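
The two ideas in point 4 can be sketched as follows. This is not the
code from patch [3]; Msg, MsgQueue, fill_iov and send_chain are
hypothetical names, and the 64-entry iovec array is an arbitrary choice
for the example. Messages are chained through an intrusive next pointer,
the whole chain is detached under the lock in one step, and every header
and data part is pointed to from one iovec array that goes to the kernel
in a single sendmsg() call.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical message: two already-encoded parts plus an intrusive link.
struct Msg {
    const void* header;
    std::size_t header_len;
    const void* data;
    std::size_t data_len;
    Msg* next = nullptr;
};

// Intrusive FIFO: enqueue and dequeue never allocate, so the mutex is
// held only for a few pointer updates.
class MsgQueue {
public:
    void enqueue(Msg* m) {
        m->next = nullptr;
        std::lock_guard<std::mutex> lock(mu_);
        if (tail_) tail_->next = m; else head_ = m;
        tail_ = m;
    }
    // Detaches and returns the whole chain (nullptr if empty).
    Msg* dequeue_all() {
        std::lock_guard<std::mutex> lock(mu_);
        Msg* chain = head_;
        head_ = tail_ = nullptr;
        return chain;
    }
private:
    std::mutex mu_;
    Msg* head_ = nullptr;
    Msg* tail_ = nullptr;
};

// Points iov entries straight at message memory (no temporary buffers).
// Uses two entries per message; stops when fewer than two are left.
inline std::size_t fill_iov(const Msg* chain, iovec* iov, std::size_t max_iov) {
    assert(iov != nullptr);
    std::size_t n = 0;
    for (const Msg* m = chain; m != nullptr && n + 2 <= max_iov; m = m->next) {
        iov[n].iov_base = const_cast<void*>(m->header);
        iov[n].iov_len = m->header_len;
        ++n;
        iov[n].iov_base = const_cast<void*>(m->data);
        iov[n].iov_len = m->data_len;
        ++n;
    }
    return n;
}

// One syscall for up to 32 chained messages. Returns what sendmsg()
// returns; the caller still has to handle short writes and errors.
inline ssize_t send_chain(int fd, const Msg* chain) {
    iovec iov[64];
    msghdr mh{};
    mh.msg_iov = iov;
    mh.msg_iovlen = fill_iov(chain, iov, 64);
    return ::sendmsg(fd, &mh, 0);
}
```

A real implementation would also have to respect IOV_MAX and keep the
messages alive until the kernel has consumed all their bytes.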
I have to apologize to the most curious, who will still open and look at
the patches: I did not consider error handling or priority queues, and I
did not handle the case where sendmsg() sends fewer bytes than requested
and the iovec has to be advanced. As I mentioned, I wanted to get quick
numbers on localhost, and for the sake of simplicity I put in asserts
and prints to catch abnormal behavior (of course on localhost nothing
terrible happens, as usual :)
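
For reference, the missing short-write handling could look roughly like
this. It is a sketch only, assuming a blocking socket; advance_iov and
send_all are hypothetical helpers, not functions from the patches or
from Ceph. After sendmsg() reports how many bytes were accepted, fully
sent iovec entries are skipped and the first partially sent entry is
trimmed before retrying.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#include <cassert>
#include <cerrno>
#include <cstddef>

// Consumes `sent` bytes from the front of iov[0..iovcnt). Returns how many
// entries were fully consumed; a partially consumed entry is trimmed in
// place so that it describes only the bytes still to be sent.
inline std::size_t advance_iov(iovec* iov, std::size_t iovcnt, std::size_t sent) {
    assert(iov != nullptr || iovcnt == 0);
    std::size_t i = 0;
    while (i < iovcnt && sent >= iov[i].iov_len) {
        sent -= iov[i].iov_len;
        ++i;
    }
    if (i < iovcnt && sent > 0) {
        iov[i].iov_base = static_cast<char*>(iov[i].iov_base) + sent;
        iov[i].iov_len -= sent;
    }
    return i;
}

// Keeps calling sendmsg() until every byte is sent. Returns false on any
// error other than EINTR (so EAGAIN on a non-blocking socket also fails).
inline bool send_all(int fd, iovec* iov, std::size_t iovcnt) {
    std::size_t first = 0;
    while (first < iovcnt) {
        msghdr mh{};
        mh.msg_iov = iov + first;
        mh.msg_iovlen = iovcnt - first;
        ssize_t r = ::sendmsg(fd, &mh, 0);
        if (r < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        first += advance_iov(iov + first, iovcnt - first,
                             static_cast<std::size_t>(r));
    }
    return true;
}
```

Note that send_all() modifies the caller's iovec array, so the array
cannot be reused afterwards without being refilled.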
At the end here is the summary table:
IOPS=80.1k, BW=313MiB/s, Lat=1.598ms -- empty config, no changes applied
IOPS=110k,  BW=429MiB/s, Lat=1.164ms -- disable crc for tcp/ip transport
IOPS=114k,  BW=444MiB/s, Lat=1.125ms -- disable throttle code paths
IOPS=132k,  BW=514MiB/s, Lat=0.973ms -- preallocate messages in a queue
IOPS=166k,  BW=650MiB/s, Lat=0.769ms -- reduce number of temporary
                                        buffer allocations, reduce number
                                        of sendmsg() syscalls, chain
                                        messages
I would like to discuss whether such messenger improvements are needed
and what the possible steps forward are.
Thanks.
[1] https://github.com/ceph/ceph/pull/24678
[2] https://github.com/rouming/ceph/commit/3e34d9271ae
[3] https://github.com/rouming/ceph/commit/90831f241cb
--
Roman