Hi Gregory,

On 2018-11-09 22:23, Gregory Farnum wrote:
> On Thu, Nov 8, 2018 at 7:55 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>> Hi all,
>>
>> I would like to share some performance experiments based on the new fio
>> engine [1], which tests the bare Ceph messenger without any disk IO or
>> libraries involved, only the messenger and subsidiary classes from
>> src/msg/async/*.
>>
>> Firstly I would like to say that I am completely new to Ceph and started
>> investigating performance numbers without any clue or deep understanding
>> of the messenger framework; the idea was to use a profiler and then apply
>> hacky quick fixes in order to squeeze everything from bandwidth or
>> latency and send messages as fast as possible.
>>
>> Without any changes applied, based on latest master, with only one line
>> in ceph.conf:
>>
>>   ms_type=async+posix
>>
>> and with the following fio config:
>>
>>   bs=4k
>>   size=3g
>>   iodepth=128
>
> The FIO messenger test is interesting, but I think the few tests we've
> seen out of it have been pretty out of whack with other existing tests;
> Ricardo might know more.
>
>> I got reference numbers on a loopback connection with the async+posix
>> messenger:
>>
>>   IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
>>
>> I have to mention that at this point of the measurements the fio engine
>> allocates and deallocates MOSDOp and MOSDOpReply for each IO, repeating
>> the behavior of other Ceph components. A bit later I will provide fio
>> numbers when all messages (requests and replies) are cached, i.e. the
>> whole queue of fixed size is preallocated.
>>
>> 1. First profiling showed that a lot of time is spent in crc calculation
>> for data and header. The following lines are added to ceph.conf:
>>
>>   ms_crc_data=false
>>   ms_crc_header=false
>>
>> And these are the fio numbers after the new config is applied:
>>
>>   before: IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
>>   after:  IOPS=110k,  BW=429MiB/s, Lat=1.164ms
>>
>> My first question: what is the reason to calculate and then check crc
>> for messages which are sent over a reliable network? ~100MB/s and ~30k
>> IOPS is quite a high price for making an already reliable connection
>> "more reliable".
>
> We have seen network errors pretty consistently when we disable our own
> internal CRCs. I'm not sure if we collect numbers on the failures we see
> in practical use, but without data this is not a configuration option we
> can change for real users. :(
Could it be that crc32 is disabled at the Ethernet link layer or on the NICs in the networks where you saw these errors? Otherwise this is weird.
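(As an aside, for anyone who wants to reproduce the numbers above: the settings quoted so far boil down to roughly the following. This is only a sketch; the ioengine line is a placeholder for the engine from [1], and a real job file may need a few more options.)

  # ceph.conf
  [global]
      ms_type = async+posix
      # experiment 1: disable messenger checksums
      ms_crc_data = false
      ms_crc_header = false

  # fio job file (sketch)
  [messenger-test]
      ioengine = <engine from [1]>   # placeholder
      bs = 4k
      size = 3g
      iodepth = 128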
>> 2. The other thing which can be improved is throttling: by default, on
>> each IO Throttle::get_or_fail() is called, which invokes
>> pthread_mutex_lock() and PerfCounters::inc(). Since there is no
>> contention on the mutex, I suspect CPU cache misses. When the following
>> line is applied to ceph.conf:
>>
>>   ms_dispatch_throttle_bytes=0
>>
>> fio shows these numbers:
>>
>>   before: IOPS=110k, BW=429MiB/s, Lat=1.164ms
>>   after:  IOPS=114k, BW=444MiB/s, Lat=1.125ms
>>
>> And the following is the output of `perf stat`:
>>
>> Before:
>>
>>    13057.583388  task-clock:u (msec)      #   1.609 CPUs utilized
>>  26,042,061,515  cycles:u                 #   1.994 GHz                    (57.77%)
>>  40,643,744,150  instructions:u           #   1.56  insn per cycle         (71.79%)
>>     815,662,912  cache-references:u       #  62.467 M/sec                  (71.64%)
>>      12,926,237  cache-misses:u           #   1.585 % of all cache refs    (70.93%)
>>  12,695,706,281  L1-dcache-loads:u        # 972.286 M/sec                  (70.87%)
>>     455,625,889  L1-dcache-load-misses:u  #   3.59% of all L1-dcache hits  (71.30%)
>>   8,263,315,484  L1-dcache-stores:u       # 632.837 M/sec                  (57.49%)
>>
>> After:
>>
>>    12516.889311  task-clock:u (msec)      #   1.631 CPUs utilized
>>  24,987,047,978  cycles:u                 #   1.996 GHz                    (57.01%)
>>  40,072,709,633  instructions:u           #   1.60  insn per cycle         (71.49%)
>>     792,468,416  cache-references:u       #  63.312 M/sec                  (70.94%)
>>       8,494,440  cache-misses:u           #   1.072 % of all cache refs    (71.60%)
>>  12,424,744,615  L1-dcache-loads:u        # 992.638 M/sec                  (71.82%)
>>     438,946,415  L1-dcache-load-misses:u  #   3.53% of all L1-dcache hits  (71.39%)
>>   8,199,282,875  L1-dcache-stores:u       # 655.058 M/sec                  (57.24%)
>>
>> Overall cache-misses are slightly reduced along with cache-references,
>> thus the rate of dcache-loads is increased: 992.638 M/sec against
>> 972.286 M/sec.
>>
>> This performance drop on throttling does not depend on the actual value
>> set for the ms_dispatch_throttle_bytes config option, because the
>> fio_ceph_messenger engine uses only the fast dispatch path, so even if
>> ms_dispatch_throttle_bytes=9999999999 is set this does not bring any
>> visible effect. But when the option is set to 0, execution follows
>> another path and we immediately return from Throttle::get_or_fail()
>> without any attempt to take locks or atomically increase counter values.
>> A future fix which could increase overall Ceph performance is to keep
>> perf counters in thread-local storage, thus avoiding atomic ops.
>
> Improving the perfcounters when they cause issues is definitely something
> worth exploring!
>
>> 3. As I mentioned earlier, it is worth reducing the allocation/
>> deallocation rate of messages by preallocating them. I did a small patch
>> [2] and preallocate the whole queue on the client and server sides of
>> the fio engine. The other thing worth mentioning is that on message
>> completion I do not call put(), but call a new completion callback (the
>> completion hook does not fit well, since it is called in the destructor,
>> i.e. again extra allocations/deletions for each IO). It is always a good
>> thing to reduce atomic incs/decs on a fast path, and since I fully
>> control a message and never free it I do not need a quite expensive
>> get()/put() on each IO. This is what I got:
>>
>>   before: IOPS=114k, BW=444MiB/s, Lat=1.125ms
>>   after:  IOPS=132k, BW=514MiB/s, Lat=0.973ms
>>
>> In the current ceph master branch sizeof(MOSDOp) is 824 bytes and
>> sizeof(MOSDOpReply) is 848 bytes, which of course is worth keeping in a
>> fixed-size queue.
>
> I am skeptical that this is actually a useful avenue. These messages will
> always be accompanied by, you know, actual data IO which has a good chance
> of dominating, and the relative numbers we need for different message
> "front" buffer types are going to vary.
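On the thread-local perfcounters idea quoted above, since you agree it is worth exploring: below is a minimal sketch of the direction I mean. The names are made up and this is not Ceph's PerfCounters API; the point is only that the hot path touches a plain thread-local integer and the shared atomic is updated once per batch.

  #include <atomic>
  #include <cstdint>

  // Hypothetical sketch, not Ceph's PerfCounters API: the fast path does a
  // plain store to a thread-local slot, and the value is folded into the
  // shared atomic only every FLUSH_EVERY increments, so the per-IO cost is
  // an uncontended, cache-friendly add with no locks and no atomic RMW.
  namespace tls_counter_sketch {

  inline std::atomic<uint64_t> g_msgs_dispatched{0};  // shared, read by stats

  inline void count_dispatch(uint64_t v = 1) {
    thread_local uint64_t pending = 0;    // private to the calling thread
    constexpr uint64_t FLUSH_EVERY = 1024;

    pending += v;                         // no lock, no atomic op
    if (pending >= FLUSH_EVERY) {
      g_msgs_dispatched.fetch_add(pending, std::memory_order_relaxed);
      pending = 0;
    }
    // A real implementation would also flush on thread exit and on demand,
    // otherwise up to FLUSH_EVERY - 1 increments stay invisible to readers.
  }

  // Readers see values as of the last flush, which is fine for statistics.
  inline uint64_t read_dispatch_count() {
    return g_msgs_dispatched.load(std::memory_order_relaxed);
  }

  }  // namespace tls_counter_sketch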
Regarding your skepticism about preallocation: I did not get it. I am talking about the regular allocation of messages which are sent and then immediately freed on the first put(). My question was: will it be efficient if I simply preallocate the whole queue? Yes, it can bring ~70MB/s. And by "queue" I do not mean data IO, I mean objects, messages, like MOSDOp or MOSDOpReply or any other containers which hold decoded/encoded data/payload and other buffers.

Of course on each IO you have to pick up a free message from the queue and init it with a data pointer, which comes from a layer above, i.e. from a user of the messenger. For example, the following is a chunk of code from the fio engine:

  req_msg = static_cast<MOSDOp *>(io->msg);

  /* No handy method to clear ops before reusage? Ok */
  req_msg->ops.clear();
  req_msg->write(0, io_u->buflen, buflist);

  /* Keep message alive */
  req_msg->get();
  req_msg->get_connection()->send_message(req_msg);

where req_msg is never destroyed; it always exists in a queue. In this example I use only one type of message, MOSDOp, but it can easily be changed to cover all types of messages: just construct/destruct any message on preallocated memory from the queue and you save a new/delete cycle of ~1k bytes per IO.
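To make the queue idea more concrete (including the back-pressure behaviour I describe below), here is a minimal sketch of such a fixed-size pool. The names and structure are hypothetical and simplified; this is not the actual code from [2], and real messages would also need ref-counting and proper re-initialization:

  #include <condition_variable>
  #include <cstddef>
  #include <memory>
  #include <mutex>
  #include <vector>

  // T stands in for MOSDOp/MOSDOpReply (or a variant covering all message
  // types). All objects are allocated once, up front.
  template <typename T>
  class MsgPool {
    std::vector<std::unique_ptr<T>> storage;  // owns the preallocated objects
    std::vector<T*> free_list;                // messages not currently in flight
    std::mutex lock;
    std::condition_variable not_empty;

  public:
    explicit MsgPool(size_t depth) {
      storage.reserve(depth);
      free_list.reserve(depth);
      for (size_t i = 0; i < depth; i++) {
        storage.push_back(std::make_unique<T>());
        free_list.push_back(storage.back().get());
      }
    }

    // Back pressure: if every message is in flight, the caller blocks here
    // until one is returned, so the pool depth caps the queue depth.
    T* acquire() {
      std::unique_lock<std::mutex> l(lock);
      not_empty.wait(l, [this] { return !free_list.empty(); });
      T* msg = free_list.back();
      free_list.pop_back();
      return msg;
    }

    // Called from the completion callback instead of dropping the last
    // reference: the object is recycled, never destroyed.
    void release(T* msg) {
      std::lock_guard<std::mutex> l(lock);
      free_list.push_back(msg);
      not_empty.notify_one();
    }
  };

On each IO one would acquire() a message, reset its ops and set the data pointer (as in the req_msg chunk above), send it, and release() it from the completion callback.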
>> One of the possible ways is to keep a queue of a union of all messages
>> inside the messenger itself, and each user of the messenger has to ask
>> for a free request. A fixed-size queue also implies back pressure: if no
>> free requests exist in the queue the user has to wait, thus throttling
>> can also be easily implemented simply by changing the queue size at
>> run-time.
>
> Nope nope nope. This is not a realistic option given the interdependencies
> messages can have.
Sorry, I don't understand what exactly is not realistic here. Could you please be a bit more specific?

--
Roman