Re: [RFC] Performance experiments with Ceph messenger

Hi Piotr,

On 2018-11-09 11:07, Piotr Dalek wrote:
> 1. As for CRC, your results are suspiciously skewed - you might want
> to check (with gdb) whether you're utilizing software-only path or
> hardware-accelerated one.

You are right: after installing yasm and having HAVE_GOOD_YASM_ELF64
defined, the following function is executed:

    ceph_crc32c_intel_fast()
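
For reference, the fast path essentially boils down to the SSE 4.2 crc32
instruction (Castagnoli polynomial, the same CRC32C that ceph_crc32c
computes).  A standalone toy sketch of that idea -- my own code, not
Ceph's implementation -- looks like:

    // crc32c_hw_sketch.cc -- illustration only, not Ceph code.
    // Build: g++ -O2 -msse4.2 crc32c_hw_sketch.cc
    #include <nmmintrin.h>   // SSE 4.2 crc32 intrinsics
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Hardware CRC32C over a buffer, 8 bytes per crc32 instruction.
    static uint32_t crc32c_hw(uint32_t crc, const uint8_t *p, size_t len)
    {
        uint64_t c = crc;
        while (len >= 8) {
            uint64_t v;
            memcpy(&v, p, 8);
            c = _mm_crc32_u64(c, v);
            p += 8;
            len -= 8;
        }
        while (len--)
            c = _mm_crc32_u8(static_cast<uint32_t>(c), *p++);
        return static_cast<uint32_t>(c);
    }

    int main()
    {
        uint8_t buf[4096];
        memset(buf, 0xab, sizeof(buf));
        printf("crc32c = 0x%08x\n", crc32c_hw(0, buf, sizeof(buf)));
        return 0;
    }

The non-accelerated fallback is table-driven, which is mostly where the
gap between "default" and "fast" in the numbers below comes from.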

The difference shows up mostly at high IOPS. On latest master fio
outputs:

   default crc32: IOPS=80.1k, BW=313MiB/s, Lat=1.598ms
      fast crc32: IOPS=104k,  BW=405MiB/s, Lat=1.235ms
        no crc32: IOPS=110k,  BW=429MiB/s, Lat=1.164ms

But at higher IOPS (with the changes described here) the difference
becomes significant:

      fast crc32: IOPS=146k, BW=570MiB/s, Lat=0.877ms
        no crc32: IOPS=166k, BW=650MiB/s, Lat=0.769ms

So at higher IOPS any small delay on the way from IO submission to the
actual syscall pulls performance down.
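
To put rough numbers on that: 146k IOPS means about 1s/146000 ~= 6.8us
of service budget per IO, while 166k IOPS means ~= 6.0us, so less than a
microsecond of extra per-IO work already accounts for the whole gap
between the two runs above.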


> Besides, in real workloads, it's less of an
> issue because CRC for payload is calculated by different code segments
> and once-calculated payload CRC may be reused later by messenger
> (that's why there's CRC cache in bufferlist code). That also explains
> why on bluestore disabling crc for messenger doesn't give as much
> improvement as on filestore -- assuming bluestore is configured to
> calculate CRC32c checksums for on-disk data.

Good to know, thanks.
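
If I understand the caching idea correctly, it is roughly this pattern:
attach the once-computed CRC for a byte range to the buffer itself and
let later users (the messenger) look it up instead of recomputing.  A
toy sketch of that pattern -- names are mine, this is not the actual
bufferlist interface:

    // crc_cache_sketch.cc -- "compute once, reuse later"; not Ceph code.
    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <utility>

    struct CrcCachingBuffer {
        const uint8_t *data;
        size_t len;
        // (offset, length) -> CRC computed earlier by some other path
        std::map<std::pair<size_t, size_t>, uint32_t> crc_cache;

        uint32_t crc32c(size_t off, size_t n,
                        uint32_t (*calc)(const uint8_t *, size_t)) {
            auto key = std::make_pair(off, n);
            auto it = crc_cache.find(key);
            if (it != crc_cache.end())
                return it->second;               // messenger reuses it
            uint32_t crc = calc(data + off, n);  // pay the cost only once
            crc_cache[key] = crc;
            return crc;
        }
    };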

> 2. Perf counters (or rather - the amount of it) is a real issue, so
> far nobody decided to do any kind of sanity audit and thus it's
> unclear if all of them are really necessary/useful. Usually when
> someone decides to add them, they just issue a PR and soon they're
> included without any further discussion. For sure some of counters at
> least shouldn't be enabled in production-class binaries.
> 3 and 4. You're absolutely on point here - preallocating and reusing
> data structures is the way to go. Back in 2015 everyone agreed that
> Ceph has really bad memory management - lots of allocations,
> deallocations and memory block copies - across different paths. The
> only "big" fixes that people came up with was increasing TCMalloc
> cache to 128MB (up from default 32MB) and replacing TCMalloc with
> Jemalloc which with recent releases is impossible
> (http://tracker.ceph.com/issues/20557) - fixing the code to utilize
> proper memory management strategies is difficult in case of Ceph
> because of the Bufferlist class that makes it easy to do complex
> things, but not necessarily in the high-performance way.

What I've noticed in 'perf report' output is a huge amount of reference
increases/decreases for buffer::ptr objects.  So allocations and copies
are not the only issue; there is also a huge number of atomic operations
on hot paths.
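
Roughly what I mean in code, with std::shared_ptr standing in for
buffer::ptr (not the same class, just the same refcounting pattern):

    // refcount_hotpath_sketch.cc -- not Ceph code, generic illustration.
    #include <cstdint>
    #include <memory>
    #include <vector>

    struct Chunk { std::vector<uint8_t> bytes; };

    // By-value parameter: atomic increment on entry, atomic decrement on
    // return -- two atomic ops per call on the hot path.
    uint64_t sum_copy(std::shared_ptr<Chunk> c) {
        uint64_t s = 0;
        for (uint8_t b : c->bytes) s += b;
        return s;
    }

    // Const reference: no refcount traffic at all.
    uint64_t sum_ref(const std::shared_ptr<Chunk> &c) {
        uint64_t s = 0;
        for (uint8_t b : c->bytes) s += b;
        return s;
    }

    int main() {
        auto c = std::make_shared<Chunk>();
        c->bytes.resize(4096, 1);
        uint64_t total = 0;
        for (int i = 0; i < 1000000; ++i)
            total += sum_ref(c);   // sum_copy(c) would add 2M atomic ops
        return static_cast<int>(total & 0xff);
    }

At a few hundred thousand messages per second those atomics add up
quickly.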

> Over a year ago Red Hat started to rewrite large parts of Ceph code to
> utilize Seastar framework which gave hope for the above to change. But
> it's unclear whether it'll be the case, what performance gains should
> users expect and whether it'll require users to redesign their
> clusters or other costly labor. On the other hand, such large rewrite
> puts a huge question mark on any performance improvement work as
> nobody can tell you with 100% certainty that your work won't be
> dropped during the transition to Seastar.

Did anyone try approximate tests, putting Seastar in place of msg/async?
It should not take a huge amount of work before first results can shed
light and answer the question: is it worth doing or not?  Seems like a
couple of weeks, no?

But even if Seastar replaces all messenger internals, it won't replace
bufferlist and the whole message allocation strategy without deep
refactoring.  Or am I mistaken here?  What I mean is that this can
remain a bottleneck even with an imaginary zero-latency IO library.
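
By preallocating and reusing structures I mean nothing fancier than
recycling objects from a free list instead of going to the allocator on
every IO.  A generic sketch (not tied to Message or bufferlist):

    // pool_sketch.cc -- recycle buffers instead of per-IO allocation;
    // generic illustration, not Ceph's Message lifecycle.
    #include <memory>
    #include <vector>

    template <typename T>
    class ObjectPool {
        std::vector<std::unique_ptr<T>> free_;
    public:
        std::unique_ptr<T> get() {
            if (!free_.empty()) {
                auto obj = std::move(free_.back());   // reuse, no malloc
                free_.pop_back();
                return obj;
            }
            return std::make_unique<T>();             // cold path only
        }
        void put(std::unique_ptr<T> obj) {
            free_.push_back(std::move(obj));          // recycle for next IO
        }
    };

    struct IoBuffer { std::vector<char> data = std::vector<char>(64 * 1024); };

    int main() {
        ObjectPool<IoBuffer> pool;
        for (int i = 0; i < 1000; ++i) {
            auto buf = pool.get();    // after warm-up: zero allocations
            // ... fill buf->data, hand it to the wire, then:
            pool.put(std::move(buf));
        }
        return 0;
    }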

> Besides, careful, optimal
> buffer handling is difficult *especially* in complex multi-threaded
> software like Ceph, and it's way less fun than, for example, using
> tons of language features to get rid of integer divisions.

I just want to minimize obviously costly things like allocations, atomic
ops and sendmsg() syscalls - no rocket science here.  That especially
makes sense for the RDMA transport layer, which in my fio tests does not
show any performance gain, because for each IO the CPU is busy doing a
lot of other things before the request reaches the actual hardware.

--
Roman



