Re: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue

On 12/03/2014 05:41 AM, Chaitanya Huilgol wrote:
Hi All,

We are seeing large read performance variations across RBD clients on different pools. Below is a summary of our findings:

- The first client to start I/O after a cluster restart (ceph start/stop on all OSD nodes) gets the best performance
- Clients started later exhibit 40% to 70% degraded performance. This is seen even when the first client's I/O is stopped before the second client's I/O is started
- Adding performance counters showed a large increase in latency (up to 3x) spread across the entire path, with no single point of increased latency
- On further investigation, we root-caused this to a degradation in tcmalloc performance that induces large latency across the entire path
- The variation grows as we increase the number of op worker shards; with fewer shards the variation is smaller, but this results in more lock contention and is not a good option for SSD-based clusters
- The variation is observed even when the RBD images have not been written to at all, indicating that this is not a filesystem issue

Below is a snippet of perf top output for the two runs:

(1)    tcmalloc - Client 1
   2.68%  ceph-osd                 [.] crush_hash32_3
   2.65%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
   1.66%  [kernel]                 [k] _raw_spin_lock
   1.56%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
   1.51%  libtcmalloc.so.4.1.2     [.] operator delete(void*)

(2)    tcmalloc - Client 2 (note the significant increase in tcmalloc's internal free-to-central-list code paths)

  14.75%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::FetchFromSpans()
   7.46%  libtcmalloc.so.4.1.2     [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
   6.71%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
   1.68%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
   1.57%  ceph-osd                 [.] crush_hash32_3

Tying it all together, it looks like new client I/O on a different pool changes how the OSD shards are used, which in turn causes memory to move between the thread-local caches and the central free lists.
Increasing the tcmalloc thread cache limit with 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' alleviates the issue in our test setups. However, this is only a temporary workaround, and it also bloats OSD memory usage.
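
For reference, the same limit can also be raised from within the process through gperftools' MallocExtension interface. Below is a minimal standalone sketch (not Ceph code; the 128 MB value is only an illustration) that assumes gperftools' malloc_extension.h is available and the binary is linked with -ltcmalloc:

    // Sketch: bump tcmalloc's total thread cache limit at startup instead of
    // exporting TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in the environment.
    #include <gperftools/malloc_extension.h>
    #include <cstdio>

    int main() {
      size_t before = 0, after = 0;
      MallocExtension::instance()->GetNumericProperty(
          "tcmalloc.max_total_thread_cache_bytes", &before);

      // 128 MB is only an illustrative value; the right size depends on the
      // number of OSD worker threads and how much memory bloat is acceptable.
      MallocExtension::instance()->SetNumericProperty(
          "tcmalloc.max_total_thread_cache_bytes", 128ul * 1024 * 1024);

      MallocExtension::instance()->GetNumericProperty(
          "tcmalloc.max_total_thread_cache_bytes", &after);
      std::printf("thread cache limit: %zu -> %zu bytes\n", before, after);
      return 0;
    }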

We have also tested glibc malloc and jemalloc based builds, where this issue is not seen; both hold up well. Below is the perf output from those tests.

(3)    glibc malloc - any client - no significant change

   3.00%  libc-2.19.so         [.] _int_malloc
   2.65%  libc-2.19.so         [.] malloc
   2.47%  libc-2.19.so         [.] _int_free
   2.33%  ceph-osd             [.] crush_hash32_3
   1.63%  [kernel]             [k] _raw_spin_lock
   1.38%  libstdc++.so.6.0.19  [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)

(4)    jemalloc - any client - no significant change

   2.47%  ceph-osd                 [.] crush_hash32_3
   2.25%  libjemalloc.so.1         [.] free
   2.07%  libc-2.19.so             [.] 0x0000000000081070
   1.95%  libjemalloc.so.1         [.] malloc
   1.65%  [kernel]                 [k] _raw_spin_lock
   1.60%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)

IMHO, we should probably look at the following for better performance with less variation:

- Add a jemalloc option for Ceph builds
- Look at ways to distribute PGs evenly across the shards - with a larger number of shards, some shards do not get exercised at all while others are overloaded (see the sketch after this list)
- Look at decreasing heap activity in the I/O path (Index Manager, Hash Index, LFN Index, etc.)
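
As a toy illustration of the shard distribution point (not the actual OSD sharding code - the shard and PG counts here are made up), mapping PG ids to op worker shards by hash/modulo shows how a small number of active PGs can leave some shards with no work at all:

    // Sketch: distribute hypothetical PG ids across op worker shards by
    // hash/modulo and print the resulting per-shard load.
    #include <cstdio>
    #include <functional>
    #include <vector>

    int main() {
      const unsigned num_shards = 25;   // e.g. a large osd_op_num_shards setting
      const unsigned num_pgs = 16;      // hypothetical small pool

      std::vector<unsigned> load(num_shards, 0);
      for (unsigned pg = 0; pg < num_pgs; ++pg) {
        // Stand-in for hashing the PG id; the real OSD mapping differs in detail.
        unsigned shard = std::hash<unsigned>()(pg) % num_shards;
        ++load[shard];
      }

      for (unsigned s = 0; s < num_shards; ++s)
        std::printf("shard %2u: %u PG(s)\n", s, load[s]);
      return 0;
    }

Depending on how the PG ids hash, some shards end up with multiple PGs while others receive none, which matches the uneven shard utilization we see.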

We can discuss this further in today's performance meeting.

This is a fantastic write-up, Chaitanya. Please add it to the performance meeting agenda.

FWIW, there are some interesting benchmarks and discussions of different allocators here:

http://www.percona.com/blog/2012/07/05/impact-of-memory-allocators-on-mysql-performance/
http://www.reddit.com/r/programming/comments/18zija/github_got_30_better_performance_using_tcmalloc/

I would definitely be in favor of at least exploring options other than tcmalloc.

Mark


Thanks,
Chaitanya




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




