RE: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue

Hi Dan,

I think the default value of 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' is 16M; we increased it to 128M (this needs further tuning). The heap stats from the OSD show only about 30M in the thread caches, though.
With the default setting we have seen performance drop by 70% on the second client, and with the tuning we have not seen this drop - I guess it may just be postponing the problem.
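
In case it is useful, the per-OSD heap stats can be pulled with something
like the following (osd.0 is just a placeholder ID):

    # ask tcmalloc inside the running OSD for its heap statistics;
    # the thread-cache line is where the ~30M figure above comes from
    ceph tell osd.0 heap stats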

Setting the env variable in ceph-osd.conf will do it.
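
If the OSDs are started via the stock upstart job, it amounts to a stanza
like the one below (134217728 bytes = 128M, the value we are running with
at the moment):

    # /etc/init/ceph-osd.conf (upstart job) - pass the tcmalloc
    # thread-cache limit to every ceph-osd process the job spawns
    env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

For a one-off test, prefixing the daemon invocation with
env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=... as you suggest should work
equally well, since tcmalloc picks the limit up from the environment at
startup.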

Regards,
Chaitanya

-----Original Message-----
From: Dan van der Ster [mailto:daniel.vanderster@xxxxxxx]
Sent: Wednesday, December 03, 2014 6:24 PM
To: Chaitanya Huilgol
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue

On Wed, Dec 3, 2014 at 12:41 PM, Chaitanya Huilgol <Chaitanya.Huilgol@xxxxxxxxxxx> wrote:
> Hi All,
>
> We are seeing large read performance variations across RBD clients on
> different pools. Below is a summary of our findings:
>
> - The first client starting I/O after a cluster restart (ceph start/stop
> on all OSD nodes) gets the best performance
> - Clients started later show 40% to 70% degraded performance. This is
> seen even when the first client's I/O is stopped before starting the
> second client's I/O
> - Adding performance counters showed a large (up to 3x) increase in
> latency across the entire path, with no single point of increased latency
> - On further investigation we root-caused this to a degradation in
> tcmalloc performance that induces large latency across the entire path
> - The variation grows as we increase the number of op worker shards;
> with fewer shards the variation is smaller, but that causes more lock
> contention and is not a good option for SSD-based clusters (see the
> config sketch after this list)
> - The variation is observed even when the RBD images have not been
> written at all, indicating that this is not a filesystem issue
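>
> For clarity, by op worker shards we mean the sharded op work queue
> settings in ceph.conf; the snippet below is only an illustration of the
> knobs being varied, not a recommendation:
>
>   [osd]
>   # more shards reduce lock contention on the op queue, but as noted
>   # above they also widen the client-to-client variation we see
>   osd_op_num_shards = 10
>   osd_op_num_threads_per_shard = 2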
>
> Below is a snippet of perf top output for the two runs:
>
> (1) tcmalloc - Client 1
>
>   2.68%  ceph-osd                 [.] crush_hash32_3
>   2.65%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>   1.66%  [kernel]                 [k] _raw_spin_lock
>   1.56%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>   1.51%  libtcmalloc.so.4.1.2     [.] operator delete(void*)
>
> (2) tcmalloc - Client 2 (note the significant increase in tcmalloc's internal free-to-central-list code paths)
>
>  14.75%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::FetchFromSpans()
>   7.46%  libtcmalloc.so.4.1.2     [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
>   6.71%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
>   1.68%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>   1.57%  ceph-osd                 [.] crush_hash32_3
>
> Tying it all together, it looks like new client I/O on a different pool
> changes how the OSD shards are used, which forces memory to move between
> the thread-local caches and the central free lists.
> Increasing the tcmalloc thread-cache limit with
> 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' alleviates the issue in our test
> setups. However, this is only a temporary workaround - it also bloats
> the OSD memory usage.
>

I've noticed that tcmalloc is quite visible in perf top, but I never looked closer because we don't even have debug symbols enabled in our tcmalloc. Here's a production dumpling ceph-osd right now:

Samples: 35K of event 'cycles', Event count (approx.): 4040795974,
Thread: ceph-osd(13976)
 87.81%  libtcmalloc.so.4.1.0.#prelink#.P1wCcj  [.] 0x0000000000017e6f
  1.41%  libpthread-2.12.so                     [.] pthread_mutex_lock
  1.40%  libstdc++.so.6.0.13                    [.] 0x0000000000065b8c

What value did you use for TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and how well did it alleviate the problem? I assume env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=x ceph-osd ... is sufficient to override this?

Cheers, Dan
