RE: RE: tcmalloc use a lot of CPU

Hi!

>How many nodes? How many SSDs/OSDs?

2 Nodes, each:
- 1xE5-2670, 128G,
- 2x146G SAS 10krpm - system + MON root
- 10x600G SAS 10krpm + 7x900G SAS 10krpm single drive RAID0 on lsi2208
- 2x400G SSD Intel DC S3700 on C602 - for separate SSD pool
- 2x200G SSD Intel DC S3700 on SATA3- for ceph journals
- 10Gbit shared interconnect (Eth)

So: 2 MONs (I know about quorum ;) ) + 34 HDD OSDs + 4 SSD OSDs
Ceph 0.94.2 on Debian Jessie. Tuning: swappiness, low-latency TCP tuning,
enlarged TCP buffers, disabled interrupt coalescing, noop scheduler on the SSDs, deadline on the HDDs.
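For reference, a rough sketch of that tuning - the values below are illustrative assumptions,
not our exact settings, and eth0/sdX/sdY are placeholders:

   # illustrative values, not our exact settings
   sysctl -w vm.swappiness=1                       # keep OSD memory off swap
   sysctl -w net.ipv4.tcp_low_latency=1            # low-latency TCP
   sysctl -w net.core.rmem_max=16777216            # enlarge TCP buffers
   sysctl -w net.core.wmem_max=16777216
   ethtool -C eth0 rx-usecs 0 tx-usecs 0           # disable interrupt coalescing
   echo noop > /sys/block/sdX/queue/scheduler      # SSDs
   echo deadline > /sys/block/sdY/queue/scheduler  # HDDs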

>Are they random?

Yes. 4k random read, 8 processes, aio, qd=32 over 500G RBD volumes.
There are 2 testing volumes - on the HDD and SSD pools. The client runs
on a separate host with a 10Gbit network. The volumes are real Linux filesystems,
created with rbd import, so they are fully allocated.

>What are you using to make the tests?


fio-rbd 2.2.7 - with native RBD support, built from source.
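The job file is roughly the following (the pool and image names here are made up
for illustration, the real ones differ):

   [global]
   ; pool and image names below are hypothetical
   ioengine=rbd
   clientname=admin
   pool=rbd-hdd
   rbdname=testvol
   rw=randread
   bs=4k
   iodepth=32
   direct=1
   numjobs=8
   group_reporting
   time_based
   runtime=300

   [4k-randread]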

>How big are those OPS?

When I use the default ceph.conf (simple messenger, crc on, cephx on, all debug off):

1. ~12k iops from HDD pool in cold state (after dropping caches on all nodes)
- 8-10% user, 2-3% sys, ~70% iowait, 18% idle
- iostat shows >70% load on OSD drives
- perf top shows
   7,53%  libtcmalloc.so.4.2.2              [.] tcmalloc::SLL_Next(void*)
   1,86%  libtcmalloc.so.4.2.2              [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
   1,51%  libpthread-2.19.so                [.] __pthread_mutex_unlock_usercnt
   1,49%  libtcmalloc.so.4.2.2              [.] TCMalloc_PageMap3<35>::get(unsigned long) const
   1,29%  libtcmalloc.so.4.2.2              [.] PackedCache<35, unsigned long>::GetOrDefault(unsigned long, unsigned long)
   1,25%  libtcmalloc.so.4.2.2              [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
   1,19%  ceph-osd                          [.] crush_hash32_3
   1,00%  libpthread-2.19.so                [.] pthread_mutex_lock
   0,89%  libtcmalloc.so.4.2.2              [.] tcmalloc::ThreadCache::Deallocate(void*, unsigned long)
   0,87%  libtcmalloc.so.4.2.2              [.] base::subtle::NoBarrier_Load(long const volatile*)

2. ~30-40k iops from HDD pool in warm state (second pass)
- 40-60% user (!), 8-10% sys, <1% iowait, ~50% idle
- iostat shows <1% load on OSD drives
- perf top shows the same picture - tcmalloc calls are at the top

This is quite an understandable situation: on the first run most IO is read from the platters, and we get
12000 iops / 34 OSDs ≈ 350 iops per OSD, which is a good value for a 10krpm drive. On the second run
reads are served (mostly) from the page cache, so there is no IO on the platters. But both runs show that
there is some tcmalloc issue limiting the overall IO of the cluster. Also, >40% CPU in the second run
is an abnormal value, I think.
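By the way, one mitigation often suggested for this class of tcmalloc behaviour is enlarging the
tcmalloc thread cache via an environment variable before the OSDs start - a minimal sketch, assuming
the OSDs link against gperftools tcmalloc and that the init scripts read /etc/default/ceph (both assumptions):

   # 128 MB instead of the small default; the value and the file location are assumptions
   TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728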


Next test is the same, except volume is on the SSD pool.

3. ~43k iops from SSD pool in cold state (after dropping caches on all nodes)
- 25% user, 8-12% sys, ~6% iowait, ~55-60% idle
- iostat shows ~55-65% load on SSD with ~8 kiops each (4 ssd total in pool)
- perf top shows two different things, I'll explain later(*)

4. Also the same ~43k iops from SSD pool in warm state

This test shows that Ceph limits performance by itself somewhere, because (a) there is almost no
difference in iops between serving IO from the SSDs themselves and from the page cache - I think IO
from the page cache should be faster anyway - and (b) each SSD can do >30k iops of random read,
while we get only ~8k per drive.
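For reference, claim (b) can be checked with a raw-device random-read baseline directly against one
SSD, outside of Ceph (read-only; the device path is a placeholder):

   fio --name=ssd-baseline --filename=/dev/sdX --ioengine=libaio --direct=1 \
       --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --group_reporting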

(*) As for the perf top results, sometimes things change quickly, and instead of tcmalloc's
calls at the top we get:
  46,07%  [kernel]              [k] _raw_spin_lock
   6,51%  [kernel]              [k] mb_cache_entry_alloc

Judging by the function names, these are kernel calls for cache allocation. In the normal
situation they are far behind the tcmalloc calls, but sometimes they go up in perf top.
At those moments, performance from the SSD pool drops significantly - to <10k iops.
This does not happen while benchmarking the volume located on the HDD pool,
only when testing the volume on the SSD pool. Pity, but I don't have any explanation.
A kernel issue?
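As far as I know, mb_cache_entry_alloc comes from the kernel's mbcache, the extended-attribute
block cache used by ext2/3/4, so it may point at xattr traffic on the OSD filesystems. To see who
is actually spinning on _raw_spin_lock, call graphs help more than the flat perf top output;
something like this (the duration is arbitrary):

   perf record -a -g -- sleep 30   # capture system-wide call graphs for 30s
   perf report --stdio             # look at the callers of _raw_spin_lock / mb_cache_entry_alloc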

>Using atop on the OSD nodes where is your bottleneck? 


That is the main question! We built this test Hammer install to get the best performance
from it, because our production Firefly cluster does not perform so well. And I can't see
any bottleneck that limits performance to ~40k iops, except the tcmalloc issues.

PS: I tried the ms_async messenger, and it raises performance to over 60k iops!
That is very good! But the bad thing is a core dump that always happens within two minutes
of starting. As far as I can see, there is an assert on memory deallocation in the AsyncMessenger code.
I hope the async messenger will work better in newer Ceph versions, as it really helps to
increase performance.
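For anyone who wants to reproduce this: in Hammer the async messenger is still experimental, and
if I remember the option names correctly it is enabled roughly like this in ceph.conf:

   [global]
   enable experimental unrecoverable data corrupting features = ms-type-async
   ms type = async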


Megov Igor
CIO, Yuterra



From: Luis Periquito <periquito@xxxxxxxxx>
Sent: 17 August 2015 17:15
To: Межов Игорь Александрович
Cc: YeYin; ceph-users
Subject: Re: RE: tcmalloc use a lot of CPU
 
How big are those OPS? Are they random? How many nodes? How many SSDs/OSDs? What are you using to make the tests? Using atop on the OSD nodes where is your bottleneck? 

On Mon, Aug 17, 2015 at 1:05 PM, Межов Игорь Александрович <megov@xxxxxxxxxx> wrote:
Hi!

We also observe the same behavior on our test Hammer install, and I wrote about it some time ago:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/22609

Jan Schremes gave us some suggestions in that thread, but we still have not got any positive results - TCMalloc usage is
high. The usage drops to <10% when we disable crc in messages, disable debug and disable cephx auth,
but that is of course not for production use. Also we got a different trace while performing fio-rbd benchmarks
on the SSD pool:
---
  46,07%  [kernel]              [k] _raw_spin_lock
   6,51%  [kernel]              [k] mb_cache_entry_alloc
   5,74%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
   5,50%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_Next(void*)
   3,86%  libtcmalloc.so.4.2.2  [.] TCMalloc_PageMap3<35>::get(unsigned long) const
   2,73%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
   0,69%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
   0,69%  libtcmalloc.so.4.2.2  [.] tcmalloc::PageHeap::GetDescriptor(unsigned long) const
   0,64%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_PopRange(void**, int, void**, void**)
---

I don't clearly understand what is happening in this case: the SSD pool is connected to the same host,
but a different controller (the onboard C60X instead of the LSI2208), the IO scheduler is set to noop, and the pool
is built from 4x400GB Intel DC S3700 drives, so it should perform better, I think - more than 30-40 kiops.
But we get the trace above and no more than 12-15 kiops. Where could the problem be?
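For reference, the non-production settings mentioned above (cephx, crc and debug off) look roughly
like this in the test ceph.conf - option names are from memory and version-dependent, so treat this
as a sketch:

   [global]
   auth cluster required = none
   auth service required = none
   auth client required = none
   ms nocrc = true
   debug ms = 0/0
   debug osd = 0/0
   debug filestore = 0/0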






Megov Igor
CIO, Yuterra


From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of YeYin <eyniy@xxxxxx>
Sent: 17 August 2015 12:58
To: ceph-users
Subject: tcmalloc use a lot of CPU
 
Hi, all,
  When I do performance test with rados bench, I found tcmalloc consumed a lot of CPU:

Samples: 265K of event 'cycles', Event count (approx.): 104385445900
+  27.58%  libtcmalloc.so.4.1.0    [.] tcmalloc::CentralFreeList::FetchFromSpans()
+  15.25%  libtcmalloc.so.4.1.0    [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long,
+  12.20%  libtcmalloc.so.4.1.0    [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
+   1.63%  perf                    [.] append_chain
+   1.39%  libtcmalloc.so.4.1.0    [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
+   1.02%  libtcmalloc.so.4.1.0    [.] tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
+   0.85%  libtcmalloc.so.4.1.0    [.] 0x0000000000017e6f
+   0.75%  libtcmalloc.so.4.1.0    [.] tcmalloc::ThreadCache::IncreaseCacheLimitLocked()
+   0.67%  libc-2.12.so            [.] memcpy
+   0.53%  libtcmalloc.so.4.1.0    [.] operator delete(void*)

Ceph version:
# ceph --version
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)

Kernel version:
3.10.83

Is this phenomenon normal? Is there any idea about this problem?

Thanks.
Ye


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


