Hi!
>How many nodes? How many SSDs/OSDs?
2 Nodes, each:
- 1xE5-2670, 128G,
- 2x146G SAS 10krpm - system + MON root
- 10x600G SAS 10krpm + 7x900G SAS 10krpm single drive RAID0 on lsi2208
- 2x400G SSD Intel DC S3700 on C602 - for separate SSD pool
- 2x200G SSD Intel DC S3700 on SATA3 - for ceph journals
- 10Gbit shared interconnect (Eth)
So: 2 MONs (I know about quorum ;) ) + 34 HDD OSDs + 4 SSD OSDs
Ceph 0.94.2 on Debian Jessie. Tuning: swappiness, low-latency TCP tuning,
enlarged TCP buffers, interrupt coalescing disabled, noop scheduler on SSDs, deadline on HDDs.
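To be concrete, the host tuning is roughly the following (a sketch - values and device names here are illustrative, not a verbatim dump of our settings):
---
# memory / network
sysctl -w vm.swappiness=1
sysctl -w net.ipv4.tcp_low_latency=1
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
# disable interrupt coalescing on the 10Gbit NIC (eth0 is a placeholder)
ethtool -C eth0 adaptive-rx off rx-usecs 0
# io schedulers: noop for SSDs, deadline for HDDs (sdb/sdc are placeholders)
echo noop > /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdc/queue/scheduler
---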
>Are they random?
Yes. 4k random read, 8 processes, aio, qd=32 over 500G RBD volumes.
There are 2 test volumes - one on the HDD pool and one on the SSD pool. The client runs
on a separate host over a 10Gbit network. The volumes are real Linux filesystems,
created with rbd import, so they are fully allocated.
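The fio job is essentially this (a sketch - the pool and image names are placeholders, not our real ones):
---
[global]
# fio built with librbd support
ioengine=rbd
clientname=admin
pool=rbd-hdd
rbdname=testvol
rw=randread
bs=4k
iodepth=32
numjobs=8
group_reporting

[rbd-randread-4k]
---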
>What are you using to make the tests?
fio-rbd 2.2.7 - with native rbd support, built from source.
>How big are those OPS?
When I use the default ceph.conf (simple messenger, use crc, use cephx, all debug off - a sketch of these options follows the analysis of runs 1 and 2 below):
1. ~12k iops from HDD pool in cold state (after dropping caches on all nodes)
- 8-10% user, 2-3% sys, ~70% iowait, 18% idle
- iostat shows >70% load on OSD drives
- perf top shows
7,53% libtcmalloc.so.4.2.2 [.] tcmalloc::SLL_Next(void*)
1,86% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
1,51% libpthread-2.19.so [.] __pthread_mutex_unlock_usercnt
1,49% libtcmalloc.so.4.2.2 [.] TCMalloc_PageMap3<35>::get(unsigned long) const
1,29% libtcmalloc.so.4.2.2 [.] PackedCache<35, unsigned long>::GetOrDefault(unsigned long, unsigned long)
1,25% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
1,19% ceph-osd [.] crush_hash32_3
1,00% libpthread-2.19.so [.] pthread_mutex_lock
0,89% libtcmalloc.so.4.2.2 [.] tcmalloc::ThreadCache::Deallocate(void*, unsigned long)
0,87% libtcmalloc.so.4.2.2 [.] base::subtle::NoBarrier_Load(long const volatile*)
2. ~30-40k iops from HDD pool in warm state (second pass)
- 40-60% user (!), 8-10% sys, <1% iowait, ~50% idle
- iostat shows <1% load on OSD drives
- perf top shows the same - tcmalloc calls are at the top
This is quite an understandable situation: in the first run most IO is read from the platters and we get
12000 iops / 34 OSDs ~ 350 iops per drive, which is a good value for a 10krpm drive. In the second run we serve
reads (mostly) from the page cache, so there is no IO on the platters. But both runs show us that there is
some tcmalloc issue limiting the overall IO of the cluster. Also, >40% CPU in the second run
is an abnormal value, I think.
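For reference, the ceph.conf knobs I keep mentioning look roughly like this (Hammer option names as I understand them; a sketch, not our exact file):
---
[global]
# messenger type: "simple" is the Hammer default, "async" is the experimental one
ms_type = simple
# set to true to skip message CRC calculation
ms_nocrc = false
# set all three to "none" to switch off cephx auth
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
# debug logging off
debug_ms = 0/0
debug_osd = 0/0
debug_filestore = 0/0
---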
The next test is the same, except the volume is on the SSD pool.
3. ~43k iops from SSD pool in cold state (after dropping caches on all nodes)
- 25% user, 8-12% sys, ~6% iowait, ~55-60% idle
- iostat shows ~55-65% load on the SSDs with ~8 kiops each (4 SSDs total in the pool)
- perf top shows two different things, I'll explain later (*)
4. The same ~43k iops from the SSD pool in warm state
This test shows that Ceph limits performance somewhere by itself, because
(a) there is almost no difference in iops between serving IO from the SSDs themselves and from
the page cache, and I would expect IO from the page cache to be faster anyway;
and (b) each SSD can do >30k iops random read, while we got only ~8k per drive.
(*) As for the perf top results, sometimes things change quickly and, instead of tcmalloc
calls at the top, we get:
46,07% [kernel] [k] _raw_spin_lock
6,51% [kernel] [k] mb_cache_entry_alloc
Judging by the function names, these are kernel calls for cache allocation. Normally
they are far behind the tcmalloc calls, but sometimes they rise to the top of perf top.
At these moments, performance from the SSD pool drops significantly - to <10k iops.
And this never happens while benchmarking the volume located on the HDD pool,
only when testing the volume on the SSD pool. A pity, but I don't have any explanation.
A kernel issue?
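If someone wants to dig into this, a call-graph capture during one of these drops is just standard perf usage, nothing Ceph-specific (sketch):
---
# sample all CPUs with call graphs for 30 seconds while the SSD test runs
perf record -a -g -- sleep 30
# then look at which call paths end in _raw_spin_lock / mb_cache_entry_alloc
perf report
---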
>Using atop on the OSD nodes where is your bottleneck?
That is the main question! We built this test Hammer install to get the best performance
out of it, because our production Firefly cluster does not perform so well. And I can't see
any bottleneck that limits performance to ~40k iops, except the tcmalloc issues.
PS: I tried the ms_async messenger, and it raises peak performance to over 60k iops!
That is very good! But the bad thing is a core dump that always happens within two minutes
of starting. As far as I can see, there is an assert on memory deallocation in the AsyncMessenger code.
I hope the async messenger will work better in newer Ceph versions, as it really helps to
increase performance.
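Enabling it is just the messenger type in ceph.conf (a sketch; in Hammer the async messenger is still experimental, which probably explains the crash):
---
[global]
# default is "simple"
ms_type = async
---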
Megov Igor
CIO, Yuterra
From: Luis Periquito <periquito@xxxxxxxxx>
Sent: August 17, 2015 17:15
To: Межов Игорь Александрович
Cc: YeYin; ceph-users
Subject: Re: RE: tcmalloc use a lot of CPU
How big are those OPS? Are they random? How many nodes? How many SSDs/OSDs? What are you using to make the tests? Using atop on the OSD nodes where is your bottleneck?
On Mon, Aug 17, 2015 at 1:05 PM, Межов Игорь Александрович
<megov@xxxxxxxxxx> wrote:
Hi!
We also observe the same behavior on our test Hammer install, and I wrote about it some time ago:
http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/22609
Jan Schremes gave us some suggestions in that thread, but we still have not got any positive results - TCMalloc usage
is high. The usage drops below 10% when we disable CRC in messages, disable debug and disable cephx auth,
but that is of course not for production use. We also got a different trace while performing fio-rbd benchmarks
on the SSD pool:
---
46,07% [kernel] [k] _raw_spin_lock
6,51% [kernel] [k] mb_cache_entry_alloc
5,74% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
5,50% libtcmalloc.so.4.2.2 [.] tcmalloc::SLL_Next(void*)
3,86% libtcmalloc.so.4.2.2 [.] TCMalloc_PageMap3<35>::get(unsigned long) const
2,73% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
0,69% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
0,69% libtcmalloc.so.4.2.2 [.] tcmalloc::PageHeap::GetDescriptor(unsigned long) const
0,64% libtcmalloc.so.4.2.2 [.] tcmalloc::SLL_PopRange(void**, int, void**, void**)
---
I don't clearly understand what is happening in this case: the SSD pool is attached to the same host,
but on a different controller (C60X onboard instead of LSI2208), the io scheduler is set to noop, and the pool is built
from 4x400Gb Intel DC S3700, so it should perform better, I think - more than 30-40 kiops.
But we got the trace above and no more than 12-15 kiops. Where could the problem be?
Megov Igor
CIO, Yuterra
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of YeYin <eyniy@xxxxxx>
Sent: August 17, 2015 12:58
To: ceph-users
Subject: tcmalloc use a lot of CPU
Hi, all,
When I do a performance test with rados bench, I found tcmalloc consumed a lot of CPU:
Samples: 265K of event 'cycles', Event count (approx.): 104385445900
+ 27.58% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::FetchFromSpans()
+ 15.25% libtcmalloc.so.4.1.0 [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long,
+ 12.20% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
+ 1.63% perf [.] append_chain
+ 1.39% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
+ 1.02% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
+ 0.85% libtcmalloc.so.4.1.0 [.] 0x0000000000017e6f
+ 0.75% libtcmalloc.so.4.1.0 [.] tcmalloc::ThreadCache::IncreaseCacheLimitLocked()
+ 0.67% libc-2.12.so [.] memcpy
+ 0.53% libtcmalloc.so.4.1.0 [.] operator delete(void*)
Ceph version:
# ceph --version
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
Kernel version: 3.10.83
Is this phenomenon normal? Is there any idea about this problem?
Thanks.
Ye
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com