Hi!

We got some strange performance results when running a random-read fio test on our test Hammer cluster.

When we run fio with the RBD engine (4k, randread, 8 jobs, QD=32, 500 GB rbd image) for the first time (page cache is cold/empty), we get ~12k IOPS sustained. That is quite a reasonable value: 12k IOPS / 34 OSDs = ~352 IOPS per disk, which is normal for a 10k SAS drive. Since most of the data really has to be read from the platters, we also see high iowait (~45%) and moderate user CPU (~35%).

But when we run the same test a second time, some of the data is already in the page cache and can be accessed faster, and indeed we get ~25k IOPS. Now iowait is low (~1-3%), but user CPU is surprisingly high, >70%. Perf top shows that most of the calls are in the tcmalloc library:

  19,61%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
  15,53%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_Next(void*)
   9,03%  libtcmalloc.so.4.2.2  [.] TCMalloc_PageMap3<35>::get(unsigned long) const
   6,71%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
   1,59%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
   1,58%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_PopRange(void**, int, void**, void**)
   1,42%  libtcmalloc.so.4.2.2  [.] tcmalloc::PageHeap::GetDescriptor(unsigned long) const
   1,03%  libtcmalloc.so.4.2.2  [.] 0x0000000000060589
   0,91%  libtcmalloc.so.4.2.2  [.] tcmalloc::ThreadCache::Scavenge()
   0,82%  libtcmalloc.so.4.2.2  [.] tcmalloc::DLL_Remove(tcmalloc::Span*)
   0,80%  libtcmalloc.so.4.2.2  [.] tcmalloc::ThreadCache::IncreaseCacheLimitLocked()
   0,75%  libtcmalloc.so.4.2.2  [.] tcmalloc::Static::pageheap()
   0,69%  libtcmalloc.so.4.2.2  [.] PackedCache<35, unsigned long>::GetOrDefault(unsigned long, unsigned long) const
   0,51%  libpthread-2.19.so    [.] __pthread_mutex_unlock_usercnt

Running the same test over an RBD image in the SSD pool gives the same 25-30k IOPS, even though each DC S3700 SSD we use in that pool easily delivers >50k IOPS on its own. So I think the 25-30k IOPS limit we hit is due to tcmalloc inefficiency.

What can we do to improve these results? Is there some tcmalloc tuning we should apply, or would compiling Ceph with jemalloc give better results? Do you have any thoughts?

Our small test Hammer install:
- Debian Jessie;
- Ceph Hammer 0.94.2, self-built from sources (with tcmalloc);
- 1x E5-2670 + 128 GB RAM;
- 2 nodes shared with mons; system and mon DB are on a separate SAS mirror;
- 17 OSDs on each node, SAS 10k;
- 2x Intel DC S3700 200 GB SSD for journals on each node;
- 2x Intel DC S3700 400 GB SSD for a separate SSD pool;
- 10 Gbit interconnect, shared public and cluster network, MTU 9100;
- 10 Gbit client host, fio 2.2.7 compiled with the RBD engine.

Megov Igor
CIO, Yuterra
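P.S. For reference, the fio invocation we run is roughly the following sketch (pool and image names here are placeholders, not our real ones):

  # rough equivalent of our test; adjust pool/rbdname/clientname to your setup
  fio --name=rbd-randread --ioengine=rbd --clientname=admin \
      --pool=rbd --rbdname=test-image --invalidate=0 \
      --rw=randread --bs=4k --numjobs=8 --iodepth=32 \
      --time_based --runtime=300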
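One thing we are thinking of trying (not tested yet) is raising tcmalloc's aggregate thread cache via the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable before the OSDs start. A rough sketch for our sysvinit Debian boxes, assuming the installed gperftools build actually honors the variable; where exactly to export it depends on how the daemons are launched:

  # raise tcmalloc's total thread cache from the 32 MB default to 128 MB,
  # e.g. exported from the init script environment before the OSDs start
  export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128 MiB
  /etc/init.d/ceph restart osd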