Hi!

We got some strange performance results when running a random-read fio test on our test Hammer cluster.

When we run fio with the RBD engine (4k, randread, 8 jobs, QD=32, 500 GB rbd image) for the first time (page cache is cold/empty), we get ~12k IOPS sustained. That is quite a reasonable value: 12k IOPS / 34 OSDs = ~352 IOPS per disk, which is normal for a 10k SAS drive. Since most of the data really has to be read from the platters, we also see high iowait (~45%) and moderate user CPU (~35%).

But when we run the same test a second time, some of the data is already in the page cache and can be accessed faster, and indeed we get ~25k IOPS. Now iowait is low (~1-3%), but user CPU is surprisingly high, >70%. Perf top shows that most of the calls are in the tcmalloc library:

  19,61%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
  15,53%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_Next(void*)
   9,03%  libtcmalloc.so.4.2.2  [.] TCMalloc_PageMap3<35>::get(unsigned long) const
   6,71%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
   1,59%  libtcmalloc.so.4.2.2  [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
   1,58%  libtcmalloc.so.4.2.2  [.] tcmalloc::SLL_PopRange(void**, int, void**, void**)
   1,42%  libtcmalloc.so.4.2.2  [.] tcmalloc::PageHeap::GetDescriptor(unsigned long) const
   1,03%  libtcmalloc.so.4.2.2  [.] 0x0000000000060589
   0,91%  libtcmalloc.so.4.2.2  [.] tcmalloc::ThreadCache::Scavenge()
   0,82%  libtcmalloc.so.4.2.2  [.] tcmalloc::DLL_Remove(tcmalloc::Span*)
   0,80%  libtcmalloc.so.4.2.2  [.] tcmalloc::ThreadCache::IncreaseCacheLimitLocked()
   0,75%  libtcmalloc.so.4.2.2  [.] tcmalloc::Static::pageheap()
   0,69%  libtcmalloc.so.4.2.2  [.] PackedCache<35, unsigned long>::GetOrDefault(unsigned long, unsigned long) const
   0,51%  libpthread-2.19.so    [.] __pthread_mutex_unlock_usercnt

Running the same test over an RBD image in the SSD pool gives the same 25-30k IOPS, even though each DC S3700 SSD we use in that pool easily delivers >50k IOPS on its own. So I think the 25-30k IOPS limit we hit is due to tcmalloc inefficiency.

What can we do to improve these results? Is there some tcmalloc tuning we should apply, or would compiling Ceph with jemalloc give better results? Do you have any thoughts?

Our small test Hammer install:
- Debian Jessie;
- Ceph Hammer 0.94.2, self-built from sources (with tcmalloc);
- 1x E5-2670 + 128 GB RAM;
- 2 nodes shared with mons; system and mon DB are on a separate SAS mirror;
- 17 OSDs on each node, SAS 10k;
- 2x Intel DC S3700 200 GB SSD for journals on each node;
- 2x Intel DC S3700 400 GB SSD for a separate SSD pool;
- 10 Gbit interconnect, shared public and cluster network, MTU 9100;
- 10 Gbit client host, fio 2.2.7 compiled with the RBD engine.

Megov Igor
CIO, Yuterra
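P.S. For reference, the fio invocation we run is roughly the following sketch (pool and image names here are placeholders, not our real ones):

  # rough equivalent of our test; adjust pool/rbdname/clientname to your setup
  fio --name=rbd-randread --ioengine=rbd --clientname=admin \
      --pool=rbd --rbdname=test-image --invalidate=0 \
      --rw=randread --bs=4k --numjobs=8 --iodepth=32 \
      --time_based --runtime=300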
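One thing we are thinking of trying (not tested yet) is raising tcmalloc's aggregate thread cache via the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable before the OSDs start. A rough sketch for our sysvinit Debian boxes, assuming the installed gperftools build actually honors the variable; where exactly to export it depends on how the daemons are launched:

  # raise tcmalloc's total thread cache from the 32 MB default to 128 MB,
  # e.g. exported from the init script environment before the OSDs start
  export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128 MiB
  /etc/init.d/ceph restart osd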