@Mark: could you also post your ceph.conf?

2015-08-19 11:36 GMT-07:00 Mark Nelson <mnelson@xxxxxxxxxx>:
> On 08/19/2015 01:20 PM, Allen Samuels wrote:
>>
>> It was a surprising result that the memory allocator is making such
>> a large difference in performance. All of the recent work fiddling
>> with TCMalloc's and Jemalloc's various knobs and switches has been
>> excellent, a great example of group collaboration. But I think it's
>> only a partial optimization of the underlying problem. The real
>> take-away from this activity is that the code base is doing a LOT of
>> memory allocation/deallocation, which consumes substantial CPU
>> time -- regardless of how much we optimize the memory allocator, you
>> can't get away from the fact that it macroscopically MATTERS. The
>> better long-term solution is to reduce reliance on the
>> general-purpose memory allocator and to implement strategies that
>> are more specific to our usage model.
>>
>> What really needs to happen initially is to instrument the
>> allocation/deallocation. Most likely we'll find that 80+% of the
>> work comes from just a few object classes, and it will be easy to
>> create custom allocation strategies for those usages. This will lead
>> to even higher performance that is much less sensitive to
>> easy-to-misconfigure environmental factors, and the entire
>> "tcmalloc/jemalloc -- oops, it uses more memory" discussion will go
>> away.
>
> Yes, I think the real take-away is that Ceph is really hard on memory
> allocators. I think a lot of us have sort of had a feeling this was
> the case for a long time. The current discussion/results just draw it
> a lot more sharply into focus.
>
> On the plus side, there is work going on to make things a little more
> manageable, though a more comprehensive analysis would be very
> welcome! I see that jemalloc has some interesting-looking profiling
> options in the newer releases.
>
> Mark
>
>>
>> Allen Samuels
>> Software Architect, Systems and Software Solutions
>>
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030 | M: +1 408 780 6416
>> allen.samuels@xxxxxxxxxxx
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
>> Sent: Wednesday, August 19, 2015 10:30 AM
>> To: Alexandre DERUMIER
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Ceph Hackathon: More Memory Allocator Testing
>>
>> Yes, it should be 1 per OSD...
>> There is no doubt that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is
>> relative to the number of threads running..
>> But I don't know if the number of threads is a factor for jemalloc..
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Alexandre DERUMIER [mailto:aderumier@xxxxxxxxx]
>> Sent: Wednesday, August 19, 2015 9:55 AM
>> To: Somnath Roy
>> Cc: Mark Nelson; ceph-devel
>> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
>>
>> << I think that tcmalloc has a fixed cache size
>> (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES) and shares it between all
>> the threads of the process.
>>
>>>> I think it is per tcmalloc instance loaded, so at least num_osds *
>>>> num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a
>>>> box.
>>
>> What is num_tcmalloc_instance? I think one OSD process uses one
>> defined TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES size?
>>
>> I'm saying that because I hit exactly the same bug client side, with
>> librbd + tcmalloc + qemu + iothreads. When I configure too many
>> iothreads, I hit the bug directly (100% reproducible). It's as if
>> the thread cache size were divided by the number of threads?
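
For reference, tcmalloc's thread cache is indeed a single per-process
budget shared by all of the process's threads, which matches the
behaviour described above: adding threads (OSD shard threads, qemu
iothreads) shrinks each thread's effective share. The budget can be
inspected and resized at runtime through gperftools' MallocExtension
interface. A minimal sketch, assuming the process runs under a
gperftools tcmalloc recent enough to expose this property; the 128 MB
value is purely illustrative:

    // Query and resize tcmalloc's aggregate thread cache at runtime.
    // Assumes gperftools tcmalloc is the active allocator and exposes
    // the "tcmalloc.max_total_thread_cache_bytes" property.
    // Build: g++ tc_cache.cc -ltcmalloc
    #include <gperftools/malloc_extension.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
      size_t cache_bytes = 0;
      if (MallocExtension::instance()->GetNumericProperty(
              "tcmalloc.max_total_thread_cache_bytes", &cache_bytes)) {
        std::printf("max total thread cache: %zu bytes\n", cache_bytes);
      }
      // One budget per process, shared across all threads. Setting it
      // here does what TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES does via
      // the environment at startup (128 MB is an illustrative value).
      MallocExtension::instance()->SetNumericProperty(
          "tcmalloc.max_total_thread_cache_bytes", 128UL << 20);
      return 0;
    }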
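
On Allen's point above about custom allocation strategies: once
instrumentation shows which few object classes dominate, the usual
remedy is a per-type pool that recycles objects on a freelist instead
of round-tripping through malloc/free. A toy sketch of the idea
follows; it is purely illustrative, not Ceph code, and a production
version would want per-thread or per-shard freelists rather than a
single mutex:

    // Toy per-type pool: carve objects out of large slabs and recycle
    // them on a freelist, bypassing the general-purpose allocator.
    #include <cstddef>
    #include <mutex>
    #include <new>
    #include <utility>
    #include <vector>

    template <typename T>
    class FreelistPool {
      union Node {
        Node* next;                                   // while on freelist
        alignas(T) unsigned char storage[sizeof(T)];  // while live
      };
      static constexpr std::size_t kObjectsPerSlab = 1024;
      Node* free_head_ = nullptr;
      std::vector<Node*> slabs_;
      std::mutex mu_;  // a real version would shard this per thread

      void refill() {
        Node* slab = new Node[kObjectsPerSlab];  // one malloc per slab
        slabs_.push_back(slab);
        for (std::size_t i = 0; i < kObjectsPerSlab; ++i) {
          slab[i].next = free_head_;
          free_head_ = &slab[i];
        }
      }

     public:
      ~FreelistPool() {
        for (Node* s : slabs_) delete[] s;
      }

      template <typename... Args>
      T* construct(Args&&... args) {
        Node* n;
        {
          std::lock_guard<std::mutex> g(mu_);
          if (!free_head_) refill();
          n = free_head_;
          free_head_ = n->next;
        }
        return new (n->storage) T(std::forward<Args>(args)...);
      }

      void destroy(T* p) {
        p->~T();
        std::lock_guard<std::mutex> g(mu_);
        Node* n = reinterpret_cast<Node*>(p);
        n->next = free_head_;
        free_head_ = n;
      }
    };

The hot path then becomes a pointer pop/push instead of a full
allocator call, and the footprint for that type becomes predictable
regardless of which general-purpose allocator is loaded.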
>> ----- Original Message -----
>> From: "Somnath Roy" <Somnath.Roy@xxxxxxxxxxx>
>> To: "aderumier" <aderumier@xxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx>
>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>> Sent: Wednesday, 19 August 2015 18:27:30
>> Subject: RE: Ceph Hackathon: More Memory Allocator Testing
>>
>> << I think that tcmalloc has a fixed cache size
>> (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES) and shares it between all
>> the threads of the process.
>>
>> I think it is per tcmalloc instance loaded, so at least num_osds *
>> num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a
>> box.
>>
>> Also, I think there is no point in increasing osd_op_threads as it
>> is not in the IO path anymore.. Mark is using the default 5:2 for
>> shards:threads-per-shard..
>>
>> But yes, it could be related to the number of threads the OSDs are
>> using; we need to understand how jemalloc works.. Also, there may be
>> some tuning to reduce memory usage (?).
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Alexandre DERUMIER
>> Sent: Wednesday, August 19, 2015 9:06 AM
>> To: Mark Nelson
>> Cc: ceph-devel
>> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
>>
>> I was listening to today's meeting, and it seems that the blocker to
>> making jemalloc the default is that it uses more memory per OSD
>> (around 300MB?), and some people could have boxes with 60 disks.
>>
>> I just wonder if the memory increase is related to the
>> osd_op_num_shards/osd_op_threads values?
>>
>> It seems that at the hackathon the bench was done on a very big CPU
>> box (36 cores / 72 threads),
>> http://ceph.com/hackathon/2015-08-ceph-hammer-full-ssd.pptx
>> with osd_op_threads = 32.
>>
>> I think that tcmalloc has a fixed cache size
>> (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES) and shares it between all
>> the threads of the process. Maybe jemalloc allocates memory per
>> thread.
>>
>> (I think people with 60-disk boxes don't use SSDs, so iops per OSD
>> are low and they don't need a lot of threads per OSD.)
>>
>> ----- Original Message -----
>> From: "aderumier" <aderumier@xxxxxxxxx>
>> To: "Mark Nelson" <mnelson@xxxxxxxxxx>
>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>> Sent: Wednesday, 19 August 2015 16:01:28
>> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
>>
>> Thanks Mark,
>>
>> the results match exactly what I have seen with tcmalloc 2.1 vs 2.4
>> vs jemalloc, and indeed tcmalloc, even with a bigger cache, seems to
>> degrade over time.
>>
>> What is funny is that I see exactly the same behaviour on the client
>> librbd side, with qemu and multiple iothreads.
>>
>> Switching both server and client to jemalloc currently gives me the
>> best performance on small reads.
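
To make the box-level formula above concrete: at Alexandre's figure of
roughly 300 MB of extra resident memory per OSD, a 60-disk box would
need on the order of 18 GB of additional RAM, which explains the
hesitation. The thread-cache side of the formula is just as
mechanical; a sketch with illustrative placeholder numbers, not
measurements:

    // Back-of-envelope for num_osds * num_tcmalloc_instance *
    // TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES; numbers are placeholders.
    #include <cstddef>
    #include <cstdio>

    int main() {
      const std::size_t num_osds = 60;              // dense spinner box
      const std::size_t num_tcmalloc_instance = 1;  // one per OSD process
      const std::size_t cache_bytes = 128UL << 20;  // 128 MB per process
      std::printf("aggregate thread cache: %zu MB\n",
                  (num_osds * num_tcmalloc_instance * cache_bytes) >> 20);
      // Prints 7680 MB: the thread caches alone are ~7.5 GB on such a box.
      return 0;
    }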
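
As for understanding where jemalloc's extra memory goes, jemalloc
ships its own introspection hooks that could drive the tuning question
above. A sketch, assuming an unprefixed jemalloc build (the usual
Linux packaging; exact statistic names vary between jemalloc versions):

    // Ask jemalloc how much memory it believes it is using.
    // Build: g++ je_stats.cc -ljemalloc
    #include <jemalloc/jemalloc.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    int main() {
      // Statistics are snapshotted; advance the epoch to refresh them.
      std::uint64_t epoch = 1;
      std::size_t len = sizeof(epoch);
      mallctl("epoch", &epoch, &len, &epoch, sizeof(epoch));

      std::size_t allocated = 0, active = 0;
      std::size_t sz = sizeof(allocated);
      mallctl("stats.allocated", &allocated, &sz, nullptr, 0);
      sz = sizeof(active);
      mallctl("stats.active", &active, &sz, nullptr, 0);
      std::printf("allocated=%zu active=%zu bytes\n", allocated, active);

      // Human-readable dump with per-arena detail.
      malloc_stats_print(nullptr, nullptr, nullptr);
      return 0;
    }

The profiling options Mark mentions (heap profiling, in builds
configured with profiling enabled) would be a more direct way to do
the allocation instrumentation Allen suggests above.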
>> ----- Original Message -----
>> From: "Mark Nelson" <mnelson@xxxxxxxxxx>
>> To: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>> Sent: Wednesday, 19 August 2015 06:45:36
>> Subject: Ceph Hackathon: More Memory Allocator Testing
>>
>> Hi Everyone,
>>
>> One of the goals at the Ceph Hackathon last week was to examine how
>> to improve Ceph small IO performance. Jian Zhang presented findings
>> showing a dramatic improvement in small random IO performance when
>> Ceph is used with jemalloc. His results build upon Sandisk's
>> original findings that the default thread cache values are a major
>> bottleneck in TCMalloc 2.1. To further verify these results, we sat
>> down at the Hackathon and configured the new performance test
>> cluster that Intel generously donated to the Ceph community
>> laboratory to run through a variety of tests with different memory
>> allocator configurations. I've since written the results of those
>> tests up in PDF form for folks who are interested.
>>
>> The results are located here:
>>
>> http://nhm.ceph.com/hackathon/Ceph_Hackathon_Memory_Allocator_Testing.pdf
>>
>> I want to be clear that many other folks have done the heavy lifting
>> here. These results are simply a validation of the many tests that
>> other folks have already done. Many thanks to Sandisk and others for
>> figuring this out, as it's a pretty big deal!
>>
>> Side note: very little tuning was done during these tests other than
>> swapping the memory allocator and setting a couple of
>> quick-and-dirty Ceph tunables. It's quite possible that higher IOPS
>> will be achieved as we really start digging into the cluster and
>> learning what the bottlenecks are.
>>
>> Thanks,
>> Mark

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html