@Mark: could you also post your ceph.conf?

2015-08-19 11:36 GMT-07:00 Mark Nelson <mnelson@xxxxxxxxxx>:
> On 08/19/2015 01:20 PM, Allen Samuels wrote:
>>
>> It was a surprising result that the memory allocator is making such
>> a large difference in performance. All of the recent work fiddling
>> with TCMalloc's and Jemalloc's various knobs and switches has been
>> excellent, a great example of group collaboration. But I think it's
>> only a partial optimization of the underlying problem. The real
>> take-away from this activity is that the code base is doing a LOT of
>> memory allocation/deallocation, which consumes substantial CPU
>> time -- regardless of how much we optimize the memory allocator, you
>> can't get away from the fact that it macroscopically MATTERS. The
>> better long-term solution is to reduce reliance on the
>> general-purpose memory allocator and to implement strategies that
>> are more specific to our usage model.
>>
>> What really needs to happen initially is to instrument the
>> allocation/deallocation. Most likely we'll find that 80+% of the
>> work comes from just a few object classes, and it will be easy to
>> create custom allocation strategies for those usages. This will lead
>> to even higher performance that is much less sensitive to
>> easy-to-misconfigure environmental factors, and the entire
>> "tcmalloc/jemalloc -- oops, it uses more memory" discussion will go
>> away.
>
> Yes, I think the real take-away is that Ceph is really hard on memory
> allocators. I think a lot of us have sort of had a feeling this was
> the case for a long time. The current discussion/results just draw it
> a lot more sharply into focus.
>
> On the plus side, there is work going on to make things a little more
> manageable, though a more comprehensive analysis would be very
> welcome! I see that jemalloc has some interesting-looking profiling
> options in the newer releases.
>
> Mark
>
>>
>> Allen Samuels
>> Software Architect, Systems and Software Solutions
>>
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030 | M: +1 408 780 6416
>> allen.samuels@xxxxxxxxxxx
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
>> Sent: Wednesday, August 19, 2015 10:30 AM
>> To: Alexandre DERUMIER
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Ceph Hackathon: More Memory Allocator Testing
>>
>> Yes, it should be 1 per OSD...
>> There is no doubt that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is
>> relative to the number of threads running..
>> But I don't know if the number of threads is a factor for jemalloc..
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Alexandre DERUMIER [mailto:aderumier@xxxxxxxxx]
>> Sent: Wednesday, August 19, 2015 9:55 AM
>> To: Somnath Roy
>> Cc: Mark Nelson; ceph-devel
>> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
>>
>> << I think that tcmalloc has a fixed cache size
>> (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES) and shares it between all
>> the threads of the process.
>>
>>>> I think it is per tcmalloc instance loaded, so at least num_osds *
>>>> num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a
>>>> box.
>>
>> What is num_tcmalloc_instance? I think one OSD process uses one
>> defined TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES size?
>>
>> I'm saying that because I hit exactly the same bug client side, with
>> librbd + tcmalloc + qemu + iothreads. When I configure too many
>> iothreads, I hit the bug directly (100% reproducible). It's as if
>> the thread cache size were divided by the number of threads?
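
For reference, tcmalloc's thread cache is indeed a single per-process
budget shared by all of the process's threads, which matches the
behaviour described above: adding threads (OSD shard threads, qemu
iothreads) shrinks each thread's effective share. The budget can be
inspected and resized at runtime through gperftools' MallocExtension
interface. A minimal sketch, assuming the process runs under a
gperftools tcmalloc recent enough to expose this property; the 128 MB
value is purely illustrative:

    // Query and resize tcmalloc's aggregate thread cache at runtime.
    // Assumes gperftools tcmalloc is the active allocator and exposes
    // the "tcmalloc.max_total_thread_cache_bytes" property.
    // Build: g++ tc_cache.cc -ltcmalloc
    #include <gperftools/malloc_extension.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
      size_t cache_bytes = 0;
      if (MallocExtension::instance()->GetNumericProperty(
              "tcmalloc.max_total_thread_cache_bytes", &cache_bytes)) {
        std::printf("max total thread cache: %zu bytes\n", cache_bytes);
      }
      // One budget per process, shared across all threads. Setting it
      // here does what TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES does via
      // the environment at startup (128 MB is an illustrative value).
      MallocExtension::instance()->SetNumericProperty(
          "tcmalloc.max_total_thread_cache_bytes", 128UL << 20);
      return 0;
    }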
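
On Allen's point above about custom allocation strategies: once
instrumentation shows which few object classes dominate, the usual
remedy is a per-type pool that recycles objects on a freelist instead
of round-tripping through malloc/free. A toy sketch of the idea
follows; it is purely illustrative, not Ceph code, and a production
version would want per-thread or per-shard freelists rather than a
single mutex:

    // Toy per-type pool: carve objects out of large slabs and recycle
    // them on a freelist, bypassing the general-purpose allocator.
    #include <cstddef>
    #include <mutex>
    #include <new>
    #include <utility>
    #include <vector>

    template <typename T>
    class FreelistPool {
      union Node {
        Node* next;                                   // while on freelist
        alignas(T) unsigned char storage[sizeof(T)];  // while live
      };
      static constexpr std::size_t kObjectsPerSlab = 1024;
      Node* free_head_ = nullptr;
      std::vector<Node*> slabs_;
      std::mutex mu_;  // a real version would shard this per thread

      void refill() {
        Node* slab = new Node[kObjectsPerSlab];  // one malloc per slab
        slabs_.push_back(slab);
        for (std::size_t i = 0; i < kObjectsPerSlab; ++i) {
          slab[i].next = free_head_;
          free_head_ = &slab[i];
        }
      }

     public:
      ~FreelistPool() {
        for (Node* s : slabs_) delete[] s;
      }

      template <typename... Args>
      T* construct(Args&&... args) {
        Node* n;
        {
          std::lock_guard<std::mutex> g(mu_);
          if (!free_head_) refill();
          n = free_head_;
          free_head_ = n->next;
        }
        return new (n->storage) T(std::forward<Args>(args)...);
      }

      void destroy(T* p) {
        p->~T();
        std::lock_guard<std::mutex> g(mu_);
        Node* n = reinterpret_cast<Node*>(p);
        n->next = free_head_;
        free_head_ = n;
      }
    };

The hot path then becomes a pointer pop/push instead of a full
allocator call, and the footprint for that type becomes predictable
regardless of which general-purpose allocator is loaded.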
>> ----- Original Message -----
>> From: "Somnath Roy" <Somnath.Roy@xxxxxxxxxxx>
>> To: "aderumier" <aderumier@xxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx>
>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>> Sent: Wednesday, 19 August 2015 18:27:30
>> Subject: RE: Ceph Hackathon: More Memory Allocator Testing
>>
>> << I think that tcmalloc has a fixed cache size
>> (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES) and shares it between all
>> the threads of the process.
>>
>> I think it is per tcmalloc instance loaded, so at least num_osds *
>> num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a
>> box.
>>
>> Also, I think there is no point in increasing osd_op_threads as it
>> is not in the IO path anymore.. Mark is using the default 5:2 for
>> shards:threads-per-shard..
>>
>> But yes, it could be related to the number of threads the OSDs are
>> using; we need to understand how jemalloc works.. Also, there may be
>> some tuning to reduce memory usage (?).
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Alexandre DERUMIER
>> Sent: Wednesday, August 19, 2015 9:06 AM
>> To: Mark Nelson
>> Cc: ceph-devel
>> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
>>
>> I was listening to today's meeting, and it seems that the blocker to
>> making jemalloc the default is that it uses more memory per OSD
>> (around 300MB?), and some people could have boxes with 60 disks.
>>
>> I just wonder if the memory increase is related to the
>> osd_op_num_shards/osd_op_threads values?
>>
>> It seems that at the hackathon the bench was done on a very big CPU
>> box (36 cores / 72 threads),
>> http://ceph.com/hackathon/2015-08-ceph-hammer-full-ssd.pptx
>> with osd_op_threads = 32.
>>
>> I think that tcmalloc has a fixed cache size
>> (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES) and shares it between all
>> the threads of the process. Maybe jemalloc allocates memory per
>> thread.
>>
>> (I think people with 60-disk boxes don't use SSDs, so iops per OSD
>> are low and they don't need a lot of threads per OSD.)
>>
>> ----- Original Message -----
>> From: "aderumier" <aderumier@xxxxxxxxx>
>> To: "Mark Nelson" <mnelson@xxxxxxxxxx>
>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>> Sent: Wednesday, 19 August 2015 16:01:28
>> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
>>
>> Thanks Mark,
>>
>> the results match exactly what I have seen with tcmalloc 2.1 vs 2.4
>> vs jemalloc, and indeed tcmalloc, even with a bigger cache, seems to
>> degrade over time.
>>
>> What is funny is that I see exactly the same behaviour on the client
>> librbd side, with qemu and multiple iothreads.
>>
>> Switching both server and client to jemalloc currently gives me the
>> best performance on small reads.
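
To make the box-level formula above concrete: at Alexandre's figure of
roughly 300 MB of extra resident memory per OSD, a 60-disk box would
need on the order of 18 GB of additional RAM, which explains the
hesitation. The thread-cache side of the formula is just as
mechanical; a sketch with illustrative placeholder numbers, not
measurements:

    // Back-of-envelope for num_osds * num_tcmalloc_instance *
    // TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES; numbers are placeholders.
    #include <cstddef>
    #include <cstdio>

    int main() {
      const std::size_t num_osds = 60;              // dense spinner box
      const std::size_t num_tcmalloc_instance = 1;  // one per OSD process
      const std::size_t cache_bytes = 128UL << 20;  // 128 MB per process
      std::printf("aggregate thread cache: %zu MB\n",
                  (num_osds * num_tcmalloc_instance * cache_bytes) >> 20);
      // Prints 7680 MB: the thread caches alone are ~7.5 GB on such a box.
      return 0;
    }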
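
As for understanding where jemalloc's extra memory goes, jemalloc
ships its own introspection hooks that could drive the tuning question
above. A sketch, assuming an unprefixed jemalloc build (the usual
Linux packaging; exact statistic names vary between jemalloc versions):

    // Ask jemalloc how much memory it believes it is using.
    // Build: g++ je_stats.cc -ljemalloc
    #include <jemalloc/jemalloc.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    int main() {
      // Statistics are snapshotted; advance the epoch to refresh them.
      std::uint64_t epoch = 1;
      std::size_t len = sizeof(epoch);
      mallctl("epoch", &epoch, &len, &epoch, sizeof(epoch));

      std::size_t allocated = 0, active = 0;
      std::size_t sz = sizeof(allocated);
      mallctl("stats.allocated", &allocated, &sz, nullptr, 0);
      sz = sizeof(active);
      mallctl("stats.active", &active, &sz, nullptr, 0);
      std::printf("allocated=%zu active=%zu bytes\n", allocated, active);

      // Human-readable dump with per-arena detail.
      malloc_stats_print(nullptr, nullptr, nullptr);
      return 0;
    }

The profiling options Mark mentions (heap profiling, in builds
configured with profiling enabled) would be a more direct way to do
the allocation instrumentation Allen suggests above.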
>> ----- Original Message -----
>> From: "Mark Nelson" <mnelson@xxxxxxxxxx>
>> To: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>> Sent: Wednesday, 19 August 2015 06:45:36
>> Subject: Ceph Hackathon: More Memory Allocator Testing
>>
>> Hi Everyone,
>>
>> One of the goals at the Ceph Hackathon last week was to examine how
>> to improve Ceph small IO performance. Jian Zhang presented findings
>> showing a dramatic improvement in small random IO performance when
>> Ceph is used with jemalloc. His results build upon Sandisk's
>> original findings that the default thread cache values are a major
>> bottleneck in TCMalloc 2.1. To further verify these results, we sat
>> down at the Hackathon and configured the new performance test
>> cluster that Intel generously donated to the Ceph community
>> laboratory to run through a variety of tests with different memory
>> allocator configurations. I've since written the results of those
>> tests up in PDF form for folks who are interested.
>>
>> The results are located here:
>>
>> http://nhm.ceph.com/hackathon/Ceph_Hackathon_Memory_Allocator_Testing.pdf
>>
>> I want to be clear that many other folks have done the heavy lifting
>> here. These results are simply a validation of the many tests that
>> other folks have already done. Many thanks to Sandisk and others for
>> figuring this out, as it's a pretty big deal!
>>
>> Side note: very little tuning was done during these tests other than
>> swapping the memory allocator and setting a couple of
>> quick-and-dirty Ceph tunables. It's quite possible that higher IOPS
>> will be achieved as we really start digging into the cluster and
>> learning what the bottlenecks are.
>>
>> Thanks,
>> Mark

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html