I'll look at those references. Thanks!

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Haomai Wang [mailto:haomai@xxxxxxxx]
> Sent: Wednesday, September 28, 2016 3:34 PM
> To: Sage Weil <sage@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Memory Pooling and Containers
>
> On Wed, Sep 28, 2016 at 9:27 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Tue, 27 Sep 2016, Allen Samuels wrote:
> >> As we discussed in the Bluestore standup this morning, this is intended to start a discussion about creating some internal memory pooling technology to try to get a better handle on the internal usage of memory by Ceph. Let's start by discussing the requirements...
> >>
> >> Here is my list of requirements:
> >>
> >> (1) Should be able to create an arbitrary number of "pools" of memory.
> >>
> >> (2) Developers should be able to declare that a particular container (i.e., STL or boost-like container) is wholly contained within a pool.
> >>
> >> (3) Beyond declarations (and possibly constructor initialization), no explicit code is required to be written by developers to support (2). All container manipulation primitives properly update the accounting.
> >>
> >> (4) Beyond construction/destruction costs, no container operation is burdened by additional code -- only implicit malloc/free operations are burdened with accounting.
> >>
> >> (5) The system tracks the aggregate amount of memory consumed in each pool, and it's relatively cheap to interrogate the current total consumption.
> >
> > Yes
> >
> >> (6) The system tracks the aggregate amount of memory consumed by each container in each pool -- but this is expensive to interrogate and is intended to be used primarily for debugging purposes.
> >
> > This one sounds like a nice-to-have to me. If there is a performance cost I would skip it.
> >
> >> (7) Generic object new/delete is possible, but not freed of the accounting requirements -- especially #6.
> >>
> >> (8) No backpressure is built into the scheme, i.e., nobody has to worry about suddenly being "out" of memory or being delayed -- just because some particular pool is filling up. That's a higher-level problem to solve. No memory is "reserved" either -- if you overcommit, that's also not solved at this layer. IMO, this is a crappy place to be doing ingest and flow control.
> >>
> >> (9) Implementation must be multi-thread and multi-socket aware. It should expect high levels of thread concurrency and avoid unnecessary global data manipulation (expect internal sharding of data structures -- something like an arena-based malloc scheme).
> >
> > Yes
> >
> >> Requirement 5 allows a "trimming" system to be developed. I think there are really two styles for this:
> >>
> >> (a) Time-based, i.e., periodically some thread wakes up and checks memory usage within a pool. If it doesn't like it, then it's responsible for "fixing" it, i.e., trimming as needed.
> >>
> >> (b) Event-based. No reason that we couldn't set up an event or condition variable per pool and have the malloc/free code trigger that condition/variable.
> >> It adds one or two compare/branches to each malloc/free operation (which is pretty cheap), but doesn't have the latency costs of (a). The downside is that this implicitly assumes a single global thread is responsible for cleaning each pool, which works well when there are a relatively small number of pools.
> >>
> >> Here is my list of anti-requirements:
> >>
> >> (1) No hierarchical relationship between the pools. [IMO, this is kewl, but unnecessary and tends to screw up your cache, i.e., destroys #9.]
> >>
> >> (2) No physical colocation of the allocated pool memory. The pool is "logical", i.e., an accounting mirage only.
> >>
> >> (3) No reason to dynamically create/destroy memory pools. They can be statically declared (this dramatically simplifies the code that uses this system).
> >
> > Yes. Great summary!
> >
> >> Let the discussion begin!!
> >> /////////////////////////
> >>
> >> Here is my proposed external interface to the design:
> >>
> >> First, look at the slab_xxxx containers that I just submitted. You can find them at
> >> https://github.com/allensamuels/ceph/blob/master/src/include/slab_containers.h
> >>
> >> I would propose to extend those containers as the basis for the memory pooling.
> >>
> >> First, there's a global enum that defines the memory pools -- yes, they're static and a small number:
> >>
> >> enum mpool_index {
> >>    MPOOL_ONE,
> >>    MPOOL_TWO,
> >>    ...
> >>    MPOOL_LAST
> >> };
> >>
> >> And a global object for each pool:
> >>
> >> class mpool; // TBD ... see below.
> >>
> >> extern mpool g_mpool[MPOOL_LAST]; // Actual definition of each pool
> >>
> >> Each slab_xxxx container template is augmented to receive an additional "enum mpool_index" parameter.
> >>
> >> That's ALL THAT'S required for the developer. In other words, if each definition of an STL container uses a typedef with the right mpool index, then you're done. The machinery takes care of everything else :)
> >
> > FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of g_mpool[MPOOL_FOO] to the allocator instance. The former hard-codes the global instance; the latter means you could manage the memory pool however you like (e.g., as part of the CephContext for librados). That's a small detail, though.
> >
> >> Standalone objects, i.e., naked new/delete, are easily done by making the equivalent of a slab_intrusive_list and maybe a macro or two. There's some tricky initialization for this one (see below).
> >>
> >> -------------------------------------------
> >>
> >> Implementation
> >>
> >> -------------------------------------------
> >>
> >> Requirement 6 is what drives this implementation.
> >>
> >> I would extend each slab_xxxx container to also virtually inherit from a Pool_Member interface; this interface allows the memory pool global machinery to implement #6.
> >>
> >> I propose that the ctor/dtor for Pool_Member (one for each container) put itself on a list within the respective memory pool. This MUST be a synchronized operation, but we can shard the list to reduce the collisions (use the low 4-5 bits of the creating thread pointer to index the shard -- this minimizes ctor expense but increases the dtor expense, which is often done in "trim"). This assumes that the rate of container creation/destruction within a memory pool is not super high -- we could make this a compile-time option if it becomes too expensive.
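
(To make the sharded-registration idea above concrete, here is a rough C++ sketch. It is only an illustration of the approach described in the quoted proposal, not actual Ceph code; the names (mpool, Pool_Member, kNumShards) and the use of a thread-id hash in place of "low bits of the creating thread pointer" are assumptions.)

// Rough sketch only; illustrative names, not actual Ceph code.
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <thread>

class Pool_Member;

class mpool {
public:
  static constexpr std::size_t kNumShards = 32;  // 2^5 shards, per the 4-5 bit idea
  struct Shard {
    std::mutex lock;
    std::list<Pool_Member*> members;             // every live container in this pool
  };
  Shard shards[kNumShards];

  // Stand-in for "low bits of the creating thread pointer": hash the
  // current thread id and keep the low bits.
  static std::size_t pick_shard() {
    std::size_t h = std::hash<std::thread::id>()(std::this_thread::get_id());
    return h & (kNumShards - 1);
  }
};

// Each slab_xxxx container would virtually inherit from this interface.
// The ctor/dtor perform the synchronized insert/erase described above;
// the virtual methods are what the per-container debug reports
// (requirement 6) would call.
class Pool_Member {
public:
  explicit Pool_Member(mpool& pool)
    : pool_(pool), shard_(mpool::pick_shard()) {
    std::lock_guard<std::mutex> l(pool_.shards[shard_].lock);
    std::list<Pool_Member*>& m = pool_.shards[shard_].members;
    pos_ = m.insert(m.end(), this);
  }
  virtual ~Pool_Member() {
    std::lock_guard<std::mutex> l(pool_.shards[shard_].lock);
    pool_.shards[shard_].members.erase(pos_);
  }
  virtual std::size_t element_count() const = 0;   // "how many elements do you have?"
  virtual std::size_t byte_count() const = 0;      // "how much memory do you hold?"
  virtual const char* type_signature() const = 0;  // printable type string

private:
  mpool& pool_;
  std::size_t shard_;
  std::list<Pool_Member*>::iterator pos_;
};

A trim or debug pass would then lock each shard in turn and walk its member list, which keeps the per-construction cost to a single mutex acquisition on a (hopefully) uncontended shard.
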
> >>
> >> The per-pool sharded lists allow the debug routines to visit each container and do things like ask "How many elements do you have?", "How big is each element?", and "Give me a printable string of the type signature for this container". Once you have this list you can generate lots of interesting debug reports, because you can sort by individual containers as well as group containers by their type signatures (i.e., combine the consumption of all "map<a,b>" containers as a group). You can report out both by byte count and by element count.
> >
> > Yeah, this sounds pretty nice. But I think it's important to be able to compile it out. I think we will have a lot of creations/destructions. For example, in BlueStore, Onodes have maps of Extents, those map to Blobs, and those have a BufferSpace with a map of Buffers for cached data. I expect that blobs and even onodes will be coming in and out of cache a lot.
> >
> >> This kind of information usually allows you to quickly figure out where the memory is being consumed. A bit of script wizardry would recognize that some containers contain other containers. For example, there's no reason a simple Python script couldn't recognize that each oNode might have a bunch of vector<pextents> within it and tell you things like the average number of pextents per oNode and the average DRAM consumption per oNode (which is a pretty complicated combination of pextents, lextents, buffer::ptr, etc.).
> >>
> >> Comments: ?????
> >
> > It would be nice to build this on top of existing allocator libraries if we can. For example, something in boost. I took a quick peek the other day and didn't find something that allowed simple interrogation about utilization, though, which was surprising. It would be nice to have something useful (perhaps without #6) that could be done relatively quickly and address all of the other requirements.
>
> If we want to do this in a lightweight way, I recommend referring to the seastar implementation (https://github.com/scylladb/seastar/blob/master/core/memory.hh, https://github.com/scylladb/seastar/blob/master/core/slab.hh). This can give a lot of insights.
>
> > sage
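
(To tie together the interface discussed above, namely the mpool_index enum, one typedef per container, Sage's point about passing g_mpool[MPOOL_FOO] rather than MPOOL_FOO, and the cheap aggregate counter of requirement 5, here is a minimal C++ sketch of what the developer-facing side could look like. Everything here is an illustrative assumption rather than actual Ceph code: pool_allocator, the MPOOL_* names, and the byte counter are stand-ins showing the shape of the interface, with a plain std::vector in place of the slab containers.)

// Illustrative sketch only; names are assumptions, not actual Ceph code.
#include <atomic>
#include <cstddef>
#include <new>
#include <vector>

enum mpool_index {
  MPOOL_BLUESTORE_ONODE,   // hypothetical pool names
  MPOOL_BLUESTORE_CACHE,
  MPOOL_LAST
};

struct mpool {
  std::atomic<long> allocated{0};   // requirement (5): cheap to interrogate
};

mpool g_mpool[MPOOL_LAST];          // a header would carry the extern declaration

// Allocator that charges/credits a pool. Per Sage's comment it takes the
// pool object itself rather than hard-coding the enum, so a CephContext
// could supply its own pools.
template <typename T>
struct pool_allocator {
  using value_type = T;
  mpool* pool;

  explicit pool_allocator(mpool& p) : pool(&p) {}
  template <typename U>
  pool_allocator(const pool_allocator<U>& o) : pool(o.pool) {}

  T* allocate(std::size_t n) {
    // Requirement (4): only the implicit malloc path pays for accounting.
    pool->allocated += static_cast<long>(n * sizeof(T));
    return static_cast<T*>(::operator new(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t n) {
    pool->allocated -= static_cast<long>(n * sizeof(T));
    ::operator delete(p);
  }
};

template <typename A, typename B>
bool operator==(const pool_allocator<A>& a, const pool_allocator<B>& b) {
  return a.pool == b.pool;
}
template <typename A, typename B>
bool operator!=(const pool_allocator<A>& a, const pool_allocator<B>& b) {
  return !(a == b);
}

// One typedef per container is all a developer would write:
using onode_vec = std::vector<int, pool_allocator<int>>;

int main() {
  onode_vec v{pool_allocator<int>(g_mpool[MPOOL_BLUESTORE_ONODE])};
  v.push_back(42);
  // Cheap check of aggregate pool consumption:
  return g_mpool[MPOOL_BLUESTORE_ONODE].allocated.load() > 0 ? 0 : 1;
}

The slab_xxxx containers would presumably bake the pool selection into their template parameters so that, as described in the proposal, a single typedef per container type is all a developer ever writes.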