On 23/11/16 20:28, Michal Hocko wrote:
> On Wed 23-11-16 19:37:16, Balbir Singh wrote:
>> On 23/11/16 19:07, Michal Hocko wrote:
>>> On Wed 23-11-16 18:50:42, Balbir Singh wrote:
>>>> On 23/11/16 18:25, Michal Hocko wrote:
>>>>> On Wed 23-11-16 15:36:51, Balbir Singh wrote:
>>>>>> In the absence of hotplug we use extra memory proportional to
>>>>>> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 has a patch
>>>>>> to disable large consumption with a large number of cgroups. This patch
>>>>>> adds hotplug support to memory cgroups and reverts the commit that
>>>>>> limited possible nodes to online nodes.
>>>>>
>>>>> Balbir,
>>>>> I have asked this in the previous version but there still seems to be a
>>>>> lack of information on _why_ we want this, _how_ much we save on the
>>>>> memory overhead on most systems and _why_ the additional complexity is
>>>>> really worth it. Please make sure to add all this in the cover letter.
>>>>
>>>> The data is in the patch referred to in patch 3. The order of waste was
>>>> 200MB for 400 cgroup directories, enough for us to restrict possible_map
>>>> to online_map. These patches allow us to have a larger possible map and
>>>> allow onlining nodes not in the online_map, which is currently a
>>>> restriction on ppc64.
>>>
>>> How common is it to have possible_map >> online_map? If this is ppc64
>>> then what is the downside of keeping the current restriction instead?
>>
>> On my system CONFIG_NODES_SHIFT is 8 (256 nodes) and possible_nodes is 2.
>> The downside is the ability to hotplug and online an offline node.
>> Please see http://www.spinics.net/lists/linux-mm/msg116724.html
>
> OK, so we are slowly getting to what I've asked originally ;) So who
> cares? Depending on CONFIG_NODES_SHIFT (which tends to be quite large in
> distribution or other general purpose kernels) the overhead is 424B (as
> per pahole on the current kernel) for one NUMA node. Most machines are
> expected to have 1-4 NUMA nodes, so the overhead might be somewhere
> around 100K per memcg (with 256 possible nodes). Not a trivial amount
> for sure, but I would rather encourage people to lower the possible node
> count for their hardware if it is artificially large.

On my desktop NODES_SHIFT is 6, and many distro kernels have it at 9. I've
known of solutions that use fake NUMA for partitioning and need as many
nodes as possible.

>>>> A typical system that I use has about 100-150 directories, depending on
>>>> the number of users/docker instances/configuration/virtual machines.
>>>> These numbers will only grow as we pack more of these instances on them.
>>>>
>>>> From a complexity viewpoint, the patches are quite straightforward.
>>>
>>> Well, I would like to hear more about that. {get,put}_online_mems()
>>> at random places doesn't sound all that straightforward to me.
>>
>> I thought those places were not random :) I tried to think them out as
>> discussed with Vladimir. I don't claim the code is bug-free; we can fix
>> any bugs as we test this more.
>
> I am more worried about synchronization with the hotplug, which tends to
> be a PITA in places where we were simply safe by definition until now. We
> do not have all that many users of memcg->nodeinfo[nid] from what I can
> see, but are all of them safe to never race with the hotplug? A lack of
> high-level design description is less than encouraging.

As in an explanation? The design is dictated by the notifier and the
actions to take when the node comes online/offline.
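Roughly, the shape is the one below. This is just a sketch to illustrate the
idea, not the actual patch; it assumes the existing
alloc_mem_cgroup_per_node_info()/free_mem_cgroup_per_node_info() helpers and
the for_each_mem_cgroup() iterator from mm/memcontrol.c, and the series may
differ in details:

	/*
	 * Sketch only: allocate/free per-node memcg state from the memory
	 * hotplug notifier, for every memcg, when a node changes state.
	 */
	static int memcg_memory_hotplug_callback(struct notifier_block *self,
						 unsigned long action, void *arg)
	{
		struct memory_notify *mn = arg;
		int nid = mn->status_change_nid;
		struct mem_cgroup *memcg;

		if (nid < 0)	/* no node changes state, nothing to do */
			return NOTIFY_OK;

		switch (action) {
		case MEM_GOING_ONLINE:
			/* allocate per-node state before the node can be used */
			for_each_mem_cgroup(memcg) {
				if (alloc_mem_cgroup_per_node_info(memcg, nid)) {
					mem_cgroup_iter_break(NULL, memcg);
					return notifier_from_errno(-ENOMEM);
				}
			}
			break;
		case MEM_CANCEL_ONLINE:
		case MEM_OFFLINE:
			/* node is gone (or never came up): drop per-node state */
			for_each_mem_cgroup(memcg)
				free_mem_cgroup_per_node_info(memcg, nid);
			break;
		default:
			break;
		}
		return NOTIFY_OK;
	}

	static int __init memcg_hotplug_init(void)
	{
		hotplug_memory_notifier(memcg_memory_hotplug_callback, 0);
		return 0;
	}
	subsys_initcall(memcg_hotplug_init);

The MEM_OFFLINE leg is of course where your race concern applies: the free
must also clear memcg->nodeinfo[nid] and exclude any in-flight users of it.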
> So please try to spend some time describing how we use nodeinfo
> currently, how the synchronization with the hotplug is supposed to work,
> and what guarantees that no stale nodeinfos can ever be used. This is
> just too easy to get wrong...

OK, I'll add that in the next cover letter.

Balbir