On 16/11/16 20:01, Vladimir Davydov wrote:
> Hello,
>
> On Wed, Nov 16, 2016 at 10:44:59AM +1100, Balbir Singh wrote:
>> The lack of hotplug support makes us allocate all memory
>> upfront for per-node data structures. With a large number
>> of cgroups this can be an overhead. PPC64 actually limits
>> n_possible nodes to n_online to avoid some of this overhead.
>>
>> This patch adds the basic notifiers to listen to hotplug
>> events and does the allocation and freeing of those structures
>> per cgroup. We walk every cgroup per event; it's a trade-off
>> of allocating upfront vs allocating on demand and freeing
>> on offline.
>>
>> Cc: Tejun Heo <tj@xxxxxxxxxx>
>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
>>
>> Signed-off-by: Balbir Singh <bsingharora@xxxxxxxxx>
>> ---
>>  mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 60 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 91dfc7c..5585fce 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -63,6 +63,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/file.h>
>>  #include <linux/tracehook.h>
>> +#include <linux/memory.h>
>>  #include "internal.h"
>>  #include <net/sock.h>
>>  #include <net/ip.h>
>> @@ -1342,6 +1343,10 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>>  {
>>          return 0;
>>  }
>> +
>> +static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
>> +{
>> +}
>>  #endif
>>
>>  static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>> @@ -4115,14 +4120,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>>  {
>>          struct mem_cgroup_per_node *pn;
>>          int tmp = node;
>> -        /*
>> -         * This routine is called against possible nodes.
>> -         * But it's BUG to call kmalloc() against offline node.
>> -         *
>> -         * TODO: this routine can waste much memory for nodes which will
>> -         * never be onlined. It's better to use memory hotplug callback
>> -         * function.
>> -         */
>> +
>>          if (!node_state(node, N_NORMAL_MEMORY))
>>                  tmp = -1;
>>          pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
>> @@ -5773,6 +5771,59 @@ static int __init cgroup_memory(char *s)
>>  }
>>  __setup("cgroup.memory=", cgroup_memory);
>>
>> +static void memcg_node_offline(int node)
>> +{
>> +        struct mem_cgroup *memcg;
>> +
>> +        if (node < 0)
>> +                return;
>
> Is this possible?

Yes, please see node_states_check_changes_online/offline.

>
>> +
>> +        for_each_mem_cgroup(memcg) {
>> +                free_mem_cgroup_per_node_info(memcg, node);
>> +                mem_cgroup_may_update_nodemask(memcg);
>
> If memcg->numainfo_events is 0, mem_cgroup_may_update_nodemask() won't
> update memcg->scan_nodes. Is it OK?
>
>> +        }
>
> What if a memory cgroup is created or destroyed while you're walking the
> tree? Should we probably use get_online_mems() in mem_cgroup_alloc() to
> avoid that?
>

The iterator internally takes rcu_read_lock() to avoid any side-effects
of cgroups being added/removed. I suspect you are also suggesting using
get_online_mems() around each call to for_each_online_node. My
understanding so far is:

1. invalidate_reclaim_iterators - should be safe (no bad side-effects)
2. mem_cgroup_free - should be safe as well
3. mem_cgroup_alloc - needs protection (see the sketch below)
4. mem_cgroup_init - needs protection
5. mem_cgroup_remove_from_trees - should be safe
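A minimal sketch of what the protection for (3) could look like, assuming a
hypothetical helper factored out of mem_cgroup_alloc(). get_online_mems()/
put_online_mems() and for_each_online_node() are existing APIs and
alloc_mem_cgroup_per_node_info() is the function this patch touches; the
helper name and the error handling are illustrative only, not the actual
respin:

/*
 * Hypothetical helper (not part of this patch): allocate per-node info
 * for every currently online node while holding off memory hotplug, so
 * this walk cannot race with memcg_node_online()/memcg_node_offline().
 */
static int alloc_mem_cgroup_per_node_infos(struct mem_cgroup *memcg)
{
        int node;

        get_online_mems();
        for_each_online_node(node) {
                if (alloc_mem_cgroup_per_node_info(memcg, node)) {
                        put_online_mems();
                        return -ENOMEM;
                }
        }
        put_online_mems();
        return 0;
}

mem_cgroup_alloc() could then call this in place of its current
for_each_node() loop, leaving the existing failure path to free whatever
was already allocated.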
>> +}
>> +
>> +static void memcg_node_online(int node)
>> +{
>> +        struct mem_cgroup *memcg;
>> +
>> +        if (node < 0)
>> +                return;
>> +
>> +        for_each_mem_cgroup(memcg) {
>> +                alloc_mem_cgroup_per_node_info(memcg, node);
>> +                mem_cgroup_may_update_nodemask(memcg);
>> +        }
>> +}
>> +
>> +static int memcg_memory_hotplug_callback(struct notifier_block *self,
>> +                                         unsigned long action, void *arg)
>> +{
>> +        struct memory_notify *marg = arg;
>> +        int node = marg->status_change_nid;
>> +
>> +        switch (action) {
>> +        case MEM_GOING_OFFLINE:
>> +        case MEM_CANCEL_ONLINE:
>> +                memcg_node_offline(node);
>
> Judging by __offline_pages(), the MEM_GOING_OFFLINE event is emitted
> before migrating pages off the node. So, I guess freeing per-node info
> here isn't quite correct, as pages still need it to be moved from the
> node's LRU lists. Better move it to MEM_OFFLINE?
>

Good point, will redo.

>> +                break;
>> +        case MEM_GOING_ONLINE:
>> +        case MEM_CANCEL_OFFLINE:
>> +                memcg_node_online(node);
>> +                break;
>> +        case MEM_ONLINE:
>> +        case MEM_OFFLINE:
>> +                break;
>> +        }
>> +        return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block memcg_memory_hotplug_nb __meminitdata = {
>> +        .notifier_call = memcg_memory_hotplug_callback,
>> +        .priority = IPC_CALLBACK_PRI,
>
> I wonder why you chose this priority?
>

I just chose the lowest priority.

>> +};
>> +
>>  /*
>>   * subsys_initcall() for memory controller.
>>   *
>> @@ -5797,6 +5848,7 @@ static int __init mem_cgroup_init(void)
>>  #endif
>>
>>          hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>> +        register_hotmemory_notifier(&memcg_memory_hotplug_nb);
>>
>>          for_each_possible_cpu(cpu)
>>                  INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
>
> I guess we should modify mem_cgroup_alloc/free() in the scope of this
> patch, otherwise it doesn't make much sense IMHO. Maybe it's even worth
> merging patches 1 and 2 altogether.
>

Thanks for the review, I'll revisit the organization of the patches.

Balbir Singh
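For reference, a rough sketch of the MEM_OFFLINE-based rework discussed
above (hypothetical, not the actual respin): once freeing is deferred until
the offline has completed, MEM_CANCEL_OFFLINE no longer needs to
re-allocate anything, while MEM_CANCEL_ONLINE still has to undo the
allocation done at MEM_GOING_ONLINE.

static int memcg_memory_hotplug_callback(struct notifier_block *self,
                                         unsigned long action, void *arg)
{
        struct memory_notify *marg = arg;
        int node = marg->status_change_nid;

        switch (action) {
        case MEM_GOING_ONLINE:
                /* Allocate before any page on the node can reach the LRUs. */
                memcg_node_online(node);
                break;
        case MEM_CANCEL_ONLINE:  /* the online failed, undo the allocation */
        case MEM_OFFLINE:        /* pages have been migrated off the node */
                memcg_node_offline(node);
                break;
        default:
                break;
        }
        return NOTIFY_OK;
}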