Yinghai Lu wrote:
> Ingo Molnar wrote:
>> * Yinghai Lu <yinghai@xxxxxxxxxx> wrote:
>>
>>> Andrew Morton wrote:
>>>> On Mon, 22 Jun 2009 08:38:50 -0700
>>>> Yinghai Lu <yinghai@xxxxxxxxxx> wrote:
>>>>
>>>>> Nathan reported that
>>>>>
>>>>> | commit 73d60b7f747176dbdff826c4127d22e1fd3f9f74
>>>>> | Author: Yinghai Lu <yinghai@xxxxxxxxxx>
>>>>> | Date:   Tue Jun 16 15:33:00 2009 -0700
>>>>> |
>>>>> |     page-allocator: clear N_HIGH_MEMORY map before we set it again
>>>>> |
>>>>> |     SRAT tables may contains nodes of very small size.  The arch code may
>>>>> |     decide to not activate such a node.  However, currently the early boot
>>>>> |     code sets N_HIGH_MEMORY for such nodes.  These nodes therefore seem to be
>>>>> |     active although these nodes have no present pages.
>>>>> |
>>>>> |     For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too
>>>>>
>>>>> the cpuset.mems cgroup attribute on an i386 kvm guest
>>>>>
>>>>> fix it by only clearing node_states[N_NORMAL_MEMORY] for 64bit only.
>>>>> and need to do save/restore for that in find_zone_movable_pfn
>>>>>
>>>> There appear to be some words omitted from this changelog - it doesn't
>>>> make sense.
>>>>
>>>> I think that perhaps a line got deleted before "the cpuset.mems cgroup
>>>> ...".  That was the line which actually describes the bug which we're
>>>> fixing.  Or perhaps it was a single word?  "zeroes".
>>>>
>>>> I did this:
>>>>
>>>> Nathan reported that
>>>> :
>>>> : | commit 73d60b7f747176dbdff826c4127d22e1fd3f9f74
>>>> : | Author: Yinghai Lu <yinghai@xxxxxxxxxx>
>>>> : | Date:   Tue Jun 16 15:33:00 2009 -0700
>>>> : |
>>>> : |     page-allocator: clear N_HIGH_MEMORY map before we set it again
>>>> : |
>>>> : |     SRAT tables may contains nodes of very small size.  The arch code may
>>>> : |     decide to not activate such a node.  However, currently the early boot
>>>> : |     code sets N_HIGH_MEMORY for such nodes.  These nodes therefore seem to be
>>>> : |     active although these nodes have no present pages.
>>>> : |
>>>> : |     For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too
>>>> :
>>> "
>>>> : unintentionally and incorrectly clears the cpuset.mems cgroup attribute on
>>>> : an i386 kvm guest
>>> "
>>> ==>
>>>
>>> 32bit assumes the N_NORMAL_MEMORY bit and the N_HIGH_MEMORY bit are
>>> always set for node 0.
>>
>> Where in the code is this assumption?
>
> in mm/page_alloc.c:
>
> /*
>  * Array of node states.
>  */
> nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> 	[N_POSSIBLE] = NODE_MASK_ALL,
> 	[N_ONLINE] = { { [0] = 1UL } },
> #ifndef CONFIG_NUMA
> 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> #ifdef CONFIG_HIGHMEM
> 	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> #endif
> 	[N_CPU] = { { [0] = 1UL } },
> #endif	/* NUMA */
> };
> EXPORT_SYMBOL(node_states);
>
> For x86 64bit, we clear POSSIBLE and ONLINE in
> arch/x86/mm/numa_64.c::initmem_init, and this patch clears NORMAL in
> arch/x86/mm/init_64.c::paging_init (see the sketch just below).
>
> For x86 32bit, ONLINE gets cleared in get_memcfg_from_srat(), and
> NORMAL_MEMORY and HIGH_MEMORY are not cleared before we try to set them
> again in mm/page_alloc.c::free_area_init_nodes.
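To make that concrete: a minimal sketch of the 64bit-only clearing in
arch/x86/mm/init_64.c::paging_init.  Only the nodes_clear() line is confirmed
by the grep below; the max_zone_pfns setup is paraphrased from the code of
that era, so treat this as a sketch rather than the verbatim patch:

void __init paging_init(void)
{
	unsigned long max_zone_pfns[MAX_NR_ZONES];

	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
	max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN;
	max_zone_pfns[ZONE_DMA32] = MAX_DMA32_PFN;
	max_zone_pfns[ZONE_NORMAL] = max_pfn;

	/*
	 * Clear the default N_NORMAL_MEMORY state before
	 * free_area_init_nodes() sets it again from nodes that really
	 * have present pages.  This file is only built on 64bit, so the
	 * static node 0 defaults in mm/page_alloc.c stay untouched on
	 * 32bit.
	 */
	nodes_clear(node_states[N_NORMAL_MEMORY]);

	free_area_init_nodes(max_zone_pfns);
}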
>>> and some code only checks if HIGH_MEMORY is there to know if
>>> NORMAL_MEMORY is there.
>>
>> Which code is that exactly?
>
> with grep:
>
> arch/x86/mm/init_64.c:	nodes_clear(node_states[N_NORMAL_MEMORY]);
> drivers/base/node.c:	return print_nodes_state(N_NORMAL_MEMORY, buf);
> include/linux/nodemask.h:	N_NORMAL_MEMORY,	/* The node has regular memory */
> include/linux/nodemask.h:	N_HIGH_MEMORY = N_NORMAL_MEMORY,
> mm/memcontrol.c:	if (!node_state(node, N_NORMAL_MEMORY))
> mm/page_alloc.c:	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> mm/page_alloc.c:	node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c:	for_each_node_state(node, N_NORMAL_MEMORY)
>
> Documentation/cgroups/cpusets.txt:automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
> Documentation/memory-hotplug.txt:status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
> arch/ia64/kernel/uncached.c:	if (!node_state(nid, N_HIGH_MEMORY))
> drivers/base/node.c:	return print_nodes_state(N_HIGH_MEMORY, buf);
> include/linux/cpuset.h:#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
> include/linux/nodemask.h:	N_HIGH_MEMORY,	/* The node has regular or high memory */
> include/linux/nodemask.h:	N_HIGH_MEMORY = N_NORMAL_MEMORY,
> kernel/cpuset.c: * found any online mems, return node_states[N_HIGH_MEMORY].
> kernel/cpuset.c: * of node_states[N_HIGH_MEMORY].
> kernel/cpuset.c:	node_states[N_HIGH_MEMORY]))
> kernel/cpuset.c:	node_states[N_HIGH_MEMORY]);
> kernel/cpuset.c:	*pmask = node_states[N_HIGH_MEMORY];
> kernel/cpuset.c:	BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
> kernel/cpuset.c: * top_cpuset.mems_allowed tracks node_stats[N_HIGH_MEMORY];
> kernel/cpuset.c:	node_states[N_HIGH_MEMORY]))
> kernel/cpuset.c:	nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
> kernel/cpuset.c:	node_states[N_HIGH_MEMORY]);
> kernel/cpuset.c: * Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
> kernel/cpuset.c: * Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
> kernel/cpuset.c:	top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
> kernel/cpuset.c:	top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
> kernel/cpuset.c: * subset of node_states[N_HIGH_MEMORY], even if this means going outside the
> mm/memcontrol.c:	for_each_node_state(node, N_HIGH_MEMORY) {
> mm/memory_hotplug.c:	node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
> mm/mempolicy.c:	if (!nodes_subset(new, node_states[N_HIGH_MEMORY])) {
> mm/mempolicy.c:	for_each_node_state(nid, N_HIGH_MEMORY) {
> mm/mempolicy.c:	if (!nodes_subset(nodes, node_states[N_HIGH_MEMORY]))
> mm/mempolicy.c:	nodes = node_states[N_HIGH_MEMORY];
> mm/mempolicy.c:	&node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
> mm/mempolicy.c:	for_each_node_state(n, N_HIGH_MEMORY)
> mm/migrate.c:	if (!node_state(node, N_HIGH_MEMORY))
> mm/oom_kill.c:	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> mm/page-writeback.c:	for_each_node_state(node, N_HIGH_MEMORY) {
> mm/page_alloc.c:	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> mm/page_alloc.c: * tasks mems_allowed, or node_states[N_HIGH_MEMORY].)
> mm/page_alloc.c:	&node_states[N_HIGH_MEMORY];
> mm/page_alloc.c:	for_each_node_state(n, N_HIGH_MEMORY) {
> mm/page_alloc.c:	(nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
> mm/page_alloc.c: * Populate N_HIGH_MEMORY for calculating usable_nodes.
> mm/page_alloc.c:	node_set_state(early_node_map[i].nid, N_HIGH_MEMORY);
> mm/page_alloc.c:	nodemask_t saved_node_state = node_states[N_HIGH_MEMORY];
> mm/page_alloc.c:	int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
> mm/page_alloc.c:	for_each_node_state(nid, N_HIGH_MEMORY) {
> mm/page_alloc.c:	node_states[N_HIGH_MEMORY] = saved_node_state;
> mm/page_alloc.c:	node_set_state(nid, N_HIGH_MEMORY);
> mm/vmalloc.c:	for_each_node_state(nr, N_HIGH_MEMORY)
> mm/vmscan.c:	for_each_node_state(nid, N_HIGH_MEMORY) {
> mm/vmscan.c:	for_each_node_state(nid, N_HIGH_MEMORY)
> mm/vmstat.c:	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
>
> For 64bit, N_HIGH_MEMORY == N_NORMAL_MEMORY.
>
> For 32bit, there are more references to N_HIGH_MEMORY...

> - Why is this patch good/desired?

It fixes the breakage of the cpuset.mems cgroup attribute on an i386 kvm
guest.

> - What did prior code do, and why was that wrong?

It cleared node_states[N_HIGH_MEMORY] for both 32bit and 64bit.  Actually
we only need to clear it for 64bit, to handle some strange case like a
small range in one node.

> - What were the bad effects.  (crash, right?)

cpuset.mems can not be used ...

> - What does this patch do to achieve that good status?

It fixes the problem with cpuset.mems and keeps the clearing only for x86
64bit.

YH
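For reference, the save/restore in find_zone_movable_pfn that the original
changelog mentions follows the pattern visible in the mm/page_alloc.c grep
hits above.  A condensed sketch, assuming the 2.6.30-era early_node_map[]
bookkeeping and with the kernelcore distribution elided; not the verbatim
kernel code:

static void __init find_zone_movable_pfns_for_nodes(unsigned long *movable_pfn)
{
	nodemask_t saved_node_state;
	int i, nid, usable_nodes;

	/* Populate N_HIGH_MEMORY for calculating usable_nodes */
	for (i = 0; i < nr_nodemap_entries; i++)
		node_set_state(early_node_map[i].nid, N_HIGH_MEMORY);

	/* Save the state before borrowing the nodemask */
	saved_node_state = node_states[N_HIGH_MEMORY];
	usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);

	/* ... distribute kernelcore among the nodes that have memory ... */
	for_each_node_state(nid, N_HIGH_MEMORY) {
		/* per-node ZONE_MOVABLE start pfn calculation, elided */
	}

	/* Restore, so the real bits are set later from present pages only */
	node_states[N_HIGH_MEMORY] = saved_node_state;
}

This is what makes the 64bit-only clearing safe: the function borrows
node_states[N_HIGH_MEMORY] as scratch space and puts the saved mask back, so
the mask that cpuset.mems tracks ends up reflecting only nodes with present
pages.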