2020년 3월 13일 (금) 오전 1:42, Vlastimil Babka <vbabka@xxxxxxx>님이 작성: > > On 3/12/20 5:13 PM, Srikar Dronamraju wrote: > > * Vlastimil Babka <vbabka@xxxxxxx> [2020-03-12 14:51:38]: > > > >> > * Vlastimil Babka <vbabka@xxxxxxx> [2020-03-12 10:30:50]: > >> > > >> >> On 3/12/20 9:23 AM, Sachin Sant wrote: > >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx> wrote: > >> >> >> * Michal Hocko <mhocko@xxxxxxxxxx> [2020-03-11 12:57:35]: > >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: > >> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need > >> >> >>>> to make sure all possible but not present cpu_to_node are set to a > >> >> >>>> proper node. > >> >> >>> > >> >> >>> Just curious, is this somehow related to > >> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@xxxxxxxxxxxxxx? > >> >> >>> > >> >> >> > >> >> >> The issue I am trying to fix is a known issue in Powerpc since many years. > >> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate > >> >> >> shrinker_map on appropriate NUMA node"). > >> >> >> > >> > > >> > While I am not an expert in the slub area, I looked at the patch > >> > a75056fc1e7c and had some thoughts on why this could be causing this issue. > >> > > >> > On the system where the crash happens, the possible number of nodes is much > >> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only > >> > available for onlined nodes. > >> > > >> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node > >> > for all possible nodes and in ___slab_alloc we end up looking at the > >> > node_present_pages which is NODE_DATA(nid)->node_present_pages. > >> > i.e for a node whose pdgat struct is not allocated, we are trying to > >> > dereference. > >> > >> From what we saw, the pgdat does exist, the problem is that slab's per-node data > >> doesn't exist for a node that doesn't have present pages, as it would be a waste > >> of memory. > > > > Just to be clear > > Before my 3 patches to fix dummy node: > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible > > 0-31 > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online > > 0-1 > > OK > > >> > >> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In > >> Sachin's first report [1] we have > >> > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] > >> [ 0.000000] numa: NODE_DATA(0) on node 1 > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff] > >> > > > > So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the > > rest 30 nodes. > > I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should > check online first, as you suggested. > > >> But in this thread, with your patches Sachin reports: > > > > and with my patches > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible > > 0-31 > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online > > 1 > > > >> > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] > >> > > > > so we only see one pgdat. > > > >> So I assume it's just node 1. In that case, node_present_pages is really dangerous. > >> > >> [1] > >> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@xxxxxxxxxxxxxxxxxx/ > >> > >> > Also for a memoryless/cpuless node or possible but not present nodes, > >> > node_to_mem_node(node) will still end up as node (atleast on powerpc). > >> > >> I think that's the place where this would be best to fix. > >> > > > > Maybe. I thought about it but the current set_numa_mem semantics are apt > > for memoryless cpu node and not for possible nodes. We could have upto 256 > > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory. > > node_to_mem_node seems to return what is set in set_numa_mem(). > > set_numa_mem() seems to say set my numa_mem node for the current memoryless > > node to the param passed. > > > > But how do we set numa_mem for all the other 253 possible nodes, which > > probably will have 0 as default? > > > > Should we introduce another API such that we could update for all possible > > nodes? > > If we want to rely on node_to_mem_node() to give us something safe for each > possible node, then probably it would have to be like that, yeah. > > >> > I tried with this hunk below and it works. > >> > > >> > But I am not sure if we need to check at other places were > >> > node_present_pages is being called. > >> > >> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it > >> return only nodes that are online with present memory? > >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add > >> support for node_to_mem_node() to determine the fallback node") > >> > > > > Agree I lost all the memory about it. :) Anyway, how about this? 1. make node_present_pages() safer static inline node_present_pages(nid) { if (!node_online(nid)) return 0; return (NODE_DATA(nid)->node_present_pages); } 2. make node_to_mem_node() safer for all cases In ppc arch's mem_topology_setup(void) for_each_present_cpu(cpu) { numa_setup_cpu(cpu); mem_node = node_to_mem_node(numa_mem_id()); if (!node_present_pages(mem_node)) { _node_numa_mem_[numa_mem_id()] = first_online_node; } } With these two changes, we can uses node_present_pages() and node_to_mem_node() as intended. Thanks. > >> I think we do need well defined and documented rules around node_to_mem_node(), > >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so > >> everyone handles it the same, safe way. > > So let's try to brainstorm how this would look like? What I mean are some rules > like below, even if some details in my current understanding are most likely > incorrect: > > with nid present in: > N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online > node with memory so that we don't require everyone to search for it in slightly > different ways > N_ONLINE - pgdat must exist, there doesn't have to be present memory, > node_to_mem_node() still has to return something else (?) > N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself > N_HIGH_MEMORY - node has present high memory