On 29.03.2014 [00:40:41 -0500], Christoph Lameter wrote:
> On Thu, 27 Mar 2014, Nishanth Aravamudan wrote:
>
> > > That looks to be the correct way to handle things. Maybe mark the node as
> > > offline or somehow not present so that the kernel ignores it.
> >
> > This is a SLUB condition:
> >
> > mm/slub.c::early_kmem_cache_node_alloc():
> > ...
> > 	page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
> > ...
>
> So the page allocation from the node failed. We have a strange boot
> condition where the OS is aware of a node but allocations on that node
> fail.

Yep. The node exists, it's just fully exhausted at boot (due to the
presence of 16GB pages reserved at boot-time).

> > 	if (page_to_nid(page) != node) {
> > 		printk(KERN_ERR "SLUB: Unable to allocate memory from "
> > 			"node %d\n", node);
> > 		printk(KERN_ERR "SLUB: Allocating a useless per node structure "
> > 			"in order to be able to continue\n");
> > 	}
> > ...
> >
> > Since this is quite early, and we have not set up the nodemasks yet,
> > does it make sense to perhaps have a temporary init-time nodemask that
> > we set bits in here, and "fix-up" those nodes when we setup the
> > nodemasks?
>
> Please take care of this earlier than this. The page allocator in
> general should allow allocations from all nodes with memory during
> boot,

I'd appreciate a bit more guidance? I'm suggesting that in this case the
node functionally has no memory, so the page allocator should not allow
allocations from it -- except (I still need to investigate this)
userspace accessing the 16GB pages on that node, but that, I believe,
doesn't go through the page allocator at all; it's all via hugetlb
interfaces.

It seems to me there is a bug in SLUB: we note that we have a useless
per-node structure for a given nid, but don't actually prevent requests
to that node, or the reclaim triggered on behalf of those allocations.
The page allocator is actually fine here, afaict.
We've pulled out memory from this node, even though it's present, so
none is free. All of that is working as expected, given the issue we've
seen.

The problems start when we "force" (by way of a round-robin page
allocation request from /proc/sys/vm/nr_hugepages) a THISNODE allocation
to come from the exhausted node. That node has no memory free, so the
allocation triggers reclaim, which makes progress on *other* nodes, and
thus never alleviates the allocation failure (and can't).

I think there is a logical bug (even if it only occurs in this
particular corner case): when reclaim makes progress for a THISNODE
allocation, we don't check *where* that progress happened, and so may
falsely conclude we are making progress when in fact the allocation
driving the reclaim cannot possibly succeed.

Thanks,
Nish