We have seen the following situation on a test system: a 2-node system, each node with 32GB of memory. 2 gigantic (16GB) pages are reserved at boot time, both of which are allocated from node 1. SLUB notices this:

[    0.000000] SLUB: Unable to allocate memory from node 1
[    0.000000] SLUB: Allocating a useless per node structure in order to be able to continue

After boot, the user then did:

	echo 24 > /proc/sys/vm/nr_hugepages

and tasks are stuck:

[<c0000000010980b8>] kexec_stack+0xb8/0x8000
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

[<c00000004f9334b0>] 0xc00000004f9334b0
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

[<c00000004f91f440>] 0xc00000004f91f440
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb54c>] .nr_hugepages_store_common.isra.39+0xbc/0x1b0
[<c0000000003662cc>] .kobj_attr_store+0x2c/0x50
[<c0000000002b2c2c>] .sysfs_write_file+0xec/0x1c0
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

kswapd1 is also pegged at 100% CPU at this point.

If we go in and manually do:

	echo 24 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages

rather than relying on the interleaving allocator from the sysctl, the allocation succeeds (and the echo returns immediately).

I think we are hitting the following:

mm/hugetlb.c::alloc_fresh_huge_page_node():

	page = alloc_pages_exact_node(nid,
		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
			__GFP_REPEAT|__GFP_NOWARN,
		huge_page_order(h));

include/linux/gfp.h:

	#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

and mm/page_alloc.c::__alloc_pages_slowpath():

	/*
	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
	 * using a larger set of nodes after it has established that the
	 * allowed per node queues are empty and that nodes are
	 * over allocated.
	 */
	if (IS_ENABLED(CONFIG_NUMA) &&
	    (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
		goto nopage;

so we *do* reclaim in this callpath: the hugetlb mask sets __GFP_REPEAT rather than __GFP_NORETRY, so the comparison against GFP_THISNODE fails and we never take the nopage exit. Under my reading, since node 1 is exhausted, no matter how much work kswapd1 does, it will never reclaim memory from node 1 to satisfy a 16M page allocation request (or any other, for that matter).

I see the following possible changes/fixes, but am unsure if a) my analysis is right, and b) which is best.
1) Since we did notice early in boot that (in this case) node 1 was exhausted, perhaps we should mark it as such there somehow, and if a __GFP_THISNODE allocation request comes through on such a node, immediately fall through to nopage?

2) There is the following check:

	/*
	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
	 * specified, then we retry until we no longer reclaim any pages
	 * (above), or we've reclaimed an order of pages at least as
	 * large as the allocation's order. In both cases, if the
	 * allocation still fails, we stop retrying.
	 */
	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
		return 1;

I wonder if we should add a check that the pages we are reclaiming, when __GFP_THISNODE is set, actually come from the requested node? Something like:

	if (gfp_mask & __GFP_THISNODE &&
	    <the progress we have made is on the node requested?>)

3) did_some_progress could be updated to track where the progress is occurring, and if we are in a __GFP_THISNODE allocation request and we didn't make any progress on the correct node, we fail the allocation?

I think this situation could be reproduced (and I am working on it) by exhausting a NUMA node with 16M hugepages and then using the generic RR allocator to ask for more. Other node-exhaustion cases probably exist, but since we can't swap the hugepages, this seems like the most straightforward way to try and reproduce it.

Any thoughts on this? Am I way off base?

Thanks,
Nish