On 06/08/2017 09:45 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@xxxxxxxx>
>
> new_node_page() will try to use the origin's next NUMA node as the
> migration destination for hugetlb pages. If such a node doesn't have
> any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
> to allocate a surplus page instead. This is quite suboptimal for any
> configuration where hugetlb pages are not distributed evenly across all
> NUMA nodes. Say we have a hotpluggable node 4 and the spare hugetlb
> pages are on node 0:
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
> /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
>
> Now we consume the whole pool on node 4 and try to offline this node.
> All the allocated pages should be moved to node 0, which has enough
> preallocated pages to hold them. With the current implementation
> offlining very likely fails, because hugetlb allocations during runtime
> are much less reliable.
>
> Fix this by reusing the nodemask which excludes the migration source:
> first try to find a node which still has a page in its preallocated
> pool, and fall back to __alloc_buddy_huge_page_no_mpol only when the
> whole pool is consumed.
>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>

Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
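
For readers following along, below is a rough sketch (not the patch itself)
of the allocation strategy described in the changelog: walk the nodemask
that already excludes the migration source, dequeue from the first node
that still has pages in its preallocated pool, and only fall back to
__alloc_buddy_huge_page_no_mpol() when every pool is empty. The helper
name alloc_huge_page_from_pool_sketch() is made up for illustration;
dequeue_huge_page_node(), hugetlb_lock and the free_huge_pages counters
are mm/hugetlb.c internals, so this would only compile in that context,
and the actual patch may wire things up differently.

/*
 * Illustrative sketch only; assumes it lives in mm/hugetlb.c where
 * struct hstate, hugetlb_lock and dequeue_huge_page_node() are visible.
 */
static struct page *alloc_huge_page_from_pool_sketch(struct hstate *h,
						     const nodemask_t *nmask)
{
	struct page *page = NULL;
	int nid;

	spin_lock(&hugetlb_lock);
	/* Prefer pages already sitting in the preallocated per-node pools. */
	for_each_node_mask(nid, *nmask) {
		/* A complete version would also respect resv_huge_pages here. */
		if (h->free_huge_pages_node[nid]) {
			page = dequeue_huge_page_node(h, nid);
			if (page)
				break;
		}
	}
	spin_unlock(&hugetlb_lock);

	if (page)
		return page;

	/*
	 * Whole preallocated pool is consumed: fall back to a runtime
	 * (surplus) allocation with no node preference.
	 */
	return __alloc_buddy_huge_page_no_mpol(h, NUMA_NO_NODE);
}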