On Wed 22-11-17 16:28:32, Michal Hocko wrote:
> Hi,
> is there any reason why we enforce the overcommit limit during hugetlb
> pages migration? It's in alloc_huge_page_node->__alloc_buddy_huge_page
> path. I am wondering whether this is really an intentional behavior.
> The page migration allocates a page just temporarily so we should be
> able to go over the overcommit limit for the migration duration. The
> reason I am asking is that hugetlb pages tend to be utilized usually
> (otherwise the memory would be just wasted and pool shrunk) but then
> the migration simply fails which breaks memory hotplug and other
> migration dependent functionality which is quite suboptimal. You can
> workaround that by increasing the overcommit limit.
>
> Why don't we simply migrate as long as we are able to allocate the
> target hugetlb page? I have a half baked patch to remove this
> restriction, would there be an opposition to do something like that?

So I finally got to think about this some more and looked more
thoroughly at how we actually account things. And it is, as both of you
expected, quite subtle and not easy to get around. Per NUMA node pools
make things quite complicated. Why? Migration can really increase the
overall pool size. Say we are migrating from Node1 to Node2. Node2
doesn't have any pre-allocated pages but assume that the overcommit
allows us to move on. All good. Except that the original page will
return to the pool because free_huge_page will see Node1 without any
surplus pages and therefore move the page back to the pool. Node2 will
release the surplus page only after it is freed, which can take an
unbounded amount of time. While we are still effectively under the
overcommit limit, the semantic is kind of strange and I am not sure the
behavior is really intended. I see why the per node surplus counter is
used here. We simply want to maintain per node counts after a regular
page free.
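
To make the accounting above concrete, here is a minimal user-space
sketch. It is a toy model and not the kernel code: struct toy_hstate and
all toy_* helpers are hypothetical names made up for illustration, and
the only part meant to mirror the behavior described above is the
surplus check in the free path.

```c
#include <assert.h>

/* Toy user-space model, NOT kernel code. All names are hypothetical. */
enum { TOY_NODE1, TOY_NODE2, TOY_NR_NODES };

struct toy_hstate {
	int nr_pages[TOY_NR_NODES];      /* huge pages the pool holds per node */
	int free_pages[TOY_NR_NODES];    /* of those, pages sitting free in the pool */
	int surplus_pages[TOY_NR_NODES]; /* of those, pages counted as surplus */
};

/* Allocate one in-use page on @nid beyond the persistent pool (overcommit). */
static void toy_alloc_surplus(struct toy_hstate *h, int nid)
{
	assert(nid >= 0 && nid < TOY_NR_NODES);
	h->nr_pages[nid]++;
	h->surplus_pages[nid]++;
}

/* Free one in-use page on @nid, deciding by the per node surplus counter only. */
static void toy_free(struct toy_hstate *h, int nid)
{
	assert(nid >= 0 && nid < TOY_NR_NODES);
	if (h->surplus_pages[nid] > 0) {
		/* node has surplus: give the page back to the page allocator */
		h->surplus_pages[nid]--;
		h->nr_pages[nid]--;
	} else {
		/* no surplus on this node: the page stays in the pool as free */
		h->free_pages[nid]++;
	}
}

/* Migrate one in-use page: allocate the target first, then free the source. */
static void toy_migrate(struct toy_hstate *h, int from, int to)
{
	toy_alloc_surplus(h, to);
	toy_free(h, from);
}

/* Overall pool size across all nodes. */
static int toy_total(const struct toy_hstate *h)
{
	int nid, total = 0;

	for (nid = 0; nid < TOY_NR_NODES; nid++)
		total += h->nr_pages[nid];
	return total;
}
```

Starting with a single in-use page on Node1 and nothing on Node2,
toy_migrate(&h, TOY_NODE1, TOY_NODE2) leaves Node1 with one free pool
page and Node2 with one surplus page, so the overall size is 2 rather
than 1 until the Node2 page gets freed.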

So I was thinking to add a temporary/migrate state to the huge page for
migration pages (start with the new page, state transferred to the old
page on success) and free such a page to the allocator regardless of
the surplus counters. This would mean that the page migration might
change inter node pool sizes but I guess that should be acceptable.

What do you guys think? I can send a draft patch if that helps you to
understand the idea.
--
Michal Hocko
SUSE Labs