On Mon 22-04-19 21:07:28, Mike Kravetz wrote:
[...]
> However, consider the case of a 2 node system where:
> node 0 has 2GB memory
> node 1 has 4GB memory
>
> Now, if one wants to allocate 4GB of huge pages they may be tempted to simply,
> "echo 2048 > nr_hugepages". At first this will go well until node 0 is out
> of memory. When this happens, alloc_pool_huge_page() will continue to be
> called. Because of that for_each_node_mask_to_alloc() macro, it will likely
> attempt to first allocate a page from node 0. It will call direct reclaim and
> compaction until it fails. Then, it will successfully allocate from node 1.

Yeah, the even distribution is quite a strong statement. We just try to
distribute somehow and it is unlikely to work really well on systems
with nodes that differ in size. I know it sucks but I've been
recommending to use
/sys/devices/system/node/node$N/hugepages/hugepages-2048kB/nr_hugepages
because that allows defining the actual policy much better. I guess we
want to be more specific about this in the documentation at least.

> In our distro kernel, I am thinking about making allocations try "less hard"
> on nodes where we start to see failures. less hard == NORETRY/NORECLAIM.
> I was going to try something like this on an upstream kernel when I noticed
> that it seems like direct reclaim may never end/exit. It 'may' exit, but I
> instrumented __alloc_pages_slowpath() and saw it take well over an hour
> before I 'tricked' it into exiting.
>
> [ 5916.248341] hpage_slow_alloc: jiffies 5295742 tries 2 node 0 success
> [ 5916.249271] reclaim 5295741 compact 1

This is unexpected though. What does tries mean? Number of reclaim
attempts? If yes, could you enable tracing to see what takes so long in
the reclaim path?

> This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
> with 8GB total memory.
>
> I have not started looking at the direct reclaim code to see exactly where
> we may be stuck, or trying really hard. My question is, "Is this expected
> or should direct reclaim be somewhat bounded?" With __alloc_pages_slowpath
> getting 'stuck' in direct reclaim, the documented behavior for huge page
> allocation is not going to happen.

Well, our "how hard to try for hugetlb pages" is quite arbitrary. We
used to retry as long as at least order worth of pages had been
reclaimed, but that stopped making sense once lumpy reclaim was gone.
So the semantics have changed to reclaim&compact as long as there is
some progress. From what I understand above, it seems that you are not
thrashing and calling reclaim again and again, but rather that one
reclaim round takes ages.

That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
looks like there is something wrong in the reclaim going on.
-- 
Michal Hocko
SUSE Labs
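
To make the per-node sysfs suggestion above concrete, here is a minimal Python sketch. It assumes 2MB huge pages and the 2GB/4GB two-node layout from the quoted example. The proportional split is only one possible policy, chosen for illustration and not something the kernel does, and the helper names (plan_per_node, sysfs_path, commands) are hypothetical. The script only prints the writes; it does not touch sysfs.

```python
HUGE_PAGE_KB = 2048  # assumption: 2MB huge pages, as in the sysfs path above


def plan_per_node(node_mem_kb, total_pages):
    """Split total_pages across nodes in proportion to node memory.

    Illustrative policy only (largest-remainder rounding), done in
    integer arithmetic so the result always sums to total_pages.
    """
    total_mem = sum(node_mem_kb.values())
    plan = {n: total_pages * m // total_mem for n, m in node_mem_kb.items()}
    remainder = {n: total_pages * m % total_mem for n, m in node_mem_kb.items()}
    leftover = total_pages - sum(plan.values())
    # Hand the pages lost to rounding to the nodes with the largest remainders.
    for n in sorted(remainder, key=lambda n: (-remainder[n], n))[:leftover]:
        plan[n] += 1
    return plan


def sysfs_path(node):
    """Per-node nr_hugepages file for the assumed huge page size."""
    return (f"/sys/devices/system/node/node{node}"
            f"/hugepages/hugepages-{HUGE_PAGE_KB}kB/nr_hugepages")


def commands(plan):
    """Render the plan as shell writes, one per node."""
    return [f"echo {pages} > {sysfs_path(node)}"
            for node, pages in sorted(plan.items())]


if __name__ == "__main__":
    GB_IN_KB = 1024 * 1024
    # Node sizes from the quoted example: node 0 has 2GB, node 1 has 4GB.
    plan = plan_per_node({0: 2 * GB_IN_KB, 1: 4 * GB_IN_KB}, 2048)
    for cmd in commands(plan):
        print(cmd)
```

For the quoted example this asks for 683 pages on node 0 and 1365 on node 1, instead of leaving the global nr_hugepages write to hammer node 0 once it runs dry. Real node sizes would have to come from the running system, and free memory per node will be lower than the nominal size.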