On Tue, Aug 21, 2018 at 05:30:11PM +0200, Vlastimil Babka wrote:
> If it's "not possible to compact" then the expected outcome of this is
> to fail?

It'll just call __alloc_pages_direct_compact once and fail if that
fails.

> You could do that without calling watermark checking explicitly, but
> it's rather complicated:
>
> 1. try allocating with __GFP_THISNODE and ~__GFP_DIRECT_RECLAIM
> 2. if that fails, try PAGE_SIZE with same flags
> 3. if that fails, try THP size without __GFP_THISNODE
> 4. PAGE_SIZE without __GFP_THISNODE

It's not complicated, it's slow: why call into the allocator 4 times
just to skip 1 watermark check?

> Yeah, not possible in current alloc_pages_vma() which should return the
> requested order. But the advantage is that it's not prone to races
> between watermark checking and actual attempt.

Watermark checking is always racy: zone_watermark_fast doesn't take any
lock before invoking rmqueue. The race in this case is the least of the
issues because the check doesn't need to be perfect: if once in a while
a THP is allocated from a remote node despite there being a bit more
PAGE_SIZEd memory free than expected in the local node, it wouldn't be
an issue. If it's wrong the other way around, it'll just behave
non-optimally (like today) once in a while.

> Frankly, I would rather go with this option and assume that if someone
> explicitly wants THP's, they don't care about NUMA locality that much.

I'm fine either way, both will work: large NUMA will prefer the
__GFP_COMPACT_ONLY option 1), small NUMA will likely prefer option 2),
and making this configurable at runtime with a different default is
possible later too, but then I'm not sure it's worth it. The main
benefit of 1) really is to cause the least possible interference to
NUMA balancing (and the further optimization possible with the
watermark check would not cause interference either).

> (Note: I hate __GFP_THISNODE, it's an endless source of issues.)
>
> Trying to be clever about "is there still PAGE_SIZEd free memory in the
> local node" is imperfect anyway. If there isn't, is it because there's
> clean page cache that we can easily reclaim (so it would be worth
> staying local) or is it really exhausted? Watermark check won't tell...

I'm not sure it's worth reclaiming cache to stay local: it totally
depends on the app running, and whether reclaiming cache to stay local
pays off is even harder to tell. This is why, especially on small NUMA,
option 2) that removes __GFP_THISNODE is likely to perform best.

However, the tradeoff about reclaiming clean cache or not, combined
with whether to do the watermark check on the PAGE_SIZEd free memory,
only matters for the MADV_HUGEPAGE case.

The default case (not even calling compaction in the page fault) would
definitely benefit from the (imperfect, racy) heuristic "is there still
PAGE_SIZEd free memory in the local node", and that remains true even
if we go with option 2) to solve the bug (option 2 only changes the
behavior of MADV_HUGEPAGE, not the default).

The imperfect watermark check is a net gain for the default case
without __GFP_DIRECT_RECLAIM set, because currently the code has zero
chance to get a THP that is already free and available in other nodes:
it only tries to get THPs from the local node. It only costs a fast
watermark check to get THPs from all nodes if they're already available
(or if made available by kswapd). It costs 1 cacheline just for the pcp
test, but then we're in the page allocator anyway, so all cachelines
involving watermarks may be activated regardless.

Thanks,
Andrea
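
P.S. to make the heuristic above concrete, below is a rough, untested
sketch of the "is there still PAGE_SIZEd free memory in the local node"
check for the default (!__GFP_DIRECT_RECLAIM) case. The helper name
local_node_has_base_pages() and the exact call site are made up for
illustration only, and the snippet is not against any particular tree;
the point is just the zone_watermark_ok()-style fast check.

#include <linux/gfp.h>
#include <linux/mmzone.h>

/*
 * Racy by design, like zone_watermark_fast(): no locks are taken, a
 * stale answer only costs a suboptimal THP placement once in a while.
 */
static bool local_node_has_base_pages(int nid, gfp_t gfp_mask)
{
	struct zonelist *zonelist = node_zonelist(nid, gfp_mask);
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
		if (zone_to_nid(zone) != nid)
			continue;
		/* order-0 check: is there still 4k memory free locally? */
		if (zone_watermark_ok(zone, 0, low_wmark_pages(zone),
				      gfp_zone(gfp_mask), 0))
			return true;
	}
	return false;
}

/*
 * Hypothetical call site, roughly where the THP page fault path builds
 * its gfp mask: keep __GFP_THISNODE only while the local node still has
 * order-0 memory free, otherwise allow an already-free remote THP
 * instead of falling back to a local PAGE_SIZEd page.
 */
	if (!(gfp & __GFP_DIRECT_RECLAIM) &&
	    !local_node_has_base_pages(node, gfp))
		gfp &= ~__GFP_THISNODE;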