On Wed, Aug 22, 2018 at 04:45:17PM +0200, Michal Hocko wrote: > Now I am confused. How can compaction help at all then? I mean if the > node is full of GUP pins then you can hardly do anything but fallback to > other node. Or how come your new GFP flag makes any difference? It helps until the node is full. If you don't call compaction you will get zero THP even when you've plenty of free memory. So the free memory goes down and down as more and more THP are generated y compaction until compaction then fails with COMPACT_SKIPPED, there's not enough free memory to relocate an "order 9" amount of physically contiguous PAGE_SIZEd fragments. At that point the code calls reclaim to make space for a new compaction run. Then if that fails again it's not because there's no enough free memory. Problem is if you ever call reclaim when compaction fails, what happens is you free an "order 9" and then it gets eaten up by the app so then next compaction call, calls COMPACT_SKIPPED again. This is how compaction works since day zero it was introduced in kernel 2.6.x something, if you don't have crystal clear the inner workings of compaction you have an hard time to review this. So hope the above shed some light of how this plays out. So in general calling reclaim is ok because compaction fails more often than not in such case because it can't compact memory not because there aren't at least 2m free in any node. However when you use __GFP_THISNODE combined with reclaim that changes the whole angle and behavior of compaction if reclaim is still active. Not calling compaction in MADV_HUGEPAGE means you can drop MADV_HUGEPAGE as a whole. There's no point to ever set it unless we call compaction. And if you don't call at least once compaction you have near zero chances to get gigabytes of THP even if it's all compactable memory and there are gigabytes of free memory in the node, after some runtime that shakes the fragments in the buddy. To make it even more clear why compaction has to run once at least when MADV_HUGEPAGE is set, just check the second last column of your /proc/buddyinfo before and after "echo 3 >/proc/sys/vm/drop_caches; echo >/proc/sys/vm/compact_memory". Try to allocate memory without MADV_HUGEPAGE and without running the "echo 3; echo" and see how much THP you'll get. I've plenty of workloads that use MADV_HUGEPAGE not just qemu and that totally benefit immediately from THP and there's no point to ever defer compaction to khugepaged when userland says "this is a long lived allocation". Compaction is only potentially wasteful for short lived allocation, so MADV_HUGEPAGE has to call compaction. > It would still try to reclaim easy target as compaction requires. If you > do not reclaim at all you can make the current implementation of the > compaction noop due to its own watermark checks IIRC. That's the feature, if you don't make it a noop when watermark checks trigger, it'll end up wasting CPU and breaking vfio. The point is that we want compaction to run when there's free memory and compaction keeps succeeding. So when compaction fails, if it's because we finished all free memory in the node, we should just remove __GFP_THISNODE and allocate without it (i.e. the optimization). If compaction fails because the memory is fragmented but here's still free memory we should fail the allocation and trigger the THP fallback to PAGE_SIZE fault. Overall removing __GFP_THISNODE unconditionally would simply prioritize THP over NUMA locality which is the point of this special logic for THP. I can't blame the logic because it certainly helps NUMA balancing a lot in letting the memory be in the right place from the start. This is why __GFP_COMPACT_ONLY makes sense, to be able to retain the logic but still preventing the corner case of such __GFP_THISNODE that breaks the VM with MADV_HUGEPAGE. > yeah, I agree about PAGE_ALLOC_COSTLY_ORDER being an arbitrary limit for > a different behavior. But we already do handle those specially so it > kind of makes sense to me to expand on that. It's still a sign of one more place that needs magic for whatever reason. So unless it can be justified by some runtime tests I wouldn't make such change by just thinking about it. Reclaim is called if there's no free memory left anywhere for compaction to run (i.e. if __GFP_THISNODE is not set, if __GPF_THISNODE is set then the caller better use __GFP_COMPACT_ONLY). Now we could also get away without __GFP_COMPACT_ONLY, we could check __GFP_THISNODE and make it behave exactly like __GFP_COMPACT_ONLY whenever __GFP_DIRECT_RECLAIM was also set in addition of __GFP_THISNODE, but then you couldn't use __GFP_THISNODE as a mbind anymore and it would have more obscure semantics than a new flag I think. Thanks, Andrea