On Wed, Feb 24, 2010 at 04:24:16PM -0500, Rik van Riel wrote:
> The hugepage patchset as it stands tries to allocate huge
> pages synchronously, but will fall back to normal 4kB pages
> if they are not available.
>
> Similarly, khugepaged only compacts anonymous memory into
> hugepages if/when hugepages become available.
>
> Trying to always allocate hugepages synchronously would
> mean potentially having to defragment memory synchronously,
> before we can allocate memory for a page fault.
>
> While I have no numbers, I have the strong suspicion that
> the performance impact of potentially defragmenting 2MB
> of memory before each page fault could lead to more
> performance inconsistency than allocating small pages at
> first and having them collapsed into large pages later...
>
> The amount of work involved in making a 2MB page available
> could be fairly big, which is why I suspect we will be
> better off doing it asynchronously - preferably on an otherwise
> idle CPU core.

I agree. This is also why I have doubts we'll need a memory compaction
kernel thread that has to keep free hugepages always available for page
faults. But that's another topic, and the memory compaction kernel
thread may be worth it independent of khugepaged. Surely if there
weren't khugepaged, such a memory compaction kernel thread would be a
must, but we need khugepaged for other reasons too, so we may as well
take advantage of it to speed up short-lived allocations by not
requiring them to defrag memory. Long-lived allocations will be taken
care of by khugepaged.

The fundamental reason why khugepaged is unavoidable is that some
memory can be fragmented and not everything can be relocated. So when a
virtual machine quits and releases gigabytes of hugepages, we want to
use those freely available hugepages to create huge-pmds in the other
virtual machines that may be running on fragmented memory, to maximize
CPU efficiency at all times. The scan is slow, so it takes nearly zero
CPU time, except when it copies data (in which case we definitely want
to pay that CPU time), so it seems a good tradeoff.

As for sysctls that control defrag, there is only one right now and it
turns defrag on and off. We could make it more fine-grained and have
two files: one for the page faults in transparent_hugepage/defrag
taking always|madvise|never, and a yes|no one in
transparent_hugepage/khugepaged/defrag, but that may be overdesign...
I'm not really sure when and how to invoke memory compaction, so that
maximum amount of knobs is only required if we can't come up with an
optimal design. If we can come up with an optimal solution, the current
system-wide "yes|no" in transparent_hugepage/defrag should be enough
(currently it defaults to "no" because no real memory compaction is
invoked yet, and shrinking blindly isn't very helpful anyway, unless we
go with __GFP_REPEAT|GFP_IO|GFP_FS, which stalls the system often).
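To make the fault-time fallback Rik describes concrete, here is a
minimal userspace toy model (function names are made up; this is not
the actual fault path, just the policy it implements): try the 2MB
allocation without triggering defrag, and fall back to a 4kB page on
failure.

/* Toy model of the fault-time policy: attempt a cheap 2MB
 * allocation, fall back to 4kB if none is available right now. */
#include <stdio.h>
#include <stdlib.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* 2MB */
#define SUBPAGE_SIZE	4096UL

static void *alloc_hugepage_nodefrag(void)
{
	void *p;

	/* Stand-in for an order-9 alloc_pages() attempt made without
	 * heavy reclaim flags, so it fails fast instead of stalling. */
	if (posix_memalign(&p, HPAGE_SIZE, HPAGE_SIZE))
		return NULL;
	return p;
}

static void *fault_in_anon_page(void)
{
	void *p = alloc_hugepage_nodefrag();

	if (p)
		return p;	/* mapped as a huge-pmd right away */
	/* Fall back to 4kB; khugepaged can collapse the range later. */
	return malloc(SUBPAGE_SIZE);
}

int main(void)
{
	void *p = fault_in_anon_page();

	printf("allocated %s\n", p ? "memory" : "nothing");
	free(p);
	return 0;
}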
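The collapse step khugepaged performs once free hugepages show up
again (e.g. after a virtual machine quits) can be modeled the same
way. All names here are invented, and the real code also freezes
ptes, checks refcounts and remaps the huge-pmd under the proper
locks; only the copy, the one part that actually costs CPU, is
modeled.

#include <stdlib.h>
#include <string.h>

#define SUBPAGE_SIZE	4096UL
#define HPAGE_ORDER	9
#define HPAGE_NR	(1UL << HPAGE_ORDER)	/* 512 subpages per 2MB */

/*
 * 'small' holds HPAGE_NR separately allocated 4kB pages backing one
 * 2MB virtual range. Returns the hugepage, or NULL if none could be
 * allocated, in which case the 4kB pages are left untouched.
 */
static void *collapse_range(void *small[])
{
	char *huge;
	size_t i;

	if (posix_memalign((void **)&huge, SUBPAGE_SIZE << HPAGE_ORDER,
			   SUBPAGE_SIZE << HPAGE_ORDER))
		return NULL;

	for (i = 0; i < HPAGE_NR; i++) {
		/* The memcpy is where the scan actually burns CPU. */
		memcpy(huge + i * SUBPAGE_SIZE, small[i], SUBPAGE_SIZE);
		free(small[i]);
	}
	return huge;	/* would now be mapped by a single huge-pmd */
}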
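And a sketch of the semantics the two hypothetical knob files above
would have (enum and function names invented for illustration): the
always|madvise|never file decides whether a page fault may stall on
defrag, while khugepaged keeps an independent yes|no toggle.

#include <stdbool.h>

enum fault_defrag {
	FAULT_DEFRAG_ALWAYS,	/* defrag synchronously on every fault */
	FAULT_DEFRAG_MADVISE,	/* only inside MADV_HUGEPAGE regions */
	FAULT_DEFRAG_NEVER,	/* never stall a page fault on defrag */
};

/* Would this fault be allowed to invoke synchronous compaction? */
static bool fault_should_defrag(enum fault_defrag mode, bool vma_madvised)
{
	switch (mode) {
	case FAULT_DEFRAG_ALWAYS:
		return true;
	case FAULT_DEFRAG_MADVISE:
		return vma_madvised;
	case FAULT_DEFRAG_NEVER:
	default:
		return false;
	}
}

/* khugepaged's side stays a plain toggle: when off, it only collapses
 * into hugepages that are already free and never defrags itself. */
static bool khugepaged_should_defrag(bool knob_yes)
{
	return knob_yes;
}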