On Wed, Feb 24, 2010 at 04:24:16PM -0500, Rik van Riel wrote:
> The hugepage patchset as it stands tries to allocate huge
> pages synchronously, but will fall back to normal 4kB pages
> if they are not available.
>
> Similarly, khugepaged only compacts anonymous memory into
> hugepages if/when hugepages become available.
>
> Trying to always allocate hugepages synchronously would
> mean potentially having to defragment memory synchronously,
> before we can allocate memory for a page fault.
>
> While I have no numbers, I have the strong suspicion that
> the performance impact of potentially defragmenting 2MB
> of memory before each page fault could lead to more
> performance inconsistency than allocating small pages at
> first and having them collapsed into large pages later...
>
> The amount of work involved in making a 2MB page available
> could be fairly big, which is why I suspect we will be
> better off doing it asynchronously - preferably on an otherwise
> idle CPU core.

I agree. This is also why I have doubts we'll need a memory compaction
kernel thread that has to keep free hugepages always available for page
faults. But that's another topic, and the memory compaction kernel
thread may be worth it independent of khugepaged. Surely if there
weren't khugepaged, such a memory compaction kernel thread would be a
must, but we need khugepaged for other reasons too, so we may as well
take advantage of it to speed up short-lived allocations by not
requiring them to defrag memory. Long-lived allocations will be taken
care of by khugepaged.

The fundamental reason why khugepaged is unavoidable is that some
memory can be fragmented and not everything can be relocated. So when a
virtual machine quits and releases gigabytes of hugepages, we want to
use those freely available hugepages to create huge-pmds in the other
virtual machines that may be running on fragmented memory, to maximize
CPU efficiency at all times. The scan is slow, so it takes nearly zero
CPU time, except when it copies data (in which case we definitely want
to pay that CPU time), so it seems a good tradeoff.

As for sysctls that control defrag, there is only one right now and it
turns defrag on and off. We could make it more fine-grained and have
two files: one for the page faults in transparent_hugepage/defrag
taking always|madvise|never, and a yes|no one in
transparent_hugepage/khugepaged/defrag, but that may be overdesign...
I'm not really sure when and how to invoke memory compaction, so that
maximum amount of knobs is only required if we can't come up with an
optimal design. If we can come up with an optimal solution, the current
system-wide "yes|no" in transparent_hugepage/defrag should be enough
(currently it defaults to "no" because no real memory compaction is
invoked yet, and shrinking blindly isn't very helpful anyway, unless we
go with __GFP_REPEAT|GFP_IO|GFP_FS, which stalls the system often).
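To make the fault-time fallback Rik describes concrete, here is a
minimal userspace toy model (function names are made up; this is not
the actual fault path, just the policy it implements): try the 2MB
allocation without triggering defrag, and fall back to a 4kB page on
failure.

/* Toy model of the fault-time policy: attempt a cheap 2MB
 * allocation, fall back to 4kB if none is available right now. */
#include <stdio.h>
#include <stdlib.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* 2MB */
#define SUBPAGE_SIZE	4096UL

static void *alloc_hugepage_nodefrag(void)
{
	void *p;

	/* Stand-in for an order-9 alloc_pages() attempt made without
	 * heavy reclaim flags, so it fails fast instead of stalling. */
	if (posix_memalign(&p, HPAGE_SIZE, HPAGE_SIZE))
		return NULL;
	return p;
}

static void *fault_in_anon_page(void)
{
	void *p = alloc_hugepage_nodefrag();

	if (p)
		return p;	/* mapped as a huge-pmd right away */
	/* Fall back to 4kB; khugepaged can collapse the range later. */
	return malloc(SUBPAGE_SIZE);
}

int main(void)
{
	void *p = fault_in_anon_page();

	printf("allocated %s\n", p ? "memory" : "nothing");
	free(p);
	return 0;
}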
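The collapse step khugepaged performs once free hugepages show up
again (e.g. after a virtual machine quits) can be modeled the same
way. All names here are invented, and the real code also freezes
ptes, checks refcounts and remaps the huge-pmd under the proper
locks; only the copy, the one part that actually costs CPU, is
modeled.

#include <stdlib.h>
#include <string.h>

#define SUBPAGE_SIZE	4096UL
#define HPAGE_ORDER	9
#define HPAGE_NR	(1UL << HPAGE_ORDER)	/* 512 subpages per 2MB */

/*
 * 'small' holds HPAGE_NR separately allocated 4kB pages backing one
 * 2MB virtual range. Returns the hugepage, or NULL if none could be
 * allocated, in which case the 4kB pages are left untouched.
 */
static void *collapse_range(void *small[])
{
	char *huge;
	size_t i;

	if (posix_memalign((void **)&huge, SUBPAGE_SIZE << HPAGE_ORDER,
			   SUBPAGE_SIZE << HPAGE_ORDER))
		return NULL;

	for (i = 0; i < HPAGE_NR; i++) {
		/* The memcpy is where the scan actually burns CPU. */
		memcpy(huge + i * SUBPAGE_SIZE, small[i], SUBPAGE_SIZE);
		free(small[i]);
	}
	return huge;	/* would now be mapped by a single huge-pmd */
}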
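And a sketch of the semantics the two hypothetical knob files above
would have (enum and function names invented for illustration): the
always|madvise|never file decides whether a page fault may stall on
defrag, while khugepaged keeps an independent yes|no toggle.

#include <stdbool.h>

enum fault_defrag {
	FAULT_DEFRAG_ALWAYS,	/* defrag synchronously on every fault */
	FAULT_DEFRAG_MADVISE,	/* only inside MADV_HUGEPAGE regions */
	FAULT_DEFRAG_NEVER,	/* never stall a page fault on defrag */
};

/* Would this fault be allowed to invoke synchronous compaction? */
static bool fault_should_defrag(enum fault_defrag mode, bool vma_madvised)
{
	switch (mode) {
	case FAULT_DEFRAG_ALWAYS:
		return true;
	case FAULT_DEFRAG_MADVISE:
		return vma_madvised;
	case FAULT_DEFRAG_NEVER:
	default:
		return false;
	}
}

/* khugepaged's side stays a plain toggle: when off, it only collapses
 * into hugepages that are already free and never defrags itself. */
static bool khugepaged_should_defrag(bool knob_yes)
{
	return knob_yes;
}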