Re: [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 17 Jun 2015 18:20:17 -0700 (PDT)

On Mon, 11 May 2015, Vlastimil Babka wrote:

> Since we track THP availability for khugepaged THP collapses, we can use it
> also for page fault THP allocations. If khugepaged with its sync compaction
> is not able to allocate a hugepage, then it's unlikely that the less involved
> attempt on page fault would succeed, and the cost could be higher than THP
> benefits. Also clear the THP availability flag if we do attempt and fail to
> allocate during page fault, and set the flag if we are freeing a large enough
> page from any context. The latter doesn't include merges, as that's a fast
> path and unlikely to make much difference.
> 

That depends on how long {scan,alloc}_sleep_millisecs are.  If khugepaged 
fails to allocate a hugepage on all nodes, it sleeps for 
alloc_sleep_millisecs (default 60s); if memory is then freed immediately 
afterwards, thp page faults still don't happen again for 60s.  That's 
scary to me when thp_avail_nodes is clear, a large process terminates, and 
then immediately starts back up.  None of its memory is faulted as thp 
and, depending on how large it is, khugepaged may fail to allocate 
hugepages when it wakes back up, so it never scans (the only reason why 
thp_avail_nodes was clear before it terminated originally).

I'm not sure that approach can work unless the inference of whether a 
hugepage can be allocated at a given time is a very good indicator of 
whether a hugepage can be allocated alloc_sleep_millisecs later, and I'm 
afraid that's not the case.
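
To make the window concrete, here is a toy model of that scenario.  It is 
only a sketch: the function name and parameters are made up for 
illustration, and it assumes the freed memory comes back through buddy 
merges, so the availability flag is not set again until khugepaged's next 
wakeup.

```python
ALLOC_SLEEP_MS = 60_000  # khugepaged alloc_sleep_millisecs default (60s)


def thp_faults_skipped(fail_ms, free_ms, fault_times_ms,
                       alloc_sleep_ms=ALLOC_SLEEP_MS):
    """Return the fault times that skip THP although memory is already free.

    fail_ms: time khugepaged failed to allocate and cleared the flag.
    free_ms: time a large amount of memory was freed (e.g. a process exit).
    Assumption: the flag stays clear until khugepaged wakes up again at
    fail_ms + alloc_sleep_ms.
    """
    next_wakeup_ms = fail_ms + alloc_sleep_ms
    return [t for t in fault_times_ms if free_ms <= t < next_wakeup_ms]


# khugepaged fails at t=0, memory is freed at t=1s, the restarted process
# faults at 2s and 30s (no thp attempted) and at 61s (after the wakeup).
skipped = thp_faults_skipped(0, 1_000, [500, 2_000, 30_000, 61_000])
```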

I'm very happy that you're looking at thp fault latency and the role that 
khugepaged can play in accepting responsibility for defragmentation, 
though.  It's an area that has caused me some trouble lately and I'd like 
to be able to improve.

We see an immediate benefit when experimenting with doing synchronous 
memory compactions of all memory every 15s.  That's done using a cronjob 
rather than khugepaged, but the idea is the same.
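
For reference, a minimal sketch of that kind of periodic trigger.  It uses 
the /proc/sys/vm/compact_memory sysctl (available with CONFIG_COMPACTION, 
needs root); the helper names and the loop are illustrative assumptions, 
not the actual job we run.

```python
import time

COMPACT_MEMORY = "/proc/sys/vm/compact_memory"


def compact_all_memory(path=COMPACT_MEMORY):
    """Writing 1 to the sysctl asks the kernel to compact all zones."""
    with open(path, "w") as f:
        f.write("1\n")


def run_periodically(action, interval_s, iterations, sleep=time.sleep):
    """Call action(), then sleep interval_s seconds, `iterations` times."""
    for _ in range(iterations):
        action()
        sleep(interval_s)


if __name__ == "__main__":
    # Compact all memory every 15s, four times (one minute in total).
    run_periodically(compact_all_memory, 15, 4)
```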

What would your thoughts be about doing something radical like

 - having khugepaged do synchronous memory compaction of all memory at
   regular intervals,

 - track how many pageblocks are free for thp memory to be allocated,

 - terminate collapsing if free pageblocks are below a threshold,

 - trigger a khugepaged wakeup at page fault when that number of 
   pageblocks falls below a threshold,

 - determine the next full sync memory compaction based on how many
   pageblocks were defragmented on the last wakeup, and

 - avoid memory compaction for all thp page faults.

(I'd ignore what is actually the responsibility of khugepaged and what is 
done in task work at this time.)
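
The list above could be modelled roughly as follows.  This is a userspace 
sketch of the policy only, not kernel code: every name, threshold and 
default is hypothetical, and the interval rule (compact sooner after a 
productive pass, back off after an empty one) is just one possible reading 
of "based on how many pageblocks were defragmented".

```python
from dataclasses import dataclass


@dataclass
class CompactionPolicy:
    collapse_min_free: int = 16    # hypothetical: stop collapsing below this
    wakeup_min_free: int = 8       # hypothetical: faults wake khugepaged below this
    busy_defrag: int = 32          # hypothetical: "productive" compaction pass
    min_interval_s: float = 15.0
    max_interval_s: float = 240.0
    interval_s: float = 60.0       # time until the next full sync compaction
    free_pageblocks: int = 0       # pageblocks currently free for thp

    def may_collapse(self) -> bool:
        """khugepaged terminates collapsing when free pageblocks run low."""
        return self.free_pageblocks >= self.collapse_min_free

    def on_page_fault(self) -> bool:
        """Return True if the fault should wake khugepaged.

        The fault path itself never compacts in this model.
        """
        return self.free_pageblocks < self.wakeup_min_free

    def after_compaction(self, defragmented: int, free_now: int) -> float:
        """Record the result of a full compaction and pick the next interval."""
        self.free_pageblocks = free_now
        if defragmented >= self.busy_defrag:
            # Lots of work found: fragmentation builds quickly, compact sooner.
            self.interval_s = max(self.min_interval_s, self.interval_s / 2)
        elif defragmented == 0:
            # Nothing to do: back off.
            self.interval_s = min(self.max_interval_s, self.interval_s * 2)
        return self.interval_s
```

With the defaults above, a policy that starts with no free pageblocks 
refuses to collapse and asks for a wakeup on the first fault; after a pass 
that defragments 40 pageblocks the interval drops from 60s to 30s.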
