Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred

David Rientjes <rientjes@xxxxxxxxxx> · Fri, 30 Dec 2016 14:30:32 -0800 (PST)

On Fri, 30 Dec 2016, Mel Gorman wrote:

> Michal is correct in that my intent for defer was to have "never stall"
> as the default behaviour.  This was because of the number of severe stalls
> users experienced that lead to recommendations in tuning guides to always
> disable THP. I'd also seen multiple instances in bug reports for stalls
> where it was suggested that THP be disabled even when it could not have
> been a factor. It would be preferred to keep the default behaviour to
> avoid reintroducing such bugs.
> 

I sympathize with that, I've dealt with a number of issues that we have 
encountered where thp defrag was either at fault or wasn't, and there were 
also suggestions to set defrag to "madvise" to rule it out and that 
impacted other users.

I'm curious if you could show examples where there were severe stalls 
being encountered by applications that did madvise(MADV_HUGEPAGE) and 
users were forced to set madvise to "never".  That is, after all, the only 
topic for consideration in this thread: the direct impact to users of 
madvise(MADV_HUGEPAGE).  If an application does it, I believe that's a 
demand for work to be done at allocation time to try to get hugepages.  
They can certainly provide an application-level option to not do the 
MADV_HUGEPAGE.  Qemu is no different, you can add options to do 
madvise(MADV_HUGEPAGE) or not, and you can also do it after fault.

The problem with the current option set is that we don't have the ability 
to trigger background compaction for everybody, which only very minimally 
impacts their page fault latency since it just wakes up kcompactd, and 
allow MADV_HUGEPAGE users to accept that up-front cost by doing direct 
compaction.  My usecase, remapping .text segment and faulting thp memory 
at startup, demands that ability.  Setting defrag=madvise gets that 
behavior, but nobody else triggers background compaction when thp memory 
fails and we _want_ that behavior so work is being done to defrag.  
Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I 
argue that's the wrong behavior.

> I'll neither ack nor nak this patch. However, I would much prefer an
> additional option be added to sysfs called defer-fault that would avoid
> all fault-based stalls but still potentially stall for MADV_HUGEPAGE. I
> would also prefer that the default option is "defer" for both MADV_HUGEPAGE
> and faults.
> 

If you want a fifth option added to sysfs for thp defrag, that's fine, we 
can easily do that.  I'm slightly concerned with more and more options 
added that we will eventually approach the 2^4 option count that I 
mentioned earlier and nobody will know what to select.  I'm fine with the 
kernel default remaining as "madvise," we will just set it to whatever 
gets us "direct for madvise, background for everybody else" behavior as we 
were planning on using "defer."

We can either do

 (1) merge this patch and allow madvise(MADV_HUGEPAGE) users to always try
     to get hugepages, potentially adding options to qemu to suppress 
     their MADV_HUGEPAGE if users have complained (would even fix the 
     issue on 2.6 kernels) or do it after majority has been faulted, or

 (2) add a fifth defrag option to do this suggested behavior and maintain
     that option forever.

I'd obviously prefer the former since I consider MADV_HUGEPAGE and not 
willing to stall as a userspace issue that can _trivially_ be worked 
around in userspace, but in the interest of moving forward on this we can 
do the latter if you'd prefer.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>