Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred

On Fri, 23 Dec 2016, Michal Hocko wrote:

> > We have no way to compact memory for users who are not using 
> > MADV_HUGEPAGE,
> 
> yes we have. it is defrag=always. If you do not want direct compaction
> and the resulting allocation stalls then you have to rely on kcompactd
> which is something we should work longterm.
> 

No, the point of madvise(MADV_HUGEPAGE) is for applications to tell the 
kernel that they really want hugepages.  Really.  Everybody else either 
never did direct compaction or did a substantially watered down version of 
it.  Now, we have a situation where you can either do direct compaction 
for MADV_HUGEPAGE and nothing for anybody else, or direct compaction for 
everybody.  In our use case, we want everybody to kick off background
compaction, because an order=9 allocation with __GFP_KSWAPD_RECLAIM set in
its gfp_mask is the only thing that is going to trigger background
compaction, but we are unable to do so without still incurring lengthy
page faults for non-MADV_HUGEPAGE users.

> > which is some customers, others require MADV_HUGEPAGE for 
> > .text segment remap while loading their binary, without defrag=always or 
> > defrag=defer.  The problem is that we want to demand direct compact for 
> > MADV_HUGEPAGE: they _really_ want hugepages, it's the point of the 
> > madvise.
> 
> and that is the point of defrag=madvise to give them this direct
> compaction.
> 

Do you see the problem with first suggesting defrag=always at the top of
your reply and then defrag=madvise now?  We cannot set both at once; that
is the entire problem with the tristate, and now quadstate, setting.  We
want a combination: EVERYBODY kicks off background compaction, and
applications that really want hugepages and are fine with incurring
lengthy page faults, such as those (for the third time) remapping their
.text segment and doing madvise(MADV_HUGEPAGE) before fault, can use the
madvise.
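
To make the "one mode at a time" point concrete, here is a minimal
userspace sketch, assuming the defrag knob lives at
/sys/kernel/mm/transparent_hugepage/defrag and marks the single active
mode with brackets, e.g. "always defer madvise [never]".  The function
names thp_active_mode() and thp_read_defrag_mode() are made up for
illustration; they are not kernel or libc interfaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) token of a sysfs-style mode line, e.g.
 * "always defer madvise [never]", into out.  Returns 0 on success and
 * -1 if there is no well-formed bracketed token that fits in out.
 * Hypothetical helper, for illustration only.
 */
int thp_active_mode(const char *line, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t len;

	assert(line && out && outlen > 0);

	lb = strchr(line, '[');
	rb = lb ? strchr(lb, ']') : NULL;
	if (!lb || !rb)
		return -1;

	len = (size_t)(rb - lb) - 1;	/* characters between the brackets */
	if (len == 0 || len >= outlen)
		return -1;

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}

/* Read the active defrag mode from sysfs; -1 if the knob is unavailable. */
int thp_read_defrag_mode(char *out, size_t outlen)
{
	char line[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "r");

	if (!f)
		return -1;	/* e.g. THP not built in, or sysfs not mounted */
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return thp_active_mode(line, out, outlen);
}
```

Exactly one token is ever bracketed, so an admin picking "madvise" gives
up "defer" and vice versa; there is no way to express the combination.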

> > We have no setting, without this patch, to ask for background 
> > compaction for everybody so that their fault does not have long latency 
> > and for some customers to demand compaction.
> 
> that is true and what I am trying to say is that we should aim to give
> this background compaction for everybody via kcompactd because there are
> more users than THP who might benefit from low latency high order pages
> availability. 

My patch does that: we _defer_ for everybody unless you're using
madvise(MADV_HUGEPAGE) and really want hugepages.  Forget defrag=never
exists, it's not important in the discussion.  Forget defrag=always exists
because apps such as batch jobs don't want lengthy page faults.  We have
two options remaining:

 - defrag=defer: everybody kicks off background compaction, _nobody_ does
   direct compaction

 - defrag=madvise: madvise(MADV_HUGEPAGE) does direct compaction,
   everybody else does nothing

The point you're missing is that we _want_ defrag=defer.  We really do.  
We don't want to stall in the page allocator to get thp, but we want to 
try to make it available in the short term.  However, apps that do 
madvise(MADV_HUGEPAGE), like remapping your .text segment and wanting your 
text backed by hugepages and incurring the expense up front, or a 
database, or a vm, _want_ hugepages now and don't care about lengthy page 
faults.
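
For reference, the opt-in pattern these apps use looks roughly like the
sketch below: map the region, madvise(MADV_HUGEPAGE) it, and only then
touch it.  This is an anonymous-memory illustration, not the .text remap
itself.  HPAGE_SIZE assumes 2MB hugepages (order=9 with 4KB base pages),
and map_hugepage_hint() is a made-up name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2MB hugepages, i.e. order=9 with 4KB base pages. */
#define HPAGE_SIZE (2UL << 20)

/*
 * Map len bytes of anonymous memory aligned to HPAGE_SIZE and call
 * madvise(MADV_HUGEPAGE) on it before the first touch.  len must be a
 * non-zero multiple of HPAGE_SIZE.  Returns the aligned address, or NULL
 * if mmap() fails.  *advised is set to 1 if the madvise() succeeded and
 * to 0 otherwise (e.g. EINVAL on a kernel built without THP).
 * Hypothetical helper, for illustration only.
 */
void *map_hugepage_hint(size_t len, int *advised)
{
	size_t span = len + HPAGE_SIZE;	/* over-allocate so we can align */
	size_t head, tail;
	char *raw, *aligned;

	assert(len > 0 && len % HPAGE_SIZE == 0 && advised);

	raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	aligned = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) &
			   ~(uintptr_t)(HPAGE_SIZE - 1));

	/* Trim the unaligned head and the unused tail of the mapping. */
	head = (size_t)(aligned - raw);
	tail = span - head - len;
	if (head)
		munmap(raw, head);
	if (tail)
		munmap(aligned + len, tail);

	/* State the intent before any page in the range is faulted. */
	*advised = (madvise(aligned, len, MADV_HUGEPAGE) == 0);
	return aligned;
}
```

The first write to the returned range is where the fault, and any direct
compaction the current defrag mode allows, actually happens.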

The point is that I HAVE NO SETTING to get that behavior and 
defrag=madvise is _not_ a solution because it requires the presence of an 
app that is doing madvise(MADV_HUGEPAGE) AND faulting memory to get any 
order=9 compaction.
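
The missing setting can be seen in a small userspace model of the modes
as described in this thread.  This is a deliberate simplification, not
the kernel's gfp mask logic: the enum names, thp_fault_action() and
mode_gives_wanted_behavior() are all made up for illustration, and
"patched" stands for the behavior this patch proposes for defrag=defer.

```c
#include <assert.h>
#include <stdbool.h>

enum thp_defrag { DEFRAG_ALWAYS, DEFRAG_DEFER, DEFRAG_MADVISE, DEFRAG_NEVER };
enum thp_action { ACT_NONE, ACT_BACKGROUND, ACT_DIRECT };

/*
 * What a thp page fault does under each defrag mode, per the
 * descriptions in this thread.  madvised: the vma had
 * madvise(MADV_HUGEPAGE).  patched: model the proposed change to
 * defrag=defer.  Hypothetical model, for illustration only.
 */
enum thp_action thp_fault_action(enum thp_defrag mode, bool madvised,
				 bool patched)
{
	assert(mode <= DEFRAG_NEVER);

	switch (mode) {
	case DEFRAG_ALWAYS:
		return ACT_DIRECT;		/* everybody stalls */
	case DEFRAG_DEFER:
		if (patched && madvised)
			return ACT_DIRECT;	/* proposed: honor the madvise */
		return ACT_BACKGROUND;		/* nobody stalls */
	case DEFRAG_MADVISE:
		return madvised ? ACT_DIRECT : ACT_NONE;
	case DEFRAG_NEVER:
	default:
		return ACT_NONE;
	}
}

/*
 * The behavior asked for here: background compaction for everybody
 * else, direct compaction for madvise(MADV_HUGEPAGE) users.
 */
bool mode_gives_wanted_behavior(enum thp_defrag mode, bool patched)
{
	return thp_fault_action(mode, false, patched) == ACT_BACKGROUND &&
	       thp_fault_action(mode, true, patched) == ACT_DIRECT;
}
```

In this model, without the patch no mode returns true from
mode_gives_wanted_behavior(); with it, defrag=defer does and the other
three modes are unchanged.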

> > ?????? Why does the admin care if a user's page fault wants to reclaim to 
> > get high order memory?
> 
> Because the whole point of the defrag knob is to allow _administrator_
> control how much we try to fault in THP. And the primary motivation were
> latencies. The whole point of introducing defer option was to _never_
> stall in the page fault while it still allows to kick the background
> compaction. If you really want to tweak any option then madvise would be
> more appropriate IMHO because the semantic would be still clear. Use
> direct compaction for MADV_HUGEPAGE vmas and kick in kswapd/kcompactd
> for others.
> 

You want defrag=madvise to start doing background compaction for
everybody, which was never done before for existing users of
defrag=madvise?  That might be possible, I don't really care, I just think
it's riskier because existing users of defrag=madvise would be opted in to
new behavior by the kernel change.  This patch changes defrag=defer
because it's the new option and people setting the mode know what they
are getting.

I disagree with your description of what the defrag setting is intended 
for.  The setting of thp defrag is to optimize for apps that truly want 
transparent behavior, i.e. they aren't doing madvise(MADV_HUGEPAGE).  Are 
they willing to incur lengthy pagefaults for thp when not doing any 
madvise(2)?  defrag=defer should not mean that users of 
madvise(MADV_HUGEPAGE) that have clearly specified their intent should not 
be allowed to try compacting memory themselves because they have indicated 
they are fine with such an expense by doing the madvise(2).

This is obviously fine for Kirill, and I have users who remap their .text 
segment and do madvise(MADV_HUGEPAGE) because they really want hugepages
when they are exec'd, so I'd kindly ask you to consider the real-world use 
cases that require background compaction to make hugepages available for 
everybody but allow apps to opt-in to take the expense of compaction on 
themselves rather than your own theory of what users want.
