Re: [RFC] Hugepage collapse in process context

David Rientjes <rientjes@xxxxxxxxxx> · Thu, 18 Feb 2021 14:34:56 -0800 (PST)

On Thu, 18 Feb 2021, David Hildenbrand wrote:

> > > > Hi everybody,
> > > > 
> > > > Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> > > > That's normally fine as a system-wide setting, but some applications
> > > > would benefit from a more aggressive approach (as long as they are
> > > > willing to pay for it).
> > > > 
> > > > Instead of adding priorities for eligible ranges of memory to
> > > > khugepaged, temporarily speeding khugepaged up for the whole system,
> > > > or sharding its work for memory belonging to a certain process, one
> > > > approach would be to allow userspace to induce hugepage collapse.
> > > > 
> > > > The benefit to this approach would be that this is done in process
> > > > context so its cpu is charged to the process that is inducing the
> > > > collapse.  Khugepaged is not involved.
> > > 
> > > Yes, this makes a lot of sense to me.
> > > 
> > > > Idea was to allow userspace to induce hugepage collapse through the new
> > > > process_madvise() call.  This allows us to collapse hugepages on behalf
> > > > of current or another process for a vectored set of ranges.
> > > 
> > > Yes, madvise sounds like a good fit for the purpose.
> > 
> > Agreed on both points.
> > 
> > > > This could be done through a new process_madvise() mode *or* it could
> > > > be a flag to MADV_HUGEPAGE since process_madvise() allows for a flag
> > > > parameter to be passed.  For example, MADV_F_SYNC.
> > > 
> > > Would this MADV_F_SYNC be applicable to other madvise modes? Most
> > > existing madvise modes do not seem to make much sense. We can argue that
> > > MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
> > > sure we want to provide such a strong semantic because it can limit
> > > future reclaim optimizations.
> > > 
> > > To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
> > 
> > > I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
> > > MADV_WILLNEED with this semantic? But you are probably more interested in
> > > process_madvise() anyway. There the new flag would make more sense. But
> > > there's also David H.'s proposal for MADV_POPULATE and there might be
> > > benefit in considering both at the same time? Should e.g. MADV_POPULATE
> > > with MADV_HUGEPAGE have the collapse semantics? But would MADV_POPULATE be
> > > added to process_madvise() as well? Just thinking out loud so we don't end
> > > up with more flags than necessary, it's already confusing enough as it is.
> > 
> 
> Note that madvise() eats only a single value, not flags. Combinations as you
> describe are not possible.
> 
> Something like MADV_HUGEPAGE_COLLAPSE makes sense to me: it does not need
> the mmap lock in write mode and does not modify the actual VMA, only a
> mapping.
> 

Agreed, and happy to see that there's a general consensus for the
direction.  The benefit of a new madvise mode is that it can also be used
with madvise() if you are interested in only a single range of your own
memory, and it doesn't need to be reconciled with any of the already
overloaded semantics of MADV_HUGEPAGE.

Otherwise, process_madvise() can be used for other processes and/or 
vectored ranges.
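
To make the two entry points concrete, here is a rough userspace sketch.
It is illustrative only: MADV_HUGEPAGE_COLLAPSE_SKETCH is a placeholder
value I made up for the mode discussed in this thread (a kernel without
such a mode should simply return EINVAL), and struct range, pack_ranges(),
collapse_self() and collapse_remote() are hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical: the mode is only proposed in this thread.  The value is a
 * placeholder, so a kernel without such a mode should fail with EINVAL. */
#define MADV_HUGEPAGE_COLLAPSE_SKETCH 0x7fff

#ifndef SYS_process_madvise
#define SYS_process_madvise 440	/* number on most architectures, Linux 5.10+ */
#endif

#define MAX_RANGES 16

struct range {
	void *addr;
	size_t len;
};

/* Single range of our own memory: plain madvise() is enough. */
int collapse_self(void *addr, size_t len)
{
	return madvise(addr, len, MADV_HUGEPAGE_COLLAPSE_SKETCH);
}

/* Pack ranges into the iovec array that process_madvise() consumes. */
size_t pack_ranges(const struct range *r, size_t n, struct iovec *iov)
{
	assert(n <= MAX_RANGES);
	for (size_t i = 0; i < n; i++) {
		iov[i].iov_base = r[i].addr;
		iov[i].iov_len = r[i].len;
	}
	return n;
}

/* Vectored ranges, possibly in another process identified by a pidfd.
 * Returns the number of bytes advised, or -1 with errno set.  flags is 0
 * because process_madvise() does not interpret it today. */
long collapse_remote(int pidfd, const struct range *r, size_t n)
{
	struct iovec iov[MAX_RANGES];
	size_t vlen = pack_ranges(r, n, iov);

	return syscall(SYS_process_madvise, pidfd, iov, vlen,
		       MADV_HUGEPAGE_COLLAPSE_SKETCH, 0);
}
```

A caller would get the pidfd from pidfd_open(2) for the target process and
pass up to MAX_RANGES ranges per call.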

Song's use case for this to prioritize thp usage is very important for us
as well.  I hadn't thought of the madvise(MADV_HUGEPAGE) +
madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating the latter
would allocate the hugepage with khugepaged's gfp mask so it would always
compact.  But it seems like it would actually be better to use the gfp
mask that would be used at fault for the vma and leave it to userspace to
determine whether that's MADV_HUGEPAGE or not.  Makes sense.

(Userspace could even do madvise(MADV_NOHUGEPAGE) + 
madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but 
otherwise exclude it from khugepaged's consideration if it were inclined.)
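
The combinations above could look roughly like this from userspace.  Again
a sketch under assumptions: MADV_HUGEPAGE_COLLAPSE_SKETCH is a made-up
placeholder value for the proposed mode, and enum thp_policy,
policy_advice() and collapse_with_policy() are hypothetical names.  The
idea is only that the vma's existing THP setting, not the collapse call,
would pick the allocation behavior.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical placeholder for the mode proposed in this thread. */
#define MADV_HUGEPAGE_COLLAPSE_SKETCH 0x7fff

enum thp_policy {
	THP_KEEP_VMA_POLICY,		/* leave the vma's THP setting alone */
	THP_MADV_HUGEPAGE_FIRST,	/* madvise(MADV_HUGEPAGE) first */
	THP_HIDE_FROM_KHUGEPAGED,	/* madvise(MADV_NOHUGEPAGE) first */
};

/* Map a policy to the madvise() advice applied before the collapse,
 * or -1 if the vma should be left untouched. */
int policy_advice(enum thp_policy p)
{
	switch (p) {
	case THP_MADV_HUGEPAGE_FIRST:
		return MADV_HUGEPAGE;
	case THP_HIDE_FROM_KHUGEPAGED:
		return MADV_NOHUGEPAGE;
	default:
		return -1;
	}
}

/* Set the vma policy (if any), then request the synchronous collapse.
 * Returns 0 on success, -1 with errno set otherwise. */
int collapse_with_policy(void *addr, size_t len, enum thp_policy p)
{
	int advice = policy_advice(p);

	assert(addr != NULL);
	if (advice >= 0 && madvise(addr, len, advice) != 0)
		return -1;
	return madvise(addr, len, MADV_HUGEPAGE_COLLAPSE_SKETCH);
}
```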

Two other minor points:

 - Currently, process_madvise() doesn't use the flags parameter at all so 
   there's the question of whether we need generalized flags that apply to 
   most madvise modes or whether the flags can be specific to the mode 
   being used.  For example, a natural extension of this new mode would be 
   to determine the hugepage size if we were ever to support synchronous 
   collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)

 - We haven't discussed the future of khugepaged with this new mode: it 
   seems like we could simply implement khugepaged fully in userspace and 
   remove it from the kernel? :)
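
On the last point, a userspace replacement would need its own pacing.  As
a back-of-the-envelope sketch, this is what khugepaged's default budget
(4096 pages every 10s, from the top of the thread) works out to.  The
4 KiB base page and 2 MiB hugepage sizes are my assumptions for x86-64,
the helper names are hypothetical, and scanned pages are treated as an
upper bound on what a pass could collapse.

```c
#include <assert.h>

/* khugepaged defaults quoted at the top of the thread */
#define PAGES_TO_SCAN	4096UL
#define SCAN_PERIOD_SEC	10UL

/* Assumptions: x86-64 with 4 KiB base pages and 2 MiB hugepages */
#define BASE_PAGE_SIZE	4096UL
#define HPAGE_SIZE	(2UL << 20)

/* Upper bound on hugepage-sized extents one scan pass can cover. */
unsigned long extents_per_pass(unsigned long pages, unsigned long page_size,
			       unsigned long hpage_size)
{
	assert(page_size > 0 && hpage_size >= page_size);
	return pages * page_size / hpage_size;
}

/* Seconds needed to walk `bytes` of memory at the default pace. */
unsigned long seconds_to_cover(unsigned long bytes)
{
	unsigned long per_pass = PAGES_TO_SCAN * BASE_PAGE_SIZE;
	unsigned long passes = (bytes + per_pass - 1) / per_pass;

	return passes * SCAN_PERIOD_SEC;
}
```

With those assumptions one pass covers at most 8 hugepage-sized extents
(16 MiB), so walking 1 GiB takes 64 passes, or about 640 seconds.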