Re: [RFC] Hugepage collapse in process context

Michal Hocko <mhocko@xxxxxxxx> · Thu, 18 Feb 2021 11:01:30 +0100



On Thu 18-02-21 09:53:25, Song Liu wrote:
> 
> 
> > On Feb 18, 2021, at 12:39 AM, Michal Hocko <mhocko@xxxxxxxx> wrote:
> > 
> > On Thu 18-02-21 08:11:13, Song Liu wrote:
> >> 
> >> 
> >>> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@xxxxxxxxxx> wrote:
> >>> 
> >>> Hi everybody,
> >>> 
> >>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> >>> That's normally fine as a system-wide setting, but some applications would 
> >>> benefit from a more aggressive approach (as long as they are willing to 
> >>> pay for it).
> >>> 
> >>> Instead of adding priorities for eligible ranges of memory to khugepaged, 
> >>> temporarily speeding khugepaged up for the whole system, or sharding its 
> >>> work for memory belonging to a certain process, one approach would be to 
> >>> allow userspace to induce hugepage collapse.
> >>> 
> >>> The benefit to this approach would be that this is done in process context 
> >>> so its cpu is charged to the process that is inducing the collapse.  
> >>> Khugepaged is not involved.
> >>> 
> >>> Idea was to allow userspace to induce hugepage collapse through the new 
> >>> process_madvise() call.  This allows us to collapse hugepages on behalf of 
> >>> current or another process for a vectored set of ranges.
> >>> 
> >>> This could be done through a new process_madvise() mode *or* it could be a 
> >>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> >>> to be passed.  For example, MADV_F_SYNC.
> >>> 
> >>> When done, this madvise call would allocate a hugepage on the right node 
> >>> and attempt to do the collapse in process context just as khugepaged would 
> >>> otherwise do.
> >> 
> >> This is a very interesting idea. One question: IIUC, the user process will 
> >> block until all small pages in given ranges are collapsed into THPs.
> > 
> > Do you mean that PF would be blocked due to exclusive mmap_sem? Or is
> > there anything else you have in mind?
> 
> I was thinking about memory defragmentation when the application asks for
> many THPs. Say the application looks like
> 
> main()
> {
> 	malloc();
> 	madvise(HUGE);
> 	process_madvise();
> 	
> 	/* start doing work */
> }
> 
> IIUC, when process_madvise() finishes, the THPs should be ready. However, 
> if defragmentation takes a long time, the process will wait in process_madvise().

OK, I see. The operation is definitely not free, which is to be expected. You
can do the same from a thread which can spend time collapsing THPs.
There are still internal resources that might block others - e.g. the
above mentioned mmap_sem. We can try hard to reduce the lock time but
this is unlikely to be completely free of any interruption of the
workload.
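
To illustrate the "do it from a thread" point, here is a rough userspace
sketch. It is only an illustration under assumptions: the synchronous
collapse interface discussed in this thread is still a proposal, so the
worker uses plain madvise(MADV_HUGEPAGE) as a stand-in. The names
collapse_job, collapse_worker, alloc_aligned and start_collapse are
hypothetical helpers, and the 2MB hugepage size is assumed (x86-64
default).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Assumed THP size: 2MB (x86-64 default); other configurations differ. */
#define HPAGE_SIZE (2UL << 20)

/* Hypothetical descriptor for one range handed to the helper thread. */
struct collapse_job {
	void *addr;
	size_t len;
	int result;	/* madvise() return value: 0 on success, -1 on error */
};

/*
 * Worker thread body. madvise(MADV_HUGEPAGE) only marks the range as
 * eligible; it is a stand-in for the proposed synchronous collapse call,
 * which would spend its (possibly long) time here instead of in the
 * main thread.
 */
static void *collapse_worker(void *arg)
{
	struct collapse_job *job = arg;

	assert(job != NULL);
	job->result = madvise(job->addr, job->len, MADV_HUGEPAGE);
	return NULL;
}

/* Allocate a hugepage-aligned buffer so whole hugepages can back it. */
static void *alloc_aligned(size_t len)
{
	void *p = NULL;

	if (posix_memalign(&p, HPAGE_SIZE, len) != 0)
		return NULL;
	return p;
}

/* Start the helper thread; returns 0 on success like pthread_create(). */
static int start_collapse(pthread_t *tid, struct collapse_job *job)
{
	return pthread_create(tid, NULL, collapse_worker, job);
}
```

The main thread would call start_collapse() and then carry on with its
work, joining the helper with pthread_join() only when it needs to know
the result. Both threads share one mm, so the main thread can still
stall on mmap_sem while the helper holds it.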
-- 
Michal Hocko
SUSE Labs