Re: [RFC] Hugepage collapse in process context

> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@xxxxxxxxxx> wrote:
> 
> Hi everybody,
> 
> Khugepaged is slow by default, it scans at most 4096 pages every 10s.  
> That's normally fine as a system-wide setting, but some applications would 
> benefit from a more aggressive approach (as long as they are willing to 
> pay for it).
> 
> Instead of adding priorities for eligible ranges of memory to khugepaged, 
> temporarily speeding khugepaged up for the whole system, or sharding its 
> work for memory belonging to a certain process, one approach would be to 
> allow userspace to induce hugepage collapse.
> 
> The benefit to this approach would be that this is done in process context 
> so its cpu is charged to the process that is inducing the collapse.  
> Khugepaged is not involved.
> 
> Idea was to allow userspace to induce hugepage collapse through the new 
> process_madvise() call.  This allows us to collapse hugepages on behalf of 
> current or another process for a vectored set of ranges.
> 
> This could be done through a new process_madvise() mode *or* it could be a 
> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter 
> to be passed.  For example, MADV_F_SYNC.
> 
> When done, this madvise call would allocate a hugepage on the right node 
> and attempt to do the collapse in process context just as khugepaged would 
> otherwise do.

This is a very interesting idea. One question: IIUC, the user process will 
block until all of the small pages in the given ranges are collapsed into 
THPs. What would happen if memory is so fragmented that we cannot allocate 
that many huge pages? Do we need some fallback mechanism? 
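For reference, a minimal userspace sketch of what such a call might look
like. process_madvise() and pidfd_open() are existing syscalls, but
MADV_F_SYNC is only the flag name floated in this RFC (with an made-up
value here), so this is purely illustrative:

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_F_SYNC
#define MADV_F_SYNC 0x1		/* hypothetical flag, illustrative value only */
#endif

/* Ask the kernel to collapse the given ranges of a target process into
 * THPs synchronously, in the caller's context, as proposed in this thread. */
static int collapse_ranges(pid_t pid, const struct iovec *ranges,
			   size_t nr_ranges)
{
	int pidfd = syscall(__NR_pidfd_open, pid, 0);
	ssize_t ret;

	if (pidfd < 0)
		return -1;

	/* MADV_HUGEPAGE on a vectored set of ranges; the hypothetical
	 * MADV_F_SYNC flag would request the collapse right away instead
	 * of leaving the work to khugepaged. */
	ret = syscall(__NR_process_madvise, pidfd, ranges, nr_ranges,
		      MADV_HUGEPAGE, MADV_F_SYNC);
	close(pidfd);
	return ret < 0 ? -1 : 0;
}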

> 
> This would immediately be useful for a malloc implementation, for example, 
> that has released its memory back to the system using MADV_DONTNEED and 
> will subsequently refault the memory.  Rather than wait for khugepaged to 
> come along 30m later, for example, and collapse this memory into a 
> hugepage (which could take a much longer time on a very large system), an 
> alternative would be to use this process_madvise() mode to induce the 
> action up front.  In other words, say "I'm returning this memory to the 
> application and it's going to be hot, so back it by a hugepage now rather 
> than waiting until later."
> 
> It would also be useful for read-only file-backed mappings for text 
> segments.  Khugepaged should be happy, it's just less work done by generic 
> kthreads that gets charged as an overall tax to everybody.

Mixing sync-THP with async-THP (khugepaged) could be useful when there are 
different priorities of THPs. In one of our use cases, we use THP for both 
text and data. The ratio may look like 5 THPs for text to 2000 THPs for 
data. If the system has fewer than 2005 THPs available, we wouldn't want to 
wait for all of them; we would rather prioritize the THPs for text. With 
this new mechanism, we can use sync-THP for the text and async-THP for the 
data. 
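To make that concrete, a rough sketch of the mixed usage (again treating
MADV_F_SYNC as the hypothetical flag from this thread, not an existing
kernel API): collapse the hot text range synchronously in our own context,
and only mark the data region so khugepaged picks it up later:

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_F_SYNC
#define MADV_F_SYNC 0x1		/* hypothetical flag, illustrative value only */
#endif

static void back_with_thp(void *text, size_t text_len,
			  void *data, size_t data_len)
{
	struct iovec vec = { .iov_base = text, .iov_len = text_len };
	int pidfd = syscall(__NR_pidfd_open, getpid(), 0);

	/* High priority: collapse the text pages now, with the cpu time
	 * charged to this process (sync-THP). */
	if (pidfd >= 0) {
		syscall(__NR_process_madvise, pidfd, &vec, 1,
			MADV_HUGEPAGE, MADV_F_SYNC);
		close(pidfd);
	}

	/* Lower priority: make the data region eligible and let khugepaged
	 * collapse it in the background (async-THP). */
	madvise(data, data_len, MADV_HUGEPAGE);
}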

Thanks,
Song




