On Thu 18-02-21 09:53:25, Song Liu wrote:
> 
> 
> > On Feb 18, 2021, at 12:39 AM, Michal Hocko <mhocko@xxxxxxxx> wrote:
> > 
> > On Thu 18-02-21 08:11:13, Song Liu wrote:
> >> 
> >> 
> >>> On Feb 16, 2021, at 8:24 PM, David Rientjes <rientjes@xxxxxxxxxx> wrote:
> >>> 
> >>> Hi everybody,
> >>> 
> >>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> >>> That's normally fine as a system-wide setting, but some applications would
> >>> benefit from a more aggressive approach (as long as they are willing to
> >>> pay for it).
> >>> 
> >>> Instead of adding priorities for eligible ranges of memory to khugepaged,
> >>> temporarily speeding khugepaged up for the whole system, or sharding its
> >>> work for memory belonging to a certain process, one approach would be to
> >>> allow userspace to induce hugepage collapse.
> >>> 
> >>> The benefit to this approach would be that this is done in process context
> >>> so its cpu is charged to the process that is inducing the collapse.
> >>> Khugepaged is not involved.
> >>> 
> >>> Idea was to allow userspace to induce hugepage collapse through the new
> >>> process_madvise() call. This allows us to collapse hugepages on behalf of
> >>> current or another process for a vectored set of ranges.
> >>> 
> >>> This could be done through a new process_madvise() mode *or* it could be a
> >>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter
> >>> to be passed. For example, MADV_F_SYNC.
> >>> 
> >>> When done, this madvise call would allocate a hugepage on the right node
> >>> and attempt to do the collapse in process context just as khugepaged would
> >>> otherwise do.
> >> 
> >> This is very interesting idea. One question, IIUC, the user process will
> >> block until all small pages in given ranges are collapsed into THPs.
> > 
> > Do you mean that PF would be blocked due to exclusive mmap_sem? Or is
> > there anything else you have in mind?
> 
> I was thinking about memory defragmentation when the application asks for
> many THPs. Say the application looks like
> 
> main()
> {
>         malloc();
>         madvise(HUGE);
>         process_madvise();
> 
>         /* start doing work */
> }
> 
> IIUC, when process_madvise() finishes, the THPs should be ready. However,
> if defragmentation takes a long time, the process will wait in process_madvise().

OK, I see. The operation is definitely not free, which is to be
expected. You can do the same from a thread which can spend time
collapsing THPs. There are still internal resources that might block
others - e.g. the above mentioned mmap_sem. We can try hard to reduce
the lock time but this is unlikely to be completely free of any
interruption of the workload.
-- 
Michal Hocko
SUSE Labs
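
For illustration only, the "do the same from a thread" suggestion could be sketched as below. This is an assumption-laden sketch, not code from the thread: the names hint_job, hint_worker and run_with_background_hint are hypothetical, and the helper thread only issues the existing madvise(MADV_HUGEPAGE) hint, because the synchronous collapse mode discussed above (e.g. MADV_F_SYNC) is a proposal and is not used here. The madvise() result is treated as advisory since THP may not be enabled on a given kernel.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical job descriptor handed to the helper thread. */
struct hint_job {
	void *addr;
	size_t len;
	int rc;		/* madvise() result, filled in by the helper */
};

/*
 * Helper thread: spends its own time on the THP request so the main
 * thread does not wait for it. A synchronous collapse request, if the
 * kernel offered one, would be issued here instead.
 */
static void *hint_worker(void *arg)
{
	struct hint_job *job = arg;

	job->rc = madvise(job->addr, job->len, MADV_HUGEPAGE);
	return NULL;
}

/* Returns 0 if the mapping, the helper thread and the work all completed. */
int run_with_background_hint(size_t len)
{
	struct hint_job job;
	pthread_t tid;
	void *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	job.addr = buf;
	job.len = len;
	job.rc = 0;

	if (pthread_create(&tid, NULL, hint_worker, &job) != 0) {
		munmap(buf, len);
		return -1;
	}

	/* start doing work: the main thread is not waiting on the helper */
	memset(buf, 0xA5, len);

	pthread_join(tid, NULL);
	/* job.rc is advisory only; it may be -1 when THP is unavailable. */
	munmap(buf, len);
	return 0;
}
```

Even with this split, both threads share one address space, so the helper can still contend with the worker on mmap_sem, e.g. while the worker takes page faults.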