[ Removed Richard Henderson from the CC list as the delivery fails for
  his address ]

On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> Introduction
> --------------------------------
>
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
>
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
>
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/
>
> Interface
> --------------------------------
>
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> leverages the new process_madvise(2) call.
>
> (*) process_madvise(2)
>
>     Performs a synchronous collapse of the native pages mapped by
>     the list of iovecs into transparent hugepages. The default gfp
>     flags used will be the same as those used at-fault for the VMA
>     region(s) covered.

Could you expand on the reasoning here? The default allocation mode for
#PF is rather light. Madvised will try harder. The reasoning is that we
want to make stalls due to #PF as small as possible and only try harder
for madvised areas (also a subject of configuration). Wouldn't it make
more sense to try harder for an explicit call like madvise?

>     When multiple VMA regions are spanned, if
>     faulting-in memory from any VMA would permit synchronous
>     compaction and reclaim, then all hugepage allocations required
>     to satisfy the request may enter compaction and reclaim.

I am not sure I follow here. Let's have a memory range spanning two
vmas, one with MADV_HUGEPAGE.

>     Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
>     by default, as the user is explicitly requesting this action.
>     Define two flags to control collapse semantics, passed through
>     process_madvise(2)’s optional flags parameter:

This part is discussed later in the thread.

>
>     MADV_F_COLLAPSE_LIMITS
>
>     If supplied, collapse respects pte collapse limits set via
>     sysfs:
>     /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
>     Required if calling on behalf of another process and not
>     CAP_SYS_ADMIN.
>
>     MADV_F_COLLAPSE_DEFRAG
>
>     If supplied, permit synchronous compaction and reclaim,
>     regardless of VMA flags.

Why do we need this?
-- 
Michal Hocko
SUSE Labs