Hey Michal, thanks for taking the time to review / comment.

On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> [ Removed Richard Henderson from the CC list as the delivery fails for
> his address]

Thank you :)

> On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > Introduction
> > --------------------------------
> >
> > This series provides a mechanism for userspace to induce a collapse of
> > eligible ranges of memory into transparent hugepages in process context,
> > thus permitting users to more tightly control their own hugepage
> > utilization policy at their own expense.
> >
> > This idea was previously introduced by David Rientjes, and thanks to
> > everyone for your patience while I prepared these patches resulting from
> > that discussion[1].
> >
> > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/
> >
> > Interface
> > --------------------------------
> >
> > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > leverages the new process_madvise(2) call.
> >
> > (*) process_madvise(2)
> >
> >         Performs a synchronous collapse of the native pages mapped by
> >         the list of iovecs into transparent hugepages. The default gfp
> >         flags used will be the same as those used at-fault for the VMA
> >         region(s) covered.
>
> Could you expand on reasoning here? The default allocation mode for #PF
> is rather light. Madvised will try harder. The reasoning is that we want
> to make stalls due to #PF as small as possible and only try harder for
> madvised areas (also a subject of configuration). Wouldn't it make more
> sense to try harder for an explicit calls like madvise?
>

The reasoning is that the user has presumably configured system/vmas to
tell the kernel how badly they want thps, and so this call aligns with
current expectations. I.e.
a user who goes to the trouble of trying to fault-in a thp at a given
memory address likely wants a thp "as badly" as the same user
MADV_COLLAPSE'ing the same memory to get a thp.

If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
used to explicitly request the kernel to try harder, as you mention.

> >         When multiple VMA regions are spanned, if
> >         faulting-in memory from any VMA would permit synchronous
> >         compaction and reclaim, then all hugepage allocations required
> >         to satisfy the request may enter compaction and reclaim.
>
> I am not sure I follow here. Let's have a memory range spanning two
> vmas, one with MADV_HUGEPAGE.

I think you are rightly confused here, since the code doesn't currently
match this description - thanks for pointing it out.

The idea* was that, in the case you provided, the gfp flags used for all
thp allocations would match those used for a MADV_HUGEPAGE vma, under
current system settings. IOW, we treat the semantics of the collapse for
the entire range uniformly (aside from MADV_NOHUGEPAGE, as per earlier
discussions). So, for example, if transparent_hugepage/enabled was set
to "always" and transparent_hugepage/defrag was set to "madvise", then
all allocations could enter direct reclaim.

The reasoning for this is twofold. #1: the user has already told us that
entering direct reclaim is tolerable for this syscall, and they can
wait. #2: MADV_COLLAPSE might yield confusing results otherwise; some
ranges might get backed by thps, while others may not. Also, a single
MADV_HUGEPAGE vma early in the range might permit enough
reclaim/compaction to allow subsequent non-MADV_HUGEPAGE allocations to
succeed where they otherwise may not have.

However, the code and this description disagree, since madvise
decomposes a call spanning multiple vmas into iterative
madvise_vma_behavior() calls, each over a single vma, with no state
shared between calls. If the motivation above is sufficient, then this
could be added.
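
To make the intended range-wide semantics concrete, here is a minimal
userspace model of the rule described above. It is only a sketch, not
code from the series: the enum, struct vma_model and both function names
are hypothetical, and the per-vma check is a simplification of the
documented transparent_hugepage/defrag settings. It shows the "if any
vma would permit direct reclaim at fault, the whole range may" rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the transparent_hugepage/defrag sysfs setting. */
enum thp_defrag {
	THP_DEFRAG_ALWAYS,
	THP_DEFRAG_DEFER,
	THP_DEFRAG_DEFER_MADVISE,
	THP_DEFRAG_MADVISE,
	THP_DEFRAG_NEVER,
};

/* Hypothetical, heavily simplified stand-in for a vma. */
struct vma_model {
	bool madv_hugepage;	/* vma was marked with MADV_HUGEPAGE */
};

/* Would a thp fault in this single vma be allowed to enter direct reclaim? */
bool fault_may_direct_reclaim(enum thp_defrag defrag,
			      const struct vma_model *vma)
{
	switch (defrag) {
	case THP_DEFRAG_ALWAYS:
		return true;
	case THP_DEFRAG_DEFER_MADVISE:
	case THP_DEFRAG_MADVISE:
		return vma->madv_hugepage;
	case THP_DEFRAG_DEFER:
	case THP_DEFRAG_NEVER:
		return false;
	}
	return false;
}

/*
 * Range-wide rule from the description: if faulting in memory from any
 * vma in the range would permit direct reclaim, every allocation needed
 * for the request may enter direct reclaim.
 */
bool range_may_direct_reclaim(enum thp_defrag defrag,
			      const struct vma_model *vmas, size_t n)
{
	assert(vmas != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (fault_may_direct_reclaim(defrag, &vmas[i]))
			return true;
	return false;
}
```

With defrag set to "madvise" and a two-vma range where only the second
vma has MADV_HUGEPAGE, range_may_direct_reclaim() returns true for the
whole range even though fault_may_direct_reclaim() is false for the
first vma on its own. Implementing this for real would need state shared
across the per-vma calls, which is exactly what is missing today.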
>
> >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> >         by default, as the user is explicitly requesting this action.
> >         Define two flags to control collapse semantics, passed through
> >         process_madvise(2)’s optional flags parameter:
>
> This part is discussed later in the thread.
>
> >
> >         MADV_F_COLLAPSE_LIMITS
> >
> >         If supplied, collapse respects pte collapse limits set via
> >         sysfs:
> >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> >         Required if calling on behalf of another process and not
> >         CAP_SYS_ADMIN.
> >
> >         MADV_F_COLLAPSE_DEFRAG
> >
> >         If supplied, permit synchronous compaction and reclaim,
> >         regardless of VMA flags.
>
> Why do we need this?

Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?

* MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
  inter-process protection for collapsing memory in another process'
  address space (which a malevolent program could exploit to cause oom
  conditions in another memcg hierarchy, for example), but we want
  privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
  utilization as they wish.

* MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want to
  explicitly tell the kernel to try harder to back this by thps,
  regardless of the current system/vma configuration.

Note that when used together, these flags can be used to implement the
exact behavior of khugepaged, through MADV_COLLAPSE.

> --
> Michal Hocko
> SUSE Labs
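
For illustration, the flag rules described in this reply can be modelled
in a few lines of plain C. This is a sketch under stated assumptions,
not code from the series: the bit values given to MADV_F_COLLAPSE_LIMITS
and MADV_F_COLLAPSE_DEFRAG are placeholders, and struct collapse_policy
and collapse_policy_from_flags() are hypothetical names introduced only
for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit values; the real values are not given in this thread. */
#define MADV_F_COLLAPSE_LIMITS	(1u << 0)
#define MADV_F_COLLAPSE_DEFRAG	(1u << 1)

/* Hypothetical summary of how a collapse request should behave. */
struct collapse_policy {
	bool respect_khugepaged_limits;	/* honor max_ptes_[none|swap|shared] */
	bool force_direct_reclaim;	/* compaction/reclaim regardless of vma flags */
};

/*
 * Translate request flags into a policy. Returns false when the request
 * must be rejected: collapsing on behalf of another process without
 * CAP_SYS_ADMIN requires MADV_F_COLLAPSE_LIMITS.
 */
bool collapse_policy_from_flags(unsigned int flags, bool target_is_self,
				bool has_cap_sys_admin,
				struct collapse_policy *out)
{
	bool limits = (flags & MADV_F_COLLAPSE_LIMITS) != 0;

	assert(out != NULL);

	if (!target_is_self && !has_cap_sys_admin && !limits)
		return false;

	out->respect_khugepaged_limits = limits;
	out->force_direct_reclaim = (flags & MADV_F_COLLAPSE_DEFRAG) != 0;
	return true;
}
```

In this model, passing both flags produces a policy that respects the
khugepaged limits and always permits direct reclaim, while an
unprivileged caller targeting another process without
MADV_F_COLLAPSE_LIMITS is refused.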