On Tue, Apr 19, 2022 at 7:33 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 16.04.22 21:26, Peter Xu wrote: > > Hi, Zach, > > > > On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote: > >> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > >>> > >>> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote: > >>>> Hey Peter, > >>>> > >>>> Thanks for taking the time to review! > >>>> > >>>> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > >>>>> > >>>>> Hi, Zach, > >>>>> > >>>>> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote: > >>>>>> process_madvise(2) > >>>>>> > >>>>>> Performs a synchronous collapse of the native pages > >>>>>> mapped by the list of iovecs into transparent hugepages. > >>>>>> > >>>>>> Allocation semantics are the same as khugepaged, and depend on > >>>>>> (1) the active sysfs settings > >>>>>> /sys/kernel/mm/transparent_hugepage/enabled and > >>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2) > >>>>>> the VMA flags of the memory range being collapsed. > >>>>>> > >>>>>> Collapse eligibility criteria differs from khugepaged in that > >>>>>> the sysfs files > >>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared] > >>>>>> are ignored. > >>>>> > >>>>> The userspace khugepaged idea definitely makes sense to me, though I'm > >>>>> curious how the line is drown on the different behaviors here by explicitly > >>>>> ignoring the max_ptes_* entries. > >>>>> > >>>>> Let's assume the initiative is to duplicate a more data-aware khugepaged in > >>>>> the userspace, then IMHO it makes more sense to start with all the policies > >>>>> that applies to khugepaged already, including max_pte_*. > >>>>> > >>>>> I can understand the willingness to provide even stronger semantics here > >>>>> than khugepaged since the userspace could have very clear knowledge of how > >>>>> to provision the memories (better than a kernel scanner). It's just that > >>>>> IMHO it could be slightly confusing if the new interface only partially > >>>>> apply the khugepaged rules. > >>>>> > >>>>> No strong opinion here. It could already been a trade-off after the > >>>>> discussion from the RFC with Michal which I read.. Just curious about how > >>>>> you made that design decision so feel free to read it as a pure question. > >>>>> > >>>> > >>>> Understand your point here. The allocation and max_pte_* semantics are > >>>> split between khugepaged-like and fault-like, respectively - which > >>>> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS > >>>> flag to control the former's behavior, but agreed to keep things > >>>> simple to start, and expand the interface if/when necessary. I opted > >>>> to ignore max_ptes_* as the default since I envisioned that early > >>>> adopters would "just want it to work". One such example would be > >>>> backing executable text by hugepages on program load when many pages > >>>> haven't been demand-paged in yet. > >>>> > >>>> What do you think? > >>> > >>> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics > >>> blurred. > >>> > >>> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do > >>> khugepaged on this range, and with current thread context". IMHO any > >>> feature bits then can be supplementing special needs, and I'll take the thp > >>> backing executable example to be one of the (good?) reason we'd need an > >>> extra flag for ignoring the max_ptes_* knobs. > >>> > >>> So personally if I were you maybe I'll start with the simple scheme of that > >>> (even if it won't immediately service a thing) but then add either the > >>> defrag or ignore_max_ptes_* as feature bits later on, with clear use case > >>> descriptions about why we need each of the feature flags. IMHO numbers > >>> would be even more helpful when there's specific use cases on the show. > >>> > >>> Or, perhaps you think all potential MADV_COLLAPSE users should literally > >>> skip max_ptes_* limitations always? > >>> > >> > >> Thanks for your time and valuable feedback here, Peter. I had a response typed > >> up, but after a few iterations became increasingly unsatisfied with my > >> own response. > >> > >> I think this feature should be able to stand on its own without > >> consideration of a userspace khugepaged, as we have existing concrete > >> examples where it would be useful. In these cases, and I assume almost > >> all other use-cases outside userspace khugepaged, max_ptes_* should be > >> ignored as the fundamental assumption of MADV_COLLAPSE is that the > >> user knows better, and IMHO, khugepaged heuristics shouldn't tell > >> users they are wrong. > > > > Valid point. And actually right after I replied I thought similarly on > > whether we need to connect the two interfaces at all.. > > > > It's just that it's very easy to go think like that after reading the cover > > letter since that's exactly what it is comparing to. :) > > > > There's definitely a difference view on user/kernel level of things, then > > it sounds reasonable to me if we add a new interface it by default has a > > stronger semantics otherwise we may not bother if with MADV_HUGEPAGE's > > existance. > > > > So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases > > as you said. And that's a real pure question I asked above, and I feel > > like your answer is actually "yes" we should always ignore the max_ptes_* > > fields until there's a proof that it'll be helpful. > > > >> > >> But this, as you mention, unsatisfactorily blurs the semantics of > >> MADV_COLLAPSE: "act like khugepaged here, but not here". > >> > >> As such, WDYT about the reverse-side of the coin of what you proposed: > >> to not couple the default behavior of MADV_COLLAPSE with khugepaged at > >> all? I.e. Not tie the allocation semantics to > >> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add > >> flags as necessary when/if a reimplementation of khugepaged in > >> userspace proves fruitful. > > > > Let's see whether others have thoughts, but what you proposed here makes > > sense to me. > > Hey David - thanks again for taking the time to read and comment. > Makes sense to me. IIUC, the whole handling of max_ptes_* is in place as > tunable because we don't know what user space is up to. > Agreed. > E.g., have with a very sparse memory layout, we don't want to waste > memory by allocating memory where we actually have no page populated yet > -- could be user space won't reuse that memory in the foreseeable > future. With too many swap entries, we don't want to trigger an > eventually unnecessary overhead of swapping in entries if user space > won't access them in the foreseeable future. Something similar applies > to max_ptes_shared, where one might just end up wasting a lot of memory > eventually in some applications. > > So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that > try figuring out what user space might be doing. We know exactly what > user space asks for -- and that can be documented properly. > Sounds good to me. Would you also be in favor of decoupling allocation semantics from khugepaged? I.e. we'll pick some default gfp flags and not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag? Thanks again, Zach > -- > Thanks, > > David / dhildenb >