On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote: > > Hey Peter, > > > > Thanks for taking the time to review! > > > > On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > Hi, Zach, > > > > > > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote: > > > > process_madvise(2) > > > > > > > > Performs a synchronous collapse of the native pages > > > > mapped by the list of iovecs into transparent hugepages. > > > > > > > > Allocation semantics are the same as khugepaged, and depend on > > > > (1) the active sysfs settings > > > > /sys/kernel/mm/transparent_hugepage/enabled and > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2) > > > > the VMA flags of the memory range being collapsed. > > > > > > > > Collapse eligibility criteria differs from khugepaged in that > > > > the sysfs files > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared] > > > > are ignored. > > > > > > The userspace khugepaged idea definitely makes sense to me, though I'm > > > curious how the line is drown on the different behaviors here by explicitly > > > ignoring the max_ptes_* entries. > > > > > > Let's assume the initiative is to duplicate a more data-aware khugepaged in > > > the userspace, then IMHO it makes more sense to start with all the policies > > > that applies to khugepaged already, including max_pte_*. > > > > > > I can understand the willingness to provide even stronger semantics here > > > than khugepaged since the userspace could have very clear knowledge of how > > > to provision the memories (better than a kernel scanner). It's just that > > > IMHO it could be slightly confusing if the new interface only partially > > > apply the khugepaged rules. > > > > > > No strong opinion here. It could already been a trade-off after the > > > discussion from the RFC with Michal which I read.. Just curious about how > > > you made that design decision so feel free to read it as a pure question. > > > > > > > Understand your point here. The allocation and max_pte_* semantics are > > split between khugepaged-like and fault-like, respectively - which > > could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS > > flag to control the former's behavior, but agreed to keep things > > simple to start, and expand the interface if/when necessary. I opted > > to ignore max_ptes_* as the default since I envisioned that early > > adopters would "just want it to work". One such example would be > > backing executable text by hugepages on program load when many pages > > haven't been demand-paged in yet. > > > > What do you think? > > I'm just slightly worried that'll make the default MADV_COLLAPSE semantics > blurred. > > To me, a clean default definition for MADV_COLLAPSE would be nice, as "do > khugepaged on this range, and with current thread context". IMHO any > feature bits then can be supplementing special needs, and I'll take the thp > backing executable example to be one of the (good?) reason we'd need an > extra flag for ignoring the max_ptes_* knobs. > > So personally if I were you maybe I'll start with the simple scheme of that > (even if it won't immediately service a thing) but then add either the > defrag or ignore_max_ptes_* as feature bits later on, with clear use case > descriptions about why we need each of the feature flags. IMHO numbers > would be even more helpful when there's specific use cases on the show. > > Or, perhaps you think all potential MADV_COLLAPSE users should literally > skip max_ptes_* limitations always? > Thanks for your time and valuable feedback here, Peter. I had a response typed up, but after a few iterations became increasingly unsatisfied with my own response. I think this feature should be able to stand on its own without consideration of a userspace khugepaged, as we have existing concrete examples where it would be useful. In these cases, and I assume almost all other use-cases outside userspace khugepaged, max_ptes_* should be ignored as the fundamental assumption of MADV_COLLAPSE is that the user knows better, and IMHO, khugepaged heuristics shouldn't tell users they are wrong. But this, as you mention, unsatisfactorily blurs the semantics of MADV_COLLAPSE: "act like khugepaged here, but not here". As such, WDYT about the reverse-side of the coin of what you proposed: to not couple the default behavior of MADV_COLLAPSE with khugepaged at all? I.e. Not tie the allocation semantics to /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add flags as necessary when/if a reimplementation of khugepaged in userspace proves fruitful. Thanks for your time and input, Zach > Anyway, I won't pretend I am an expert in this area. :) So please take that > with a grain of salt. > > Thanks, > > -- > Peter Xu >