On Tue, Apr 19, 2022 at 1:03 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> >> E.g., with a very sparse memory layout, we don't want to waste
> >> memory by allocating memory where we actually have no page populated yet
> >> -- could be user space won't reuse that memory in the foreseeable
> >> future. With too many swap entries, we don't want to trigger the
> >> potentially unnecessary overhead of swapping in entries if user space
> >> won't access them in the foreseeable future. Something similar applies
> >> to max_ptes_shared, where one might just end up wasting a lot of memory
> >> in some applications.
> >>
> >> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> >> try figuring out what user space might be doing. We know exactly what
> >> user space asks for -- and that can be documented properly.
> >>
>
> Just a thought, if we ever want to implement khugepaged in user space,
> it could theoretically obtain similar information using e.g., the
> pagemap. It wouldn't be race-free, but the question is if it would matter.
>
> I consider the primary use case to be giving an application more precise
> control over actual THP placement.
>

Good point about the pagemap and agree about the primary use case - I'll
make that clear in the v3 cover letter.

> >
> > Sounds good to me. Would you also be in favor of decoupling allocation
> > semantics from khugepaged? I.e. we'll pick some default gfp flags and
> > not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?
>
> Good question. It's not really a heuristic like that other stuff.
>
> Easy answer: we're not dealing with khugepaged, so anything in
> /sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?
>

That's what I'm thinking now too. If there are no objections, I'll
proceed in that direction for v3.

> Sure, we could have separate toggles for MADV_COLLAPSE.
>
> Maybe we simply want a dedicated syscall where we can specify additional
> options ... but maybe that simply over-complicates the problem.
>

Thankfully process_madvise(2) has flags, and madvise(2) users can always
migrate to using process_madvise(2) on self. Piggy-backing off madvise
infrastructure for these "non-advice actions" (e.g. MADV_PAGEOUT) seems
to be the norm.

Thanks as always for your time and thoughts!

Zach

> --
> Thanks,
>
> David / dhildenb
>
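
For illustration only, the pagemap idea mentioned above could look
roughly like the sketch below. It is not from the patch series. It
assumes the bit layout documented in
Documentation/admin-guide/mm/pagemap.rst (bit 63 present, bit 62
swapped, bit 56 exclusively mapped), and the names pte_summary,
summarize_entries() and summarize_range() are made up for this example.
Treating "present but not exclusively mapped" as "shared" is only an
approximation of what khugepaged checks, and, as noted above, the
result is not race-free.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit layout per Documentation/admin-guide/mm/pagemap.rst. */
#define PM_PRESENT   (1ULL << 63)  /* page is present in RAM */
#define PM_SWAPPED   (1ULL << 62)  /* page is in swap */
#define PM_EXCLUSIVE (1ULL << 56)  /* page exclusively mapped (since 4.2) */

#define MAX_PAGES 512              /* one PMD worth of 4 KiB pages (assumption) */

/* Hypothetical per-range counters, loosely mirroring the khugepaged knobs. */
struct pte_summary {
    size_t none;     /* neither present nor swapped */
    size_t swapped;  /* swap entries */
    size_t shared;   /* present but not exclusively mapped (approximation) */
};

/* Classify n raw pagemap entries; pure function, no I/O. */
void summarize_entries(const uint64_t *e, size_t n, struct pte_summary *s)
{
    assert(e != NULL && s != NULL);
    s->none = s->swapped = s->shared = 0;
    for (size_t i = 0; i < n; i++) {
        if (e[i] & PM_PRESENT) {
            if (!(e[i] & PM_EXCLUSIVE))
                s->shared++;
        } else if (e[i] & PM_SWAPPED) {
            s->swapped++;
        } else {
            s->none++;
        }
    }
}

/*
 * Read the pagemap entries for npages pages starting at addr in the
 * calling process and summarize them. Returns 0 on success, -1 on error.
 * The snapshot can be stale by the time the caller acts on it.
 */
int summarize_range(const void *addr, size_t npages, struct pte_summary *s)
{
    uint64_t buf[MAX_PAGES];
    long psz = sysconf(_SC_PAGESIZE);

    if (psz <= 0 || npages == 0 || npages > MAX_PAGES)
        return -1;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* One 64-bit entry per virtual page, indexed by virtual page number. */
    off_t off = (off_t)(((uintptr_t)addr / (uintptr_t)psz) * sizeof(uint64_t));
    ssize_t want = (ssize_t)(npages * sizeof(uint64_t));
    ssize_t got = pread(fd, buf, (size_t)want, off);
    close(fd);

    if (got != want)
        return -1;

    summarize_entries(buf, npages, s);
    return 0;
}
```

A user-space policy could call summarize_range() on a hugepage-aligned
candidate range, compare the counts against thresholds of its own
choosing (analogous to max_ptes_none, max_ptes_swap and
max_ptes_shared), and only then request the collapse.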