[please CC linux-api if you are going to repost with the fix suggested by Nadav] On Thu 03-03-22 16:47:34, Johannes Weiner wrote: > On Thu, Mar 03, 2022 at 04:29:56PM -0500, Johannes Weiner wrote: > > MADV_DONTNEED historically rejects mlocked ranges, but with > > MLOCK_ONFAULT and MCL_ONFAULT allowing to mlock without populating, > > there are valid use cases for depopulating locked ranges as well. > > > > Users mlock memory to protect secrets. There are allocators for secure > > buffers that want in-use memory generally mlocked, but cleared and > > invalidated memory to give up the physical pages. This could be done > > with explicit munlock -> mlock calls on free -> alloc of course, but > > that adds two unnecessary syscalls, heavy mmap_sem write locks, vma > > splits and re-merges - only to get rid of the backing pages. > > > > Users also mlockall(MCL_ONFAULT) to suppress sustained paging, but are > > okay with on-demand initial population. It seems valid to selectively > > free some memory during the lifetime of such a process, without having > > to mess with its overall policy. > > > > Why add a separate flag? Isn't this a pretty niche usecase? > > > > - MADV_DONTNEED has been bailing on locked vmas forever. It's at least > > conceivable that someone, somewhere is relying on mlock to protect > > data from perhaps broader invalidation calls. Changing this behavior > > now could lead to quiet data corruption. > > > > - It also clarifies expectations around MADV_FREE and maybe > > MADV_REMOVE. It avoids the situation where one quietly behaves > > different than the others. MADV_FREE_LOCKED can be added later. > > > > - The combination of mlock() and madvise() in the first place is > > probably niche. But where it happens, I'd say that dropping pages > > from a locked region once they don't contain secrets or won't page > > anymore is much saner than relying on mlock to protect memory from > > speculative or errant invalidation calls. It's just that we can't > > change the default behavior because of the two previous points. > > > > Given that, an explicit new flag seems to make the most sense. > > > > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> > > Just for context, I found this discussion back from 2018: > > https://lkml.iu.edu/hypermail/linux/kernel/1806.1/00483.html > > It seems to me that the usecase wasn't really in question, but people > weren't sure about the API, and then Jason found a workaround before > the discussion really concluded. I was asked internally about this > feature, so I'm submitting another patch in this direction, but with > more thoughts on why I chose to go with a new flag. Hopefully we can > work it out this time around :-) Thanks for the link. The topic sounded familiar but I couldn't really remember any details anymore. Now I do remember that I wasn't happy about special casing MLOCK_ONFAULT. A dedicated madvise operation is definitely safer and I am OK with that. Presented usecases make sense to me as well. Btw. I have a recollection that Mike is working on MADV_DONTNEED support for hugetlb pages. I do not know the current state of that work. Not that it would make nay impact on your new flag but some minor changes might be needed. Anyway, after the madvise_need_mmap_write is addressed, feel free to add Acked-by: Michal Hocko <mhocko@xxxxxxxx> Thanks! -- Michal Hocko SUSE Labs