On Tue, May 21, 2019 at 08:17:43AM +0200, Michal Hocko wrote: > On Tue 21-05-19 11:41:07, Minchan Kim wrote: > > On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote: > > > [Cc linux-api] > > > > > > On Mon 20-05-19 12:52:52, Minchan Kim wrote: > > > > There is some usecase that centralized userspace daemon want to give > > > > a memory hint like MADV_[COOL|COLD] to other process. Android's > > > > ActivityManagerService is one of them. > > > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > > required to make the reclaim decision is not known to the app. Instead, > > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > > and that daemon must be able to initiate reclaim on its own without > > > > any app involvement. > > > > > > Could you expand some more about how this all works? How does the > > > centralized daemon track respective ranges? How does it synchronize > > > against parallel modification of the address space etc. > > > > Currently, we don't track each address ranges because we have two > > policies at this moment: > > > > deactive file pages and reclaim anonymous pages of the app. > > > > Since the daemon has a ability to let background apps resume(IOW, process > > will be run by the daemon) and both hints are non-disruptive stabilty point > > of view, we are okay for the race. > > Fair enough but the API should consider future usecases where this might > be a problem. So we should really think about those potential scenarios > now. If we are ok with that, fine, but then we should be explicit and > document it that way. Essentially say that any sort of synchronization > is supposed to be done by monitor. This will make the API less usable > but maybe that is enough. Okay, I will add more about that in the description. > > > > > To solve the issue, this patch introduces new syscall process_madvise(2) > > > > which works based on pidfd so it could give a hint to the exeternal > > > > process. > > > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise); > > > > > > OK, this makes some sense from the API point of view. When we have > > > discussed that at LSFMM I was contemplating about something like that > > > except the fd would be a VMA fd rather than the process. We could extend > > > and reuse /proc/<pid>/map_files interface which doesn't support the > > > anonymous memory right now. > > > > > > I am not saying this would be a better interface but I wanted to mention > > > it here for a further discussion. One slight advantage would be that > > > you know the exact object that you are operating on because you have a > > > fd for the VMA and we would have a more straightforward way to reject > > > operation if the underlying object has changed (e.g. unmapped and reused > > > for a different mapping). > > > > I agree your point. If I didn't miss something, such kinds of vma level > > modify notification doesn't work even file mapped vma at this moment. > > For anonymous vma, I think we could use userfaultfd, pontentially. > > It would be great if someone want to do with disruptive hints like > > MADV_DONTNEED. > > > > I'd like to see it further enhancement after landing address range based > > operation via limiting hints process_madvise supports to non-disruptive > > only(e.g., MADV_[COOL|COLD]) so that we could catch up the usercase/workload > > when someone want to extend the API. > > So do you think we want both interfaces (process_madvise and madvisefd)? What I have in mind is to extend process_madvise later like this struct pr_madvise_param { int size; /* the size of this structure */ union { const struct iovec __user *vec; /* address range array */ int fd; /* supported from 6.0 */ } } with introducing new hint Or-able PR_MADV_RANGE_FD, so that process_madvise can go with fd instead of address range.