On Thu, Apr 11, 2019 at 8:23 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.

FWIW, I like Suren's general idea, but I was thinking of a different way
of exposing the same general functionality to userspace.

The way I look at it, it's very useful for an auto-balancing memory
system like Android (or, I presume, something that uses oomd) to recover
memory *immediately* after a SIGKILL instead of waiting for the process
to kill itself: a process's death can be delayed for a long time due to
factors like scheduling and being blocked in various uninterruptible
kernel-side operations.

Suren's proposal is basically about pulling forward in time page
reclamation that would happen anyway. What if we let userspace control
exactly when this reclamation happens?

I'm imagining a new* kernel facility that basically looks like this. It
lets lmkd determine for itself how much work the system should expend on
reclaiming memory from dying processes.

  size_t try_reap_dying_process(
      int pidfd,
      int flags, /* must be zero */
      size_t maximum_bytes_to_reap);

  Precondition: process is pending group-exit (someone already sent it SIGKILL)
  Postcondition: some memory reclaimed from the dying process
  Invariant: doesn't sleep; stops reaping after MAXIMUM_BYTES_TO_REAP

  -> success: returns number of bytes reaped
  -> failure: (size_t)-1
       EBUSY: couldn't get mmap_sem
       EINVAL: PIDFD isn't a pidfd or otherwise invalid arguments
       EPERM: process hasn't been sent SIGKILL: calling
              try_reap_dying_process on a process that isn't dying is illegal

Kernel-side, try_reap_dying_process would try-acquire mmap_sem and just
fail if it couldn't get it.
Once acquired, it would release "easy" pages (using the same check the
oom reaper uses) until it either ran out of pages or hit the
MAXIMUM_BYTES_TO_REAP cap. The purpose of MAXIMUM_BYTES_TO_REAP is to
let userspace bound from above the amount of time we spend reclaiming
pages. It'd be up to userspace to set policy on retries, the actual
value of the reap cap, the priority at which we run
try_reap_dying_process, and so on. We return the number of bytes we
managed to free this way so that lmkd can make an informed decision
about what to do next, e.g., kill something else or wait a little while.

Personally, I like this approach a bit more than recruiting the oom
reaper, because it doesn't affect any kind of emergency memory reserve
permission and because it frees us from having to think about whether
the oom reaper's thread priority is right for this particular job.

It also occurred to me that try_reap_dying_process might make a decent
shrinker callback. Shrinkers are there, AIUI, to reclaim memory that's
easy to free and that's not essential for correct kernel operation.
Usually, it's some kind of cache that meets these criteria. But the
private pages of a dying process also meet the criteria, don't they?
I'm imagining the shrinker just picking an arbitrary doomed (dying but
not yet dead) process and freeing some of its pages. I know there are
concerns about slow shrinkers causing havoc throughout the system, but
since this shrinker would be bounded above on CPU time and would never
block, I feel like it'd be pretty safe.

* insert standard missive about system calls being cheap, but we can
talk about the way in which we expose this functionality after we agree
that it's a good idea generally