On Wed, Jun 30, 2021 at 11:01 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote: > > Hi Suren, > > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > > In modern systems it's not unusual to have a system component monitoring > > memory conditions of the system and tasked with keeping system memory > > pressure under control. One way to accomplish that is to kill > > non-essential processes to free up memory for more important ones. > > Examples of this are Facebook's OOM killer daemon called oomd and > > Android's low memory killer daemon called lmkd. > > For such system component it's important to be able to free memory > > quickly and efficiently. Unfortunately the time process takes to free > > up its memory after receiving a SIGKILL might vary based on the state > > of the process (uninterruptible sleep), size and OPP level of the core > > the process is running. A mechanism to free resources of the target > > process in a more predictable way would improve system's ability to > > control its memory pressure. > > Introduce process_reap system call that reclaims memory of a dying process > > from the context of the caller. This way the memory in freed in a more > > controllable way with CPU affinity and priority of the caller. The workload > > of freeing the memory will also be charged to the caller. > > The operation is allowed only on a dying process. > > > > Previously I proposed a number of alternatives to accomplish this: > > - https://lore.kernel.org/patchwork/patch/1060407 extending > > pidfd_send_signal to allow memory reaping using oom_reaper thread; > > - https://lore.kernel.org/patchwork/patch/1338196 extending > > pidfd_send_signal to reap memory of the target process synchronously from > > the context of the caller; > > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED > > support for process_madvise implementing synchronous memory reaping. > > > > The end of the last discussion culminated with suggestion to introduce a > > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) > > The reasoning was that the new variant of process_madvise > > a) does not work on an address range > > b) is destructive > > c) doesn't share much code at all with the rest of process_madvise > > From the userspace point of view it was awkward and inconvenient to provide > > memory range for this operation that operates on the entire address space. > > Using special flags or address values to specify the entire address space > > was too hacky. > > > > The API is as follows, > > > > int process_reap(int pidfd, unsigned int flags); > > > > DESCRIPTION > > The process_reap() system call is used to free the memory of a > > dying process. > > > > The pidfd selects the process referred to by the PID file > > descriptor. > > (See pidofd_open(2) for further information) > > *pidfd_open Ack > > > > > The flags argument is reserved for future use; currently, this > > argument must be specified as 0. > > > > RETURN VALUE > > On success, process_reap() returns 0. On error, -1 is returned > > and errno is set to indicate the error. > > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx> > > Thanks for continuously pushing this. One question I have is how do > you envision this syscall to be used for the cgroup based workloads. > Traverse the target tree, read pids from cgroup.procs files, > pidfd_open them, send SIGKILL and then process_reap them. Is that > right? Yes, at least that's how Android does that. It's a bit more involved but it's a technical detail. Userspace low memory killer kills a process (sends SIGKILL and calls process_reap) and another system component detects that a process died and will kill all processes belonging to the same cgroup (that's how we identify related processes). > > Orthogonal to this patch I wonder if we should have an optimized way > to reap processes from a cgroup. Something similar to cgroup.kill (or > maybe overload cgroup.kill with reaping as well). Seems reasonable to me. We could use that in the above scenario. > > [...] > > > + > > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags) > > +{ > > + struct pid *pid; > > + struct task_struct *task; > > + struct mm_struct *mm = NULL; > > + unsigned int f_flags; > > + long ret = 0; > > + > > + if (flags != 0) > > + return -EINVAL; > > + > > + pid = pidfd_get_pid(pidfd, &f_flags); > > + if (IS_ERR(pid)) > > + return PTR_ERR(pid); > > + > > + task = get_pid_task(pid, PIDTYPE_PID); > > + if (!task) { > > + ret = -ESRCH; > > + goto put_pid; > > + } > > + > > + /* > > + * If the task is dying and in the process of releasing its memory > > + * then get its mm. > > + */ > > + task_lock(task); > > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { > > task_will_free_mem() is fine here but I think in parallel we should > optimize this function. At the moment it is traversing all the > processes on the machine. It is very normal to have tens of thousands > of processes on big machines, so it would be really costly when > reaping a bunch of processes. Hmm. But I think we still need to make sure that the mm is not shared with another non-dying process. IIUC that's the point of that traversal. Am I mistaken?