Re: [PATCH v3 1/2] mm: introduce process_mrelease system call

Shakeel Butt <shakeelb@xxxxxxxxxx> · Fri, 23 Jul 2021 10:00:26 -0700

On Fri, Jul 23, 2021 at 9:09 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Fri, Jul 23, 2021 at 6:46 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> >
> > On Fri, Jul 23, 2021 at 1:53 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > [...]
> > > > However
> > > > retrying means issuing another syscall, so additional overhead...
> > > > I guess such "best effort" approach would be unusual for a syscall, so
> > > > maybe we can keep it as it is now and if such "do not block" mode is needed
> > > > we can use flags to implement it later?
> > >
> > > Yeah, an explicit opt-in via flags would be an option if that turns out
> > > to be really necessary.
> > >
> >
> > I am fine with keeping it as it is but we do need the non-blocking
> > option (via flags) to enable userspace to act more aggressively.
>
> I think you want to check memory conditions shortly after issuing
> kill/reap requests irrespective of mmap_sem contention. The reason is
> that even when memory release is not blocked, allocations from other
> processes might consume memory faster than we release it. For example,
> in Android we issue kill and start waiting on pidfd for its death
> notification. As soon as the process is dead we reassess the situation
> and possibly kill again. If the process is not dead within a
> configurable timeout we check conditions again and might issue more
> kill requests (IOW our wait for the process to die has a timeout). If
> process_mrelease() is blocked on mmap_sem, we might timeout like this.
> I imagine that a non-blocking option for process_mrelease() would not
> really change this logic.

On a containerized system, killing a job requires killing multiple
processes and then process_mrelease() them. Now there is cgroup.kill
to kill all the processes in a cgroup tree but we would still need to
process_mrelease() all the processes in that tree. There is a chance
that we get stuck in reaping the early process. Making
process_mrelease() non-blocking will enable the userspace to go to
other processes in the list.

An alternative would be to have a cgroup specific interface for
reaping similar to cgroup.kill.

> Adding such an option is trivial but I would like to make sure it's
> indeed useful. Maybe after the syscall is in place you can experiment
> with it and see if such an option would really change the way you use
> it?

SGTM.