On Fri, Jul 23, 2021 at 9:09 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > On Fri, Jul 23, 2021 at 6:46 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote: > > > > On Fri, Jul 23, 2021 at 1:53 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > > > > > [...] > > > > However > > > > retrying means issuing another syscall, so additional overhead... > > > > I guess such "best effort" approach would be unusual for a syscall, so > > > > maybe we can keep it as it is now and if such "do not block" mode is needed > > > > we can use flags to implement it later? > > > > > > Yeah, an explicit opt-in via flags would be an option if that turns out > > > to be really necessary. > > > > > > > I am fine with keeping it as it is but we do need the non-blocking > > option (via flags) to enable userspace to act more aggressively. > > I think you want to check memory conditions shortly after issuing > kill/reap requests irrespective of mmap_sem contention. The reason is > that even when memory release is not blocked, allocations from other > processes might consume memory faster than we release it. For example, > in Android we issue kill and start waiting on pidfd for its death > notification. As soon as the process is dead we reassess the situation > and possibly kill again. If the process is not dead within a > configurable timeout we check conditions again and might issue more > kill requests (IOW our wait for the process to die has a timeout). If > process_mrelease() is blocked on mmap_sem, we might timeout like this. > I imagine that a non-blocking option for process_mrelease() would not > really change this logic. On a containerized system, killing a job requires killing multiple processes and then process_mrelease() them. Now there is cgroup.kill to kill all the processes in a cgroup tree but we would still need to process_mrelease() all the processes in that tree. There is a chance that we get stuck in reaping the early process. Making process_mrelease() non-blocking will enable the userspace to go to other processes in the list. An alternative would be to have a cgroup specific interface for reaping similar to cgroup.kill. > Adding such an option is trivial but I would like to make sure it's > indeed useful. Maybe after the syscall is in place you can experiment > with it and see if such an option would really change the way you use > it? SGTM.