Re: [RFC]: userspace memory reaping

Michal Hocko <mhocko@xxxxxxxx> · Thu, 15 Oct 2020 11:20:30 +0200

On Wed 14-10-20 09:57:20, Suren Baghdasaryan wrote:
> On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
[...]
> > > > The need is similar to why oom-reaper was introduced - when a process
> > > > is being killed to free memory we want to make sure memory is freed
> > > > even if the victim is in uninterruptible sleep or is busy and reaction
> > > > to SIGKILL is delayed by an unpredictable amount of time. I
> > > > experimented with enabling process_madvise(MADV_DONTNEED) operation
> > > > and using it to force memory reclaim of the target process after
> > > > sending SIGKILL. Unfortunately this approach requires the caller to
> > > > read proc/pid/maps to extract the list of VMAs to pass as an input to
> > > > process_madvise().
> >
> > Well I would argue that this is not really necessary. You can simply
> > call process_madvise with the full address range and let the kernel
> > operated only on ranges which are safe to tear down asynchronously.
> > Sure that would require some changes to the existing code to not fail
> > on those ranges if they contain incompatible vmas but that should be
> > possible. If we are worried about backward compatibility then a
> > dedicated flag could override.
> >
> 
> IIUC this is very similar to the last option I proposed. I think this
> is doable if we treat it as a special case. process_madvise() return
> value not being able to handle a large range would still be a problem.
> Maybe we can return MAX_INT in those cases?

madvise is documented to return 
       On success, madvise() returns zero.  On error, it returns -1 and
       errno is set appropriately.
[...]
NOTES
   Linux notes
       The Linux implementation requires that the address addr be
       page-aligned, and allows length to be zero.  If there are some
       parts of the specified address range that are not mapped, the
       Linux version of madvise() ignores them and applies the call to
       the rest (but returns ENOMEM from the system call, as it should).

I have learned about ENOMEM case only now. And it seems this is indeed
what we are implementing. So if we want to add a new mode to
opportunistically attempt madvise on the whole given range without a
failure then we need a specific flag for that. Advice is a number rather
than a bitmask but (ab)using the top bit or use negative number space
(e.g. -MADV_DONTNEED) for that sounds possible albeit bit hackish.

[...]
> > I do have a vague recollection that we have discussed a kill(2) based
> > approach as well in the past. Essentially SIG_KILL_SYNC which would
> > not only send the signal but it would start a teardown of resources
> > owned by the task - at least those we can remove safely. The interface
> > would be much more simple and less tricky to use. You just make your
> > userspace oom killer or potentially other users call SIG_KILL_SYNC which
> > will be more expensive but you would at least know that as many
> > resources have been freed as the kernel can afford at the moment.
> 
> Correct, my early RFC here
> https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@xxxxxxxxxx
> was using a new flag for pidfd_send_signal() to request mm reaping by
> oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> signal instead of a new pidfd_send_signal() flag and otherwise a very
> similar solution. Is my understanding correct?

Well, I think you shouldn't focus too much on the oom-reaper aspect
of it. Sure it can be used for that but I believe that a new signal
should provide a sync behavior. People more familiar with the process
management would be better off defining what is possible for a new sync
signal.  Ideally not only pro-active process destruction but also sync
waiting until the target process is released so that you know that once
kill syscall returns the process is gone.

-- 
Michal Hocko
SUSE Labs