On Fri, Apr 12, 2019 at 7:14 AM Daniel Colascione <dancol@xxxxxxxxxx> wrote: > > On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote: > > > > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote: > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > signal and only to privileged users. > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > every time a process is going to die? > > > > Well, you are tearing down an address space which might be still in use > > because the task not fully dead yeat. So there are two downsides AFAICS. > > Core dumping which will not see the reaped memory so the resulting > > Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you > try to start a core dump after reaping begins, too bad: you could have > raced with process death anyway. > > > coredump might be incomplete. And unexpected #PF/gup on the reaped > > memory will result in SIGBUS. > > It's a dying process. Why even bother returning from the fault > handler? Just treat that situation as a thread exit. There's no need > to make this observable to userspace at all. I've spent some more time to investigate possible effects of reaping on coredumps and asked Oleg Nesterov who worked on patchsets that prioritize SIGKILLs over coredump activity (https://lkml.org/lkml/2013/2/17/118). Current do_coredump implementation seems to handle the case of SIGKILL interruption by bailing out whenever dump_interrupted() returns true and that would be the case with pending SIGKILL. So in the case of race when coredump happens first and SIGKILL comes next interrupting the coredump seems to result in no change in behavior and reaping memory proactively seems to have no side effects. An opposite race when SIGKILL gets posted and then coredump happens seems impossible because do_coredump won't be called from get_signal due to signal_group_exit check (get_signal checks signal_group_exit while holding sighand->siglock and complete_signal sets SIGNAL_GROUP_EXIT while holding the same lock). There is a path from __seccomp_filter calling do_coredump while processing SECCOMP_RET_KILL_xxx but even then it should bail out when coredump_wait()->zap_threads(current) checks signal_group_exit(). (Thanks Oleg for clarifying this for me). If we are really concerned about possible increase in failed coredumps because of the proactive reaping I could make it conditional on whether coredumping mm is possible by placing this feature behind !get_dumpable(mm) condition. Another possibility is to check RLIMIT_CORE to decide if coredumps are possible (although if pipe is used for coredump that limit seems to be ignored, so that check would have to take this into consideration). On the issue of SIGBUS happening when accessed memory was already reaped, my understanding that SIGBUS being a synchronous signal will still have to be fetched using dequeue_synchronous_signal from get_signal but not before signal_group_exit is checked. So again if SIGKILL is pending I think SIGBUS will be ignored (please correct me if that's not correct). One additional question I would like to clarify is whether per-node reapers like Roman suggested would make a big difference (All CPUs that I've seen used for Android are single-node ones, so looking for more feedback here). If it's important then reaping victim's memory by the killer is probably not an option.