Hi Tim,

Thanks for the detailed and excellent write-up. It will serve as a good
future reference for low memory killer requirements. I made some comments
below on the "how to kill" part.

On Tue, Mar 12, 2019 at 10:17 AM Tim Murray <timmurray@xxxxxxxxxx> wrote:
>
> On Tue, Mar 12, 2019 at 9:37 AM Sultan Alsawaf <sultan@xxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > > The only way to control the OOM behavior pro-actively is to throttle
> > > allocation speed. We have memcg high limit for that purpose. Along with
> > > PSI, I can imagine a reasonably working user space early oom
> > > notifications and reasonable acting upon that.
> >
> > The issue with pro-active memory management that prompted me to create this was
> > poor memory utilization. All of the alternative means of reclaiming pages in the
> > page allocator's slow path turn out to be very useful for maximizing memory
> > utilization, which is something that we would have to forgo by relying on a
> > purely pro-active solution. I have not had a chance to look at PSI yet, but
> > unless a PSI-enabled solution allows allocations to reach the same point as when
> > the OOM killer is invoked (which is contradictory to what it sets out to do),
> > then it cannot take advantage of all of the alternative memory-reclaim means
> > employed in the slowpath, and will result in killing a process before it is
> > _really_ necessary.
>
> There are two essential parts of a lowmemorykiller implementation:
> when to kill and how to kill.
>
> There are a million possible approaches to decide when to kill an
> unimportant process. They usually trade off between the same two
> failure modes depending on the workload.
>
> If you kill too aggressively, a transient spike that could be
> imperceptibly absorbed by evicting some file pages or moving some
> pages to ZRAM will result in killing processes, which then get started
> up later and have a performance/battery cost.
>
> If you don't kill aggressively enough, you will encounter a workload
> that thrashes the page cache, constantly evicting and reloading file
> pages and moving things in and out of ZRAM, which makes the system
> unusable when a process should have been killed instead.
>
> As far as I've seen, any methodology that uses single points in time
> to decide when to kill without completely biasing toward one or the
> other is susceptible to both. The minfree approach used by
> lowmemorykiller/lmkd certainly is; it is both too aggressive for some
> workloads and not aggressive enough for other workloads. My guess is
> that simple LMK won't kill on transient spikes but will be extremely
> susceptible to page cache thrashing. This is not an improvement; page
> cache thrashing manifests as the entire system running very slowly.
>
> What you actually want from lowmemorykiller/lmkd on Android is to only
> kill once it becomes clear that the system will continue to try to
> reclaim memory to the extent that it could impact what the user
> actually cares about. That means tracking how much time is spent in
> reclaim/paging operations and the like, and that's exactly what PSI
> does. lmkd has had support for using PSI as a replacement for
> vmpressure for use as a wakeup trigger (to check current memory levels
> against the minfree thresholds) since early February. It works fine;
> unsurprisingly it's better than vmpressure at avoiding false wakeups.
>
> Longer term, there's a lot of work to be done in lmkd to turn PSI into
> a kill trigger and remove minfree entirely. It's tricky (mainly
> because of the "when to kill another process" problem discussed
> later), but I believe it's feasible.
>
> How to kill is similarly messy. The latency of reclaiming memory post
> SIGKILL can be severe (usually tens of milliseconds, occasionally
> >100ms).
> The latency we see on Android usually isn't because those
> threads are blocked in uninterruptible sleep, it's because times of
> memory pressure are also usually times of significant CPU contention
> and these are overwhelmingly CFS threads, some of which may be
> assigned a very low priority. lmkd now sets priorities and resets
> cpusets upon killing a process, and we have seen improved reclaim
> latency because of this. oom reaper might be a good approach to avoid
> this latency (I think some in-kernel lowmemorykiller implementations
> rely on it), but we can't use it from userspace. Something for future
> consideration.
>

This makes sense. If the process receiving the SIGKILL does not get CPU
time, then the kernel may not be able to execute the unconditional
signal-handling paths in the context of the victim process to free the
memory.

I don't see how the proc-fds approach will solve this though. Say you
have process L (which is lmkd) sending a SIGKILL to process V (the
victim). Unless V executes the signal-handling code in kernel context and
is scheduled at a high enough priority to get CPU time, I don't think the
SIGKILL will be processed.

The exact path that the process being killed executes to free its memory
is:

  do_signal -> get_signal -> do_group_exit -> do_exit -> mmput

This needs to execute in the context of V, which needs to get CPU time to
do so.

So my point is that to be notified of process death, you still need the
SIGKILL to be processed. You may therefore still need to make sure the
task is at a high enough priority and that the scheduler puts it on the
CPU. Only *after that* can the proc-fds notification mechanism (or
whichever notification mechanism is used) kick in.

Speaking of which, I wonder if the scheduler should special-case
SIGKILLed threads as higher priority automatically so that they get CPU
time, but I don't know if this could open the door to denial-of-service
attacks. I don't know if it does something like this already.
Peter should know this right off the bat, and he is on CC so he can
comment more.

About the 100ms latency, I wonder whether it is that high because of the
way Android's lmkd observes that a process has died. There is a gap
between when a process's memory is freed and when the process disappears
from the process table. Once a process is SIGKILLed, it becomes a zombie.
Its memory is freed right away while the SIGKILL is being handled (I
traced this, so that's how I know), but until it is reaped by its parent,
it will still exist in /proc/<pid>. So if testing the existence of
/proc/<pid> is how Android observes that the process died, then there can
be a large latency: the parent may take a very long time to actually reap
the child, long after its memory was freed.

A quicker way to know whether a process's memory has been freed before it
is reaped could be to read back the victim's /proc/<pid>/maps from
userspace; that file will be empty for zombie processes. Then one does
not need to wait for the parent to reap it. I wonder how much of the
100ms you mentioned is actually "waiting for the parent to reap the
child" rather than "memory freeing time". So yeah, for this second
problem, the procfds work will help.

By the way, another approach that can provide a quick and asynchronous
notification of when the process's memory is freed is to monitor the
sched_process_exit trace event using eBPF. You can tell eBPF the PID that
you want to monitor before sending the SIGKILL. As soon as the process
dies and its memory is freed, the eBPF program can send a notification to
user space (using the perf_events polling infra). The sched_process_exit
event fires just after the mmput() happens, so it is quite close to when
the memory is reclaimed. This also doesn't need any kernel changes. I
could come up with a prototype for this and benchmark it on Android, if
you want. Just let me know.
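
To make the /proc/<pid>/maps idea above a bit more concrete, here is a
rough Python sketch. It is not lmkd code: the function names, the timeout
and the polling interval are all made up for illustration. It assumes
Linux with /proc mounted and that the caller is allowed to read the
victim's maps file (true for a same-UID child as in the demo; a real
killer daemon would need suitable privileges). It also only looks at the
thread-group leader's entry, so for a multi-threaded victim I believe an
empty file is a hint rather than proof that every thread has dropped the
address space.

```python
import os
import signal
import subprocess
import time


def maps_is_empty(pid):
    """Return True if /proc/<pid>/maps reads back empty or is already gone."""
    try:
        with open(f"/proc/{pid}/maps") as f:
            return f.read() == ""
    except (FileNotFoundError, ProcessLookupError):
        # The task has already been reaped, so its address space is gone too.
        return True


def kill_and_wait_for_mm_release(pid, timeout_s=1.0, poll_s=0.001):
    """SIGKILL pid, then poll its maps file instead of waiting for the reap.

    Returns True once the maps file is empty, False if timeout_s expires.
    (Hypothetical helper; the default timings are arbitrary.)
    """
    os.kill(pid, signal.SIGKILL)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if maps_is_empty(pid):
            return True
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    victim = subprocess.Popen(["sleep", "60"])  # stand-in victim process
    released = kill_and_wait_for_mm_release(victim.pid)
    # Until the parent reaps it, the zombie is still listed in /proc.
    still_listed = os.path.exists(f"/proc/{victim.pid}")
    print("mm released:", released, "| /proc entry present:", still_listed)
    victim.wait()  # reap the zombie
```

This is still polling, so it only narrows the window compared to waiting
for /proc/<pid> to disappear; an event-driven notification would avoid
the polling altogether.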
thanks,

 - Joel

> A non-obvious consequence of both of these concerns is that when to
> kill a second process is a distinct and more difficult problem than
> when to kill the first. A second process should be killed if reclaim
> from the first process has finished and there has been insufficient
> memory reclaimed to avoid perceptible impact. Identifying whether
> memory pressure continues at the same level can probably be handled
> through multiple PSI monitors with different thresholds and window
> lengths, but this is an area of future work.
>
> Knowing whether a SIGKILL'd process has finished reclaiming is as far
> as I know not possible without something like procfds. That's where
> the 100ms timeout in lmkd comes in. lowmemorykiller and lmkd both
> attempt to wait up to 100ms for reclaim to finish by checking for the
> continued existence of the thread that received the SIGKILL, but this
> really means that they wait up to 100ms for the _thread_ to finish,
> which doesn't tell you anything about the memory used by that process.
> If those threads terminate early and lowmemorykiller/lmkd get a signal
> to kill again, then there may be two processes competing for CPU time
> to reclaim memory. That doesn't reclaim any faster and may be an
> unnecessary kill.
>
> So, in summary, the impactful LMK improvements seem like
>
> - get lmkd and PSI to the point that lmkd can use PSI signals as a
> kill trigger and remove all static memory thresholds from lmkd
> completely. I think this is mostly on the lmkd side, but there may be
> some PSI or PSI monitor changes that would help
> - give userspace some path to start reclaiming memory without waiting
> for every thread in a process to be scheduled--could be oom reaper,
> could be something else
> - offer a way to wait for process termination so lmkd can tell when
> reclaim has finished and know when killing another process is
> appropriate