On Tue, Mar 12, 2019 at 9:37 AM Sultan Alsawaf <sultan@xxxxxxxxxxxxxxx> wrote:
>
> On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > The only way to control the OOM behavior pro-actively is to throttle
> > allocation speed. We have memcg high limit for that purpose. Along with
> > PSI, I can imagine a reasonably working user space early oom
> > notifications and reasonable acting upon that.
>
> The issue with pro-active memory management that prompted me to create this was
> poor memory utilization. All of the alternative means of reclaiming pages in the
> page allocator's slow path turn out to be very useful for maximizing memory
> utilization, which is something that we would have to forgo by relying on a
> purely pro-active solution. I have not had a chance to look at PSI yet, but
> unless a PSI-enabled solution allows allocations to reach the same point as when
> the OOM killer is invoked (which is contradictory to what it sets out to do),
> then it cannot take advantage of all of the alternative memory-reclaim means
> employed in the slowpath, and will result in killing a process before it is
> _really_ necessary.

There are two essential parts of a lowmemorykiller implementation: when to kill and how to kill.

There are a million possible approaches to decide when to kill an unimportant process. They usually trade off between the same two failure modes depending on the workload. If you kill too aggressively, a transient spike that could be imperceptibly absorbed by evicting some file pages or moving some pages to ZRAM will result in killing processes, which then get started up later at a performance/battery cost. If you don't kill aggressively enough, you will encounter a workload that thrashes the page cache, constantly evicting and reloading file pages and moving things in and out of ZRAM, which makes the system unusable when a process should have been killed instead.
As far as I've seen, any methodology that uses single points in time to decide when to kill, without completely biasing toward one failure mode or the other, is susceptible to both. The minfree approach used by lowmemorykiller/lmkd certainly is; it is both too aggressive for some workloads and not aggressive enough for others. My guess is that simple LMK won't kill on transient spikes but will be extremely susceptible to page cache thrashing. That is not an improvement; page cache thrashing manifests as the entire system running very slowly.

What you actually want from lowmemorykiller/lmkd on Android is to kill only once it becomes clear that the system will continue trying to reclaim memory to the extent that it impacts what the user actually cares about. That means tracking how much time is spent in reclaim/paging operations and the like, and that's exactly what PSI does. lmkd has had support for PSI as a replacement for vmpressure as a wakeup trigger (to check current memory levels against the minfree thresholds) since early February. It works fine; unsurprisingly, it's better than vmpressure at avoiding false wakeups. Longer term, there's a lot of work to be done in lmkd to turn PSI into a kill trigger and remove minfree entirely. It's tricky (mainly because of the "when to kill another process" problem discussed later), but I believe it's feasible.

How to kill is similarly messy. The latency of reclaiming memory after SIGKILL can be severe (usually tens of milliseconds, occasionally >100ms). The latency we see on Android usually isn't because the victim's threads are blocked in uninterruptible sleep; it's because times of memory pressure are also usually times of significant CPU contention, and these are overwhelmingly CFS threads, some of which may be assigned a very low priority. lmkd now sets priorities and resets cpusets upon killing a process, and we have seen improved reclaim latency because of this.
oom reaper might be a good approach to avoid this latency (I think some in-kernel lowmemorykiller implementations rely on it), but we can't use it from userspace. Something for future consideration.

A non-obvious consequence of both of these concerns is that deciding when to kill a second process is a distinct and more difficult problem than deciding when to kill the first. A second process should be killed only if reclaim from the first process has finished and the memory reclaimed was insufficient to avoid perceptible impact. Identifying whether memory pressure continues at the same level can probably be handled through multiple PSI monitors with different thresholds and window lengths, but this is an area of future work. Knowing whether a SIGKILL'd process has finished reclaiming is, as far as I know, not possible without something like procfds. That's where the 100ms timeout in lmkd comes in. lowmemorykiller and lmkd both attempt to wait up to 100ms for reclaim to finish by checking for the continued existence of the thread that received the SIGKILL, but this really means that they wait up to 100ms for the _thread_ to exit, which tells you nothing about the memory still held by the rest of the process. If those threads terminate early and lowmemorykiller/lmkd get a signal to kill again, then there may be two processes competing for CPU time to reclaim memory. That doesn't reclaim any faster and may result in an unnecessary kill.

So, in summary, the impactful LMK improvements seem like:

- get lmkd and PSI to the point that lmkd can use PSI signals as a kill trigger and remove all static memory thresholds from lmkd completely.
  I think this is mostly on the lmkd side, but there may be some PSI or PSI monitor changes that would help
- give userspace some path to start reclaiming memory without waiting for every thread in a process to be scheduled -- could be oom reaper, could be something else
- offer a way to wait for process termination, so lmkd can tell when reclaim has finished and know when killing another process is appropriate

_______________________________________________
devel mailing list
devel@xxxxxxxxxxxxxxxxxxxxxx
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel