Hi all, I mentioned before that I had some ideas to overhaul lowmemorykiller, which would have the side effect of getting it out of the kernel. I've been working through some prototypes over the past few weeks (actually started before Michal sent his patch out), and I'd appreciate some feedback on what I'd like to do so I can start working on more complete patches. First, Michal has mentioned why the current lowmemorykiller implementation is bad. However, the design and implementation of lowmemorykiller is bad for Android users as well. Rather than fixing lowmemorykiller in the kernel or enabling an equivalent reimplementation of lowmemorykiller in userspace, I think we can solve the Android problems and remove lowmemorykiller from the tree at the same time. What's wrong with lowmemorykiller from an Android user's POV? 1. lowmemorykiller can be way too aggressive when there are transient spikes in memory consumption. LMK relies on hand-tuned thresholds to determine when to kill a process, but hitting the threshold shouldn't always imply a kill. For example, on some current high-end Android devices, lowmemorykiller will start to kill oom_score_adj 200 processes once there is less than 112MB in the page cache and less than 112MB of free pages. oom_score_adj 200 is used for processes that are important and visible to the user but not the currently-used foreground app; music playback or camera post-processing for some apps usually runs as adj 200. This threshold means that even if the system would quiesce at 110MB in the page cache and 110MB of free pages, something important to the user may die. This is bad! 2. lowmemorykiller can be way too passive on lower memory devices. Because lowmemorykiller has a shared threshold for the amount of free pages and the size of the page cache before it will kill a process, there is a scenario that we hit all the time that results in low memory devices becoming unusable. Assume the current application and supporting system software need X bytes in the page cache in order to provide reasonable UI performance, and X is larger than the zone_high watermark that stops kswapd. The number of free pages can drop below zone_low and kswapd will start evicting pages from the page cache; however, because the working set is actually of size X, those pages will be paged back in about as quickly as they can be paged out. This manifests as kswapd constantly evicting file pages and the foreground UI workload constantly waiting on page faults. Meanwhile, even if there are very unimportant processes to kill, lowmemorykiller won't do anything to kill them. #2 can be addressed somewhat by separating the limits for number of free pages and the size of the page cache, but then lowmemorykiller would have two sets of arbitrary hand-tuned values and still no knowledge of kswapd/reclaim. It doesn't make sense to do that if we can avoid it. We have plenty of evidence for both of these on real Android devices. I'm bringing up these issues to not only explain the problems that we'd like to solve, but also to provide some evidence that we're serious about fixing lowmemorykiller once and for all. Here's where I'd like to go. First of all, lowmemorykiller should not be in the kernel, and Android should move to per-app mem cgroups and kill unnecessary background tasks when under memory pressure from userspace, not the kernel. Second, Android has good knowledge of what's important to the user and what's not. I'd like the ability to use that information to drive decisions about reclaiming memory, so kswapd can shrink the mem cgroups associated with background tasks before moving on to foreground tasks. As far as I can tell, what I'm suggesting isn't a soft limit or something like that. We don't have specific limits on memory consumption for particular processes, and there's no size we definitely want to get background processes to via reclaim before we start reclaiming from foreground or persistent processes. In practice, I think this looks like a per-memory-cgroup reclaim priority. I have a prototype of this where I've added a new knob called memory.priority from 0 to 10 that serves two purposes: - Skip reclaiming from higher-priority cgroups entirely until the priority from shrink_zone is high enough. - Reduce the number of pages scanned from higher-priority cgroups once they are eligible for reclamation. This would let kswapd reclaim from applications that aren't critical to the user while still occasionally reclaiming from persistent processes (evicting pages that are used very rarely from always-running system processes). This would effectively reduce the size of backgrounded applications without impacting UI performance--a huge improvement over what Android can do today. Third, assuming we can do this kind of prioritized reclaim, I'd like more information available via vmpressure (or similar) about the current state of kswapd in terms of what priority it's trying to reclaim. If lmkd knew that kswapd had moved on to higher-priority cgroups while there were very unimportant processes remaining, lmkd could be much more accurate about when to kill a process. This means lmkd would run only in response to actual kswapd/direct reclaim behavior, not because of arbitrary thresholds. This would unify how to tune Android devices for memory consumption; the knobs in /proc/sys/vm (primarily min_free_kbytes and extra_free_kbytes) would control both kswapd *and* lmkd. I think this would also solve the "too aggressive killing during transient spike" issue. I'm working on an RFC of prioritized reclaim to follow (I hope) sometime next week. I don't have a vmpressure patch prototyped yet, since it depends on what the prioritized reclaim interface looks like. Also, to be perfectly clear, I don't think my current approach is necessarily the right one at all. All I have right now is a minimal patch (against 3.18, hence the delay) to support memory cgroup priorities: the interface makes no sense if you aren't familiar with mm internals, I haven't thought through how this interacts with soft limits, it doesn't make sense with cgroup hierarchies, etc. At this stage, I'm mainly wondering if the broader community thinks prioritized reclaim is a viable direction. Thanks for any feedback you can provide. Tim On Fri, Feb 24, 2017 at 10:42 AM, Rom Lemarchand <romlem@xxxxxxxxxx> wrote: > +surenb > > On Fri, Feb 24, 2017 at 10:38 AM, Tim Murray <timmurray@xxxxxxxxxx> wrote: >> >> Hi all, I've recently been looking at lowmemorykiller, userspace lmkd, and >> memory cgroups on Android. >> >> First of all, no, an Android device will probably not function without a >> kernel or userspace version of lowmemorykiller. Android userspace expects >> that if there are many apps running in the background on a machine and the >> foreground app allocates additional memory, something on the system will >> kill background apps to free up more memory. If this doesn't happen, I >> expect that at the very least you'll see page cache thrashing, and you'll >> probably see the OOM killer run regularly, which has a tendency to cause >> Android userspace to restart. To the best of my knowledge, no device has >> shipped with a userspace lmkd. >> >> Second, yes, the current design and implementation of lowmemorykiller are >> unsatisfactory. I now have some concrete evidence that the design of >> lowmemorykiller is directly responsible for some very negative user-visible >> behaviors (particularly the triggers for when to kill), so I'm currently >> working on an overhaul to the Android memory model that would use mem >> cgroups and userspace lmkd to make smarter decisions about reclaim vs >> killing. Yes, this means that we would move to vmpressure (which will >> require improvements to vmpressure). I can't give a firm ETA for this, as >> it's still in the prototype phase, but the initial results are promising. >> >> On Fri, Feb 24, 2017 at 1:34 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote: >>> >>> On Thu 23-02-17 21:36:00, Martijn Coenen wrote: >>> > On Thu, Feb 23, 2017 at 9:24 PM, John Stultz <john.stultz@xxxxxxxxxx> >>> > wrote: >>> [...] >>> > > This is reportedly because while the mempressure notifiers provide a >>> > > the signal to userspace, the work the deamon then has to do to look >>> > > up >>> > > per process memory usage, in order to figure out who is best to kill >>> > > at that point was too costly and resulted in poor device performance. >>> > >>> > In particular, mempressure requires memory cgroups to function, and we >>> > saw performance regressions due to the accounting done in mem cgroups. >>> > At the time we didn't have enough time left to solve this before the >>> > release, and we reverted back to kernel lmkd. >>> >>> I would be more than interested to hear details. We used to have some >>> visible charge path performance footprint but this should be gone now. >>> >>> [...] >>> > > It would be great however to get a discussion going here on what the >>> > > ulmkd needs from the kernel in order to efficiently determine who >>> > > best >>> > > to kill, and how we might best implement that. >>> > >>> > The two main issues I think we need to address are: >>> > 1) Getting the right granularity of events from the kernel; I once >>> > tried to submit a patch upstream to address this: >>> > https://lkml.org/lkml/2016/2/24/582 >>> >>> Not only that, the implementation of tht vmpressure needs some serious >>> rethinking as well. The current one can hit critical events >>> unexpectedly. The calculation also doesn't consider slab reclaim >>> sensibly. >>> >>> > 2) Find out where exactly the memory cgroup overhead is coming from, >>> > and how to reduce it or work around it to acceptable levels for >>> > Android. This was also on 3.10, and maybe this has long been fixed or >>> > improved in more recent kernel versions. >>> >>> 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") has improved >>> situation a lot as all the charging is lockless since then (3.19). >>> -- >>> Michal Hocko >>> SUSE Labs >> >> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>