Re: Making direct reclaim fail when thrashing

Johannes Weiner <hannes@xxxxxxxxxxx> · Fri, 27 Jul 2018 16:22:36 -0400

On Fri, Jul 27, 2018 at 11:21:43AM -0500, Daniel Drake wrote:
> Split from the thread
>   [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
> where we were discussing if/how to make the direct reclaim codepath
> fail if we're excessively thrashing, so that the OOM killer might
> step in. This is potentially desirable when the thrashing is so bad
> that the UI stops responding, causing the user to pull the plug.
> 
> On Tue, Jul 17, 2018 at 7:23 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > mm/workingset.c allows for tracking when an actual page got evicted.
> > workingset_refault tells us whether a given filemap fault is a recent
> > refault and activates the page if that is the case. So what you need is
> > to note how many refaulted pages we have on the active LRU list. If that
> > is a large part of the list and if the inactive list is really small
> > then we know we are thrashing. This all sounds much easier than it will
> > eventually turn out to be of course but I didn't really get to play with
> > this much.

I've mentioned it in the other thread, but whether refaults are a
performance/latency problem depends 99% on your available IO capacity
and the IO patterns. On a highly contended IO device, refaults of a
single unfortunately located page can lead to multi-second stalls. On
an idle SSD, thousands of refaults might not be noticeable to the user.

Without measuring how much time these events take out of your day, you
can't really tell if they're a problem or not. The event rate or the
proportion between pages and refaults doesn't carry that signal.
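
As a rough illustration of measuring time rather than counting events,
here is a minimal sketch that turns two samples of a cumulative stall
counter into a fraction of wall time. It assumes the text format that
mainline kernels expose in /proc/pressure/memory ("some"/"full" lines
with a "total" field in microseconds); the interface in the patch
series under discussion may differ. The helper names and the sample
values are made up for the example.

```python
def parse_pressure(text):
    """Parse PSI-style text into {kind: {field: float}}.

    Assumed input format (one line per kind):
        some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def stall_fraction(before, after, interval_s, kind="some"):
    """Fraction of wall time spent stalled between two samples.

    Assumes "total" is cumulative stall time in microseconds.
    """
    delta_us = after[kind]["total"] - before[kind]["total"]
    return delta_us / (interval_s * 1_000_000)


# Hypothetical samples taken 10 seconds apart (values are invented).
sample_t0 = parse_pressure(
    "some avg10=0.00 avg60=0.00 avg300=0.00 total=1000000\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=500000\n"
)
sample_t1 = parse_pressure(
    "some avg10=0.00 avg60=0.00 avg300=0.00 total=3500000\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=1500000\n"
)

for kind in ("some", "full"):
    frac = stall_fraction(sample_t0, sample_t1, 10.0, kind)
    print(f"{kind}: {frac:.0%} of the interval spent stalled")
```

On a live system the two samples would come from reading the file
twice. The point is that the result is a share of wall time, which a
refault count alone cannot give you.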