Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2

Michal Hocko <mhocko@xxxxxxxxxx> · Tue, 17 Jul 2018 14:23:27 +0200

On Tue 17-07-18 07:13:52, Daniel Drake wrote:
> On Tue, Jul 17, 2018 at 6:25 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > Yes this is really unfortunate. One thing that could help would be to
> > consider a thrashing level during the reclaim (get_scan_count) to simply
> > forget about LRUs which are constantly refaulting pages back. We already
> > have the infrastructure for that. We just need to plumb it in.
> 
> Can you go into a bit more detail about that infrastructure and how we
> might detect which pages are being constantly refaulted? I'm
> interested in spending a few hours on this topic to see if I can come
> up with anything.

mm/workingset.c allows for tracking when an actual page got evicted.
workingset_refault tells us whether a given filemap fault is a recent
refault and activates the page if that is the case. So what you need is
to note how many refaulted pages we have on the active LRU list. If that
is a large part of the list and the inactive list is really small,
then we know we are thrashing. This all sounds much easier than it will
eventually turn out to be, of course, but I didn't really get to play with
this much.
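
To make the check above a bit more concrete, here is a rough userspace
sketch in C. Everything in it is an assumption for illustration only:
struct lru_sample, its fields, lru_is_thrashing() and both thresholds
(more than half of the active list refaulted, inactive list below one
eighth of the active list) are hypothetical and are not existing kernel
code. In particular, the count of refaulted pages sitting on the active
list would have to be collected somehow, which is the part that needs
plumbing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-LRU snapshot; these fields are not kernel structures. */
struct lru_sample {
	unsigned long nr_active;           /* pages on the active list */
	unsigned long nr_inactive;         /* pages on the inactive list */
	unsigned long nr_active_refaulted; /* active pages activated by a refault */
};

/*
 * Return true when the sample looks like thrashing: a large part of the
 * active list consists of refaulted pages and the inactive list is very
 * small. Both thresholds are arbitrary placeholders.
 */
static bool lru_is_thrashing(const struct lru_sample *s)
{
	/* Refaulted active pages are a subset of the active list. */
	assert(s->nr_active_refaulted <= s->nr_active);

	if (s->nr_active == 0)
		return false;

	/* "large part of the list": assumed to mean more than half */
	bool mostly_refaulted = s->nr_active_refaulted > s->nr_active / 2;

	/* "really small": assumed to mean under 1/8 of the active list */
	bool tiny_inactive = s->nr_inactive < s->nr_active / 8;

	return mostly_refaulted && tiny_inactive;
}
```

A caller in the spirit of get_scan_count could use such a predicate to
skip an LRU that keeps refaulting, but where exactly that decision belongs
is left open here.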

HTH even though it is not really thought through well.
-- 
Michal Hocko
SUSE Labs