Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing

Johannes Weiner <hannes@xxxxxxxxxxx> · Tue, 7 Jun 2016 12:23:11 -0400

Hi Tim,

On Mon, Jun 06, 2016 at 04:50:23PM -0700, Tim Chen wrote:
> On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > To tell inactive from active refaults, a page flag is introduced that
> > marks pages that have been on the active list in their lifetime. This
> > flag is remembered in the shadow page entry on reclaim, and restored
> > when the page refaults. It is also set on anonymous pages during
> > swapin. When a page with that flag set is added to the LRU, the LRU
> > balance is adjusted for the IO cost of reclaiming the thrashing list.
> 
> Johannes,
> 
> It seems like you are saying that the shadow entry is also present
> for anonymous pages that are swapped out.  But once a page is swapped
> out, its entry is removed from the radix tree and we won't be able
> to store the shadow page entry as we do for file-mapped pages
> in __remove_mapping.  Or are you thinking of modifying
> the current code to keep the radix tree entry? I may be missing something,
> so I would appreciate it if you could clarify.

Sorry if this was ambiguously phrased.

You are correct, there are no shadow entries for anonymous evictions,
only page cache evictions. All swap-ins are treated as "eligible"
refaults and push back against cache, whereas cache only pushes
against anon if the cache workingset is determined to fit into memory.
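
As a rough illustration of the asymmetry described above, here is a toy
model in Python. It is not the kernel code: the names LruCost and
note_fault_in are made up for this sketch, and the was_active argument
stands in for the flag restored from the shadow entry, which is a
simplification of how the cache workingset is actually detected.

```python
from dataclasses import dataclass


@dataclass
class LruCost:
    """Hypothetical per-list IO cost counters (not a kernel structure)."""
    anon: int = 0  # cost charged to the anon LRU; shifts pressure to cache
    file: int = 0  # cost charged to the file LRU; shifts pressure to anon


def note_fault_in(cost: LruCost, is_anon: bool, was_active: bool) -> LruCost:
    """Charge one unit of IO cost if this fault-in is an eligible refault.

    Assumption for this sketch: `was_active` models the flag restored from
    the shadow entry of an evicted page cache page.
    """
    if is_anon:
        # No shadow entries exist for anon evictions, so every swap-in
        # is treated as eligible, regardless of was_active.
        cost.anon += 1
    elif was_active:
        # Cache refaults only count when the page was part of the
        # cache workingset before it was evicted.
        cost.file += 1
    return cost
```

In this model a swap-in always bumps the anon counter, while a cache
refault of a page that was never active leaves both counters untouched.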

That implies a fixed hierarchy where the VM always tries to fit the
anonymous workingset into memory first and the page cache second. If
the anonymous set is bigger than memory, the algorithm won't stop
counting IO cost from anonymous refaults and pressuring page cache.

[ Although you can set the effective cost of these refaults to 0
  (swappiness = 200) and reduce effective cache to a minimum -
  possibly to a level where LRU rotations consume most of it.
  But yeah. ]
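
To make the swappiness remark concrete, here is a small Python sketch.
The only point taken from the text above is that swappiness = 200 makes
anonymous refaults cost nothing. The linear weighting, the 0..200 range
check and the name effective_costs are assumptions made for illustration,
not the formula used in the patches.

```python
MAX_SWAPPINESS = 200  # value at which anon refaults are treated as free


def effective_costs(anon_cost: int, file_cost: int, swappiness: int):
    """Weight raw refault costs by swappiness (assumed linear weighting).

    Returns (effective_anon_cost, effective_file_cost).
    """
    if not 0 <= swappiness <= MAX_SWAPPINESS:
        raise ValueError("swappiness must be within 0..200")
    anon_weight = MAX_SWAPPINESS - swappiness  # 0 when swappiness is 200
    file_weight = swappiness
    return anon_cost * anon_weight, file_cost * file_weight
```

With this weighting, effective_costs(50, 10, 200) gives an anon cost of
0, so swap-ins no longer push back against cache in the model, while
swappiness = 100 weights both kinds of refault equally.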

So the current code works well when we assume that cache workingsets
might exceed memory, but anonymous workingsets don't.

For SSDs and non-DIMM pmem devices this assumption is fine, because
nobody wants half their frequent anonymous memory accesses to be major
faults. Anonymous workingsets will continue to target RAM size there.

Secondary memory types, which userspace can continue to map directly
after "swap out", are a different story. That might need workingset
estimation for anonymous pages. But it would have to build on top of
this series here. These patches are about eliminating or mitigating IO
by swapping idle or colder anon pages when the cache is thrashing.
