On Wed, May 24, 2017 at 01:46:10PM -0400, Johannes Weiner wrote:
> Hi Josef,
>
> On Tue, May 23, 2017 at 10:23:23AM -0400, Josef Bacik wrote:
> > @@ -308,7 +317,8 @@ EXPORT_SYMBOL(unregister_shrinker);
> >  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  				    struct shrinker *shrinker,
> >  				    unsigned long nr_scanned,
> > -				    unsigned long nr_eligible)
> > +				    unsigned long nr_eligible,
> > +				    unsigned long *slab_scanned)
>
> Once you pass in pool size ratios here, nr_scanned and nr_eligible
> become confusing. Can you update the names?
>

Yeah, I kept changing them and eventually decided my names were equally
shitty, so I just left them.  I'll change them to something useful.

> > @@ -2292,6 +2310,15 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >  			scan = 0;
> >  		}
> >  		break;
> > +	case SCAN_INACTIVE:
> > +		if (file && !is_active_lru(lru)) {
> > +			if (scan && size > sc->nr_to_reclaim)
> > +				scan = sc->nr_to_reclaim;
>
> Why is the scan target different than with regular cache reclaim? I'd
> expect that we only need to zero the active list sizes here, not that
> we'd also need any further updates to 'scan'.
>

Huh, I actually screwed this up slightly from what I wanted.  Since
scan = size >> sc->priority we'd sometimes end up with scan <
nr_to_reclaim, but since we're only scanning the inactive list we
really want to try as hard as possible to reclaim what we need from
it.  What I should have done is something like

	scan = max(sc->nr_to_reclaim, scan);

instead.  I'll fix that.
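IOW, something like this (completely untested, and I'm guessing at the
rest of the case body since the hunk above is cut off there):

	case SCAN_INACTIVE:
		if (file && !is_active_lru(lru)) {
			/*
			 * scan = size >> sc->priority can come out below
			 * nr_to_reclaim; since only the inactive file list
			 * gets scanned in this mode, try as hard as
			 * possible to get what we need from it.
			 */
			if (scan)
				scan = max(sc->nr_to_reclaim, scan);
		} else {
			scan = 0;
		}
		break;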
> > @@ -2509,8 +2536,62 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  {
> >  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >  	unsigned long nr_reclaimed, nr_scanned;
> > +	unsigned long nr_reclaim, nr_slab, total_high_wmark = 0, nr_inactive;
> > +	int z;
> >  	bool reclaimable = false;
> > +	bool skip_slab = false;
> > +
> > +	nr_slab = sum_zone_node_page_state(pgdat->node_id,
> > +					   NR_SLAB_RECLAIMABLE);
> > +	nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > +	nr_reclaim = pgdat_reclaimable_pages(pgdat);
> > +
> > +	for (z = 0; z < MAX_NR_ZONES; z++) {
> > +		struct zone *zone = &pgdat->node_zones[z];
> > +		if (!managed_zone(zone))
> > +			continue;
> > +		total_high_wmark += high_wmark_pages(zone);
> > +	}
>
> This function is used for memcg target reclaim, in which case you're
> only looking at a subset of the pgdats and zones. Any pgdat or zone
> state read here would be scoped incorrectly; and the ratios on the
> node level are independent from ratios on the cgroup level and can
> diverge heavily from each other.
>
> These size inquiries to drive the balancing will have to be made
> inside the memcg iteration loop further down with per-cgroup numbers.
>

Ok, so I suppose I need to look at the actual lru list sizes for these
numbers instead, for !global_reclaim(sc)?

> > +	/*
> > +	 * If we don't have a lot of inactive or slab pages then there's no
> > +	 * point in trying to free them exclusively, do the normal scan stuff.
> > +	 */
> > +	if (nr_inactive < total_high_wmark && nr_slab < total_high_wmark)
> > +		sc->inactive_only = 0;
>
> Yes, we need something like this, to know when to fall back to full
> reclaim. Cgroups don't have high watermarks, but I guess some magic
> number for "too few pages" could do the trick.
>
> > +	/*
> > +	 * We don't have historical information, we can't make good decisions
> > +	 * about ratio's and where we should put pressure, so just apply
> > +	 * pressure based on overall consumption ratios.
> > +	 */
> > +	if (!sc->slab_diff && !sc->inactive_diff)
> > +		sc->inactive_only = 0;
>
> This one I'm not sure about. If we have enough slabs and inactive
> pages why shouldn't we go for them first anyway - regardless of
> whether they have grown since the last reclaim invocation?
>

Because we use them for the ratio of where to put pressure, but I
suppose I could just drop this and do

	foo = max(sc->slab_diff, 1);
	bar = max(sc->inactive_diff, 1);

so if we have no historical information we just scan both equally.
I'll do that instead.

> > @@ -2543,10 +2626,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> >  			node_lru_pages += lru_pages;
> >
> > -			if (memcg)
> > -				shrink_slab(sc->gfp_mask, pgdat->node_id,
> > -					    memcg, sc->nr_scanned - scanned,
> > -					    lru_pages);
> > +			/*
> > +			 * We don't want to put a lot of pressure on all of the
> > +			 * slabs if a memcg is mostly full, so use the ratio of
> > +			 * the lru size to the total reclaimable space on the
> > +			 * system.  If we have sc->inactive_only set then we
> > +			 * want to use the ratio of the difference between the
> > +			 * two since the last kswapd run so we apply pressure to
> > +			 * the consumer appropriately.
> > +			 */
> > +			if (memcg && !skip_slab) {
> > +				unsigned long numerator = lru_pages;
> > +				unsigned long denominator = nr_reclaim;
>
> I don't quite follow this.
>
> It calculates the share of this memcg's pages on the node, which is
> the ratio we should apply to the global slab pool to have equivalent
> pressure. However, it's being applied to the *memcg's* share of slab
> pages. This means that the smaller the memcg relative to the node, the
> less of its tiny share of slab objects we reclaim.
>
> We're not translating from fraction to total, we're translating from
> fraction to fraction. Shouldn't the ratio be always 1:1?
>
> For example, if there is only one cgroup on the node, the ratio would
> be 1 share of LRU pages and 1 share of slab pages. But if there are
> two cgroups, we still scan one share of each cgroup's LRU pages but
> only half a share of each cgroup's slab pages. That doesn't add up.
>
> Am I missing something?
>

We hashed this out offline, but basically we concluded to add a
memcg-specific reclaimable-slab counter so we can make these ratios
consistent with the global ones.

> > +				if (sc->inactive_only) {
> > +					numerator = sc->slab_diff;
> > +					denominator = sc->inactive_diff;
> > +				}
>
> Shouldn't these diffs already be reflected in the pool sizes? If we
> scan pools proportional to their size, we also go more aggressively
> for the one that grows faster relative to the other one, right?
>

Sure, unless the aggressive growth is from a different cgroup; we want
to apply proportional pressure everywhere.  I suppose that should only
be done in the global reclaim case, though.
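With the per-memcg counter, I'm thinking the memcg loop ends up looking
something like this (rough sketch, completely untested;
memcg_slab_reclaimable() is just a stand-in name for the new counter,
and the size-based ratio is my guess at what "consistent with the
global ratios" works out to):

		if (memcg && !skip_slab) {
			unsigned long numerator, denominator;

			if (global_reclaim(sc) && sc->inactive_only) {
				/*
				 * Growth-based pressure, global reclaim
				 * only since that's where kswapd samples
				 * the diffs; max(..., 1) falls back to
				 * equal pressure when there's no history.
				 */
				numerator = max(sc->slab_diff, 1UL);
				denominator = max(sc->inactive_diff, 1UL);
			} else {
				/*
				 * Size-based pressure from this memcg's
				 * own numbers, so the fraction matches
				 * what we'd do globally.
				 */
				numerator = memcg_slab_reclaimable(memcg);
				denominator = numerator + lru_pages;
			}
			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
				    numerator, denominator);
		}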
> I guess this *could* be more adaptive to fluctuations, but I wonder if
> that's an optimization that could be split out into a separate patch,
> to make it easier to review on its own merit. Start with a simple size
> based balancing in the first patch, add improved adaptiveness after.
>
> As mentioned above, this function is used not just by kswapd but also
> by direct reclaim, which doesn't initialize these fields and so always
> passes 0:0. We should be able to retain sensible balancing for them as
> well, but it would require moving the diff sampling.
>
> To make it work for cgroup reclaim, it would have to look at the
> growths not on the node level, but on the lruvec level in
> get_scan_count() or thereabouts.
>
> Anyway, I think it might be less confusing to nail down the size-based
> pressure balancing for slab caches first, and then do the recent diff
> balancing on top of it as an optimization.

Yeah, I had it all separate, but it got kind of weird and it was hard
to tell which part was needed where.  Now that I've taken a step back
I see where I can split it up, so I'll fix these things and split the
patches up.  Thanks!

Josef