On Mon, Jan 25, 2016 at 11:39:07AM -0500, Johannes Weiner wrote:
> On Sun, Jan 24, 2016 at 07:56:16PM +0300, Vladimir Davydov wrote:
> > Currently, inactive_age is maintained per zone, which results in
> > unexpected file page activations in case memory cgroups are used. For
> > example, if the total number of active pages is big, a memory cgroup
> > might get every refaulted file page activated even if refault distance
> > is much greater than the number of active file pages in the cgroup. This
> > patch fixes this issue by making inactive_age per lruvec.
>
> Argh!!
>
> It's great that you're still interested in this and kept working on
> it. I just regret that I worked on the same stuff the last couple days
> without pinging you before picking it up. Oh well... :-)
>
> However, my patches are sufficiently different that I think it makes
> sense to discuss them both and figure out the best end result. I have
> some comments below and will followup this email with my version.
>
> > The patch is pretty straightforward and self-explaining, but there are
> > two things that should be noted:
> >
> > - workingset_{eviction,activation} need to get lruvec given a page.
> >   On the default hierarchy one can safely access page->mem_cgroup
> >   provided the page is pinned, but on the legacy hierarchy a page can
> >   be migrated from one cgroup to another at any moment, so extra care
> >   must be taken to assure page->mem_cgroup will stay put.
> >
> >   workingset_eviction is passed a locked page, so it is safe to use
> >   page->mem_cgroup in this function. workingset_activation is trickier:
> >   it is called from mark_page_accessed, where the page is not
> >   necessarily locked. To protect it against page->mem_cgroup change, we
> >   move it to __activate_page, which is called by mark_page_accessed
> >   once there's enough pages on percpu pagevec. This function is called
> >   with zone->lru_lock held, which rules out page charge migration.
>
> When a page moves to another cgroup at the same time it's activated,
> there really is no wrong lruvec to age. Both would be correct. The
> locking guarantees a stable answer, but we don't need it here. It's
> enough to take the rcu lock here to ensure page_memcg() isn't freed.

Yeah, that's true, but I think that taking the rcu lock when memcg is
disabled or even compiled out is ugly. Maybe we should introduce helpers
lock/unlock_page_memcg(), which would be a no-op on the unified hierarchy,
while on the legacy hierarchy they would act just like
mem_cgroup_begin/end_page_stat, i.e. we could just rename the latter two.
This would also simplify dropping legacy code once the legacy hierarchy
has passed away.

>
> > - To calculate refault distance correctly even in case a page is
> >   refaulted by a different cgroup, we need to store memcg id in shadow
> >   entry. There's no problem with it on 64-bit, but on 32-bit there's
> >   not much space left in radix tree slot after storing information
> >   about node, zone, and memory cgroup, so we can't just save eviction
> >   counter as is, because it would trim max refault distance making it
> >   unusable.
> >
> >   To overcome this problem, we increase refault distance granularity,
> >   as proposed by Johannes Weiner. We disregard 10 least significant
> >   bits of eviction counter. This reduces refault distance accuracy to
> >   4MB, which is still fine. With the default NODE_SHIFT (3) this leaves
> >   us 9 bits for storing eviction counter, hence maximal refault
> >   distance will be 2GB, which should be enough for 32-bit systems.
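[ Side note: the arithmetic above can be sanity-checked with a quick
userspace back-of-the-envelope program. The ZONES_SHIFT and radix tree
tag width used below are assumed typical values, not something taken
from the patch itself: ]

/* Back-of-the-envelope check of the 32-bit shadow entry bit budget. */
#include <stdio.h>

#define BITS_PER_LONG			32
#define RADIX_TREE_EXCEPTIONAL_SHIFT	2	/* assumed tag bits of an exceptional entry */
#define ZONES_SHIFT			2	/* assumed typical 32-bit config */
#define NODES_SHIFT			3	/* default NODE_SHIFT from the changelog */
#define MEM_CGROUP_ID_SHIFT		16	/* the 16 id bits discussed further down */
#define REFAULT_DISTANCE_GRANULARITY	10	/* low bits dropped on 32-bit */
#define PAGE_SIZE			4096ULL	/* 4K pages assumed */

int main(void)
{
	int eviction_bits = BITS_PER_LONG - RADIX_TREE_EXCEPTIONAL_SHIFT -
			    ZONES_SHIFT - NODES_SHIFT - MEM_CGROUP_ID_SHIFT;
	unsigned long long granularity = (1ULL << REFAULT_DISTANCE_GRANULARITY) * PAGE_SIZE;
	unsigned long long max_distance = granularity << eviction_bits;

	printf("eviction counter bits: %d\n", eviction_bits);
	printf("distance granularity:  %llu MB\n", granularity >> 20);
	printf("max refault distance:  %llu GB\n", max_distance >> 30);
	return 0;
}

With these assumptions it prints 9 bits, 4 MB and 2 GB, matching the
numbers in the changelog.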
>
> If we limit it to 2G it becomes a clear-cut correctness issue once you
> have more memory. Instead, we should continue to stretch out the
> distance with an ever-increasing bucket size. The more memory you
> have, the less important the granularity becomes anyway. With 8G, an
> 8M granularity is still okay, and so forth. And once we get beyond a
> certain point, and it creates problems for people, it should be fair
> enough to recommend upgrading to 64 bit.

That's a great idea. I'm all for it.

>
> > @@ -152,8 +152,72 @@
> >   * refault distance will immediately activate the refaulting page.
> >   */
> >
> > -static void *pack_shadow(unsigned long eviction, struct zone *zone)
> > +#ifdef CONFIG_MEMCG
> > +/*
> > + * On 32-bit there is not much space left in radix tree slot after
> > + * storing information about node, zone, and memory cgroup, so we
> > + * disregard 10 least significant bits of eviction counter. This
> > + * reduces refault distance accuracy to 4MB, which is still fine.
> > + *
> > + * With the default NODE_SHIFT (3) this leaves us 9 bits for storing
> > + * eviction counter, hence maximal refault distance will be 2GB, which
> > + * should be enough for 32-bit systems.
> > + */
> > +#ifdef CONFIG_64BIT
> > +# define REFAULT_DISTANCE_GRANULARITY	0
> > +#else
> > +# define REFAULT_DISTANCE_GRANULARITY	10
> > +#endif
> > +
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return eviction;
> > +
> > +	eviction >>= REFAULT_DISTANCE_GRANULARITY;
> > +	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | mem_cgroup_id(memcg);
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	if (mem_cgroup_disabled()) {
> > +		*memcg = NULL;
> > +		return entry;
> > +	}
> > +
> > +	rcu_read_lock();
> > +	*memcg = mem_cgroup_from_id(entry & MEM_CGROUP_ID_MAX);
> > +	rcu_read_unlock();
> > +
> > +	entry >>= MEM_CGROUP_ID_SHIFT;
> > +	entry <<= REFAULT_DISTANCE_GRANULARITY;
> > +	*mask >>= MEM_CGROUP_ID_SHIFT - REFAULT_DISTANCE_GRANULARITY;
> > +	return entry;
> > +}
> > +#else /* !CONFIG_MEMCG */
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	*memcg = NULL;
> > +	return entry;
> > +}
> > +#endif /* CONFIG_MEMCG */
> > +
> > +static void *pack_shadow(unsigned long eviction, struct zone *zone,
> > +			 struct mem_cgroup *memcg)
> >  {
> > +	eviction = pack_shadow_memcg(eviction, memcg);
> >  	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> >  	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> >  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
>
> For !CONFIG_MEMCG, we can define MEMCG_ID_SHIFT to 0 and pass in a
> cssid of 0. That would save much of the special casing here.

But what about the mem_cgroup_disabled() case? Do we want to waste 16 bits
in this case?
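Coming back to the ever-increasing bucket size idea: below is a rough
userspace model of how the bucket shift could be derived from the amount
of memory instead of being hard-coded, so that the eviction counter always
spans all of RAM. The 9 available bits and all of the names here are
assumptions of this sketch, not something from either patch set.

#include <stdio.h>

#define EVICTION_BITS	9	/* bits left for the counter on 32-bit (assumed) */

/*
 * Pick the smallest bucket shift such that the whole of memory still
 * fits into the number of buckets the eviction counter can express.
 */
static unsigned int bucket_shift(unsigned long long total_pages)
{
	unsigned int shift = 0;

	while ((total_pages >> shift) > (1ULL << EVICTION_BITS))
		shift++;
	return shift;
}

int main(void)
{
	unsigned long long mem_gb[] = { 1, 2, 8, 64 };

	for (int i = 0; i < 4; i++) {
		/* 4K pages assumed */
		unsigned long long pages = mem_gb[i] << (30 - 12);
		unsigned int shift = bucket_shift(pages);

		printf("%3lluG of RAM -> bucket size %6llu KB\n",
		       mem_gb[i], (4096ULL << shift) >> 10);
	}
	return 0;
}

With only 9 bits this comes out somewhat coarser than the 8M-at-8G example
above, but the principle is the same: the bucket size grows with memory,
so the relative error stays bounded instead of the distance being capped
at 2G.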
>
> > @@ -213,10 +282,16 @@ static void unpack_shadow(void *shadow,
> >  void *workingset_eviction(struct address_space *mapping, struct page *page)
> >  {
> >  	struct zone *zone = page_zone(page);
> > +	struct mem_cgroup *memcg = page_memcg(page);
> > +	struct lruvec *lruvec;
> >  	unsigned long eviction;
> >
> > -	eviction = atomic_long_inc_return(&zone->inactive_age);
> > -	return pack_shadow(eviction, zone);
> > +	if (!mem_cgroup_disabled())
> > +		mem_cgroup_get(memcg);
> > +
> > +	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +	eviction = atomic_long_inc_return(&lruvec->inactive_age);
> > +	return pack_shadow(eviction, zone, memcg);
>
> I don't think we need to hold a reference to the memcg here, it should
> be enough to verify whether the cssid is still existent upon refault.
>
> What could theoretically happen then is that the original memcg gets
> deleted, a new one will reuse the same id and then refault the same
> pages. However, there are two things that should make this acceptable:
> 1. a couple pages don't matter in this case, and sharing data between
> cgroups on a large scale already leads to weird accounting artifacts
> in other places; this wouldn't be much different. 2. from a system
> perspective, those pages were in fact recently evicted, so even if we
> activate them by accident in the new cgroup, it wouldn't be entirely
> unreasonable. The workload will shake out what the true active list
> frequency is on its own.
>
> So I think we can just do away with the reference counting for now,
> and reconsider it should the false sharing in this case create new
> problems that are worse than existing consequences of false sharing.

Memory cgroup creation/destruction are rare events, so I agree that we can
close our eyes on this and not clutter the code with mem_cgroup_get/put.

>
> > @@ -230,13 +305,22 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
> >   */
> >  bool workingset_refault(void *shadow)
> >  {
> > -	unsigned long refault_distance;
> > +	unsigned long refault_distance, nr_active;
> >  	struct zone *zone;
> > +	struct mem_cgroup *memcg;
> > +	struct lruvec *lruvec;
> >
> > -	unpack_shadow(shadow, &zone, &refault_distance);
> > +	unpack_shadow(shadow, &zone, &memcg, &refault_distance);
> >  	inc_zone_state(zone, WORKINGSET_REFAULT);
> >
> > -	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> > +	if (!mem_cgroup_disabled()) {
> > +		lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +		nr_active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_FILE);
> > +		mem_cgroup_put(memcg);
> > +	} else
> > +		nr_active = zone_page_state(zone, NR_ACTIVE_FILE);
>
> This is basically get_lru_size(), so I reused that instead.

Yeah. In the previous version I did the same.

From a quick glance, I think that, in general, your patch set looks better.

Thanks,
Vladimir
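P.S. For anyone reading along in the archive, here is a userspace toy of
the css id reuse scenario discussed above (take no reference at eviction,
just look the id up again on refault). The structures and names are
invented for this sketch; the kernel side would do the lookup with
mem_cgroup_from_id() under rcu, as in the quoted hunks.

#include <stdio.h>
#include <string.h>

#define NR_IDS	4

struct memcg {
	char name[16];
	int live;
};

/* Stand-in for the css id -> memcg lookup the kernel does under rcu. */
static struct memcg memcg_by_id[NR_IDS];

static struct memcg *memcg_from_id(int id)
{
	if (id < 0 || id >= NR_IDS || !memcg_by_id[id].live)
		return NULL;
	return &memcg_by_id[id];
}

int main(void)
{
	int shadow_id = 1;	/* id packed into the shadow entry at eviction */
	struct memcg *m;

	strcpy(memcg_by_id[1].name, "A");
	memcg_by_id[1].live = 1;

	m = memcg_from_id(shadow_id);
	printf("refault attributed to %s\n", m ? m->name : "nobody");

	/* cgroup A is deleted and its id is recycled by a new cgroup B */
	memcg_by_id[1].live = 0;
	strcpy(memcg_by_id[1].name, "B");
	memcg_by_id[1].live = 1;

	/* A few of A's old evictions now count against B. */
	m = memcg_from_id(shadow_id);
	printf("refault attributed to %s\n", m ? m->name : "nobody");
	return 0;
}

The second lookup attributing A's old refaults to B is exactly the kind of
false sharing the discussion above considers acceptable.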