On Mon, Jan 25, 2016 at 11:39:07AM -0500, Johannes Weiner wrote:
> On Sun, Jan 24, 2016 at 07:56:16PM +0300, Vladimir Davydov wrote:
> > Currently, inactive_age is maintained per zone, which results in
> > unexpected file page activations in case memory cgroups are used. For
> > example, if the total number of active pages is big, a memory cgroup
> > might get every refaulted file page activated even if refault distance
> > is much greater than the number of active file pages in the cgroup. This
> > patch fixes this issue by making inactive_age per lruvec.
>
> Argh!!
>
> It's great that you're still interested in this and kept working on
> it. I just regret that I worked on the same stuff the last couple days
> without pinging you before picking it up. Oh well... :-)
>
> However, my patches are sufficiently different that I think it makes
> sense to discuss them both and figure out the best end result. I have
> some comments below and will followup this email with my version.
>
> > The patch is pretty straightforward and self-explaining, but there are
> > two things that should be noted:
> >
> > - workingset_{eviction,activation} need to get lruvec given a page.
> >   On the default hierarchy one can safely access page->mem_cgroup
> >   provided the page is pinned, but on the legacy hierarchy a page can
> >   be migrated from one cgroup to another at any moment, so extra care
> >   must be taken to assure page->mem_cgroup will stay put.
> >
> >   workingset_eviction is passed a locked page, so it is safe to use
> >   page->mem_cgroup in this function. workingset_activation is trickier:
> >   it is called from mark_page_accessed, where the page is not
> >   necessarily locked. To protect it against page->mem_cgroup change, we
> >   move it to __activate_page, which is called by mark_page_accessed
> >   once there's enough pages on percpu pagevec. This function is called
> >   with zone->lru_lock held, which rules out page charge migration.
>
> When a page moves to another cgroup at the same time it's activated,
> there really is no wrong lruvec to age. Both would be correct. The
> locking guarantees a stable answer, but we don't need it here. It's
> enough to take the rcu lock here to ensure page_memcg() isn't freed.

Yeah, that's true, but I think that taking the rcu lock when memcg is
disabled or even compiled out is ugly. Maybe we should introduce helpers
lock/unlock_page_memcg(), which would be a no-op on the unified hierarchy,
while on the legacy hierarchy they would act just like
mem_cgroup_begin/end_page_stat, i.e. we could just rename the latter two.
This would also simplify dropping legacy code once the legacy hierarchy
has passed away.

>
> > - To calculate refault distance correctly even in case a page is
> >   refaulted by a different cgroup, we need to store memcg id in shadow
> >   entry. There's no problem with it on 64-bit, but on 32-bit there's
> >   not much space left in radix tree slot after storing information
> >   about node, zone, and memory cgroup, so we can't just save eviction
> >   counter as is, because it would trim max refault distance making it
> >   unusable.
> >
> >   To overcome this problem, we increase refault distance granularity,
> >   as proposed by Johannes Weiner. We disregard 10 least significant
> >   bits of eviction counter. This reduces refault distance accuracy to
> >   4MB, which is still fine. With the default NODE_SHIFT (3) this leaves
> >   us 9 bits for storing eviction counter, hence maximal refault
> >   distance will be 2GB, which should be enough for 32-bit systems.
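[ Side note: the arithmetic above can be sanity-checked with a quick
userspace back-of-the-envelope program. The ZONES_SHIFT and radix tree
tag width used below are assumed typical values, not something taken
from the patch itself: ]

/* Back-of-the-envelope check of the 32-bit shadow entry bit budget. */
#include <stdio.h>

#define BITS_PER_LONG			32
#define RADIX_TREE_EXCEPTIONAL_SHIFT	2	/* assumed tag bits of an exceptional entry */
#define ZONES_SHIFT			2	/* assumed typical 32-bit config */
#define NODES_SHIFT			3	/* default NODE_SHIFT from the changelog */
#define MEM_CGROUP_ID_SHIFT		16	/* the 16 id bits discussed further down */
#define REFAULT_DISTANCE_GRANULARITY	10	/* low bits dropped on 32-bit */
#define PAGE_SIZE			4096ULL	/* 4K pages assumed */

int main(void)
{
	int eviction_bits = BITS_PER_LONG - RADIX_TREE_EXCEPTIONAL_SHIFT -
			    ZONES_SHIFT - NODES_SHIFT - MEM_CGROUP_ID_SHIFT;
	unsigned long long granularity = (1ULL << REFAULT_DISTANCE_GRANULARITY) * PAGE_SIZE;
	unsigned long long max_distance = granularity << eviction_bits;

	printf("eviction counter bits: %d\n", eviction_bits);
	printf("distance granularity:  %llu MB\n", granularity >> 20);
	printf("max refault distance:  %llu GB\n", max_distance >> 30);
	return 0;
}

With these assumptions it prints 9 bits, 4 MB and 2 GB, matching the
numbers in the changelog.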
>
> If we limit it to 2G it becomes a clear-cut correctness issue once you
> have more memory. Instead, we should continue to stretch out the
> distance with an ever-increasing bucket size. The more memory you
> have, the less important the granularity becomes anyway. With 8G, an
> 8M granularity is still okay, and so forth. And once we get beyond a
> certain point, and it creates problems for people, it should be fair
> enough to recommend upgrading to 64 bit.

That's a great idea. I'm all for it.

>
> > @@ -152,8 +152,72 @@
> >   * refault distance will immediately activate the refaulting page.
> >   */
> >
> > -static void *pack_shadow(unsigned long eviction, struct zone *zone)
> > +#ifdef CONFIG_MEMCG
> > +/*
> > + * On 32-bit there is not much space left in radix tree slot after
> > + * storing information about node, zone, and memory cgroup, so we
> > + * disregard 10 least significant bits of eviction counter. This
> > + * reduces refault distance accuracy to 4MB, which is still fine.
> > + *
> > + * With the default NODE_SHIFT (3) this leaves us 9 bits for storing
> > + * eviction counter, hence maximal refault distance will be 2GB, which
> > + * should be enough for 32-bit systems.
> > + */
> > +#ifdef CONFIG_64BIT
> > +# define REFAULT_DISTANCE_GRANULARITY	0
> > +#else
> > +# define REFAULT_DISTANCE_GRANULARITY	10
> > +#endif
> > +
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return eviction;
> > +
> > +	eviction >>= REFAULT_DISTANCE_GRANULARITY;
> > +	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | mem_cgroup_id(memcg);
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	if (mem_cgroup_disabled()) {
> > +		*memcg = NULL;
> > +		return entry;
> > +	}
> > +
> > +	rcu_read_lock();
> > +	*memcg = mem_cgroup_from_id(entry & MEM_CGROUP_ID_MAX);
> > +	rcu_read_unlock();
> > +
> > +	entry >>= MEM_CGROUP_ID_SHIFT;
> > +	entry <<= REFAULT_DISTANCE_GRANULARITY;
> > +	*mask >>= MEM_CGROUP_ID_SHIFT - REFAULT_DISTANCE_GRANULARITY;
> > +	return entry;
> > +}
> > +#else /* !CONFIG_MEMCG */
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	*memcg = NULL;
> > +	return entry;
> > +}
> > +#endif /* CONFIG_MEMCG */
> > +
> > +static void *pack_shadow(unsigned long eviction, struct zone *zone,
> > +			 struct mem_cgroup *memcg)
> >  {
> > +	eviction = pack_shadow_memcg(eviction, memcg);
> >  	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> >  	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> >  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
>
> For !CONFIG_MEMCG, we can define MEMCG_ID_SHIFT to 0 and pass in a
> cssid of 0. That would save much of the special casing here.

But what about the mem_cgroup_disabled() case? Do we want to waste 16 bits
in this case?
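Coming back to the ever-increasing bucket size idea: below is a rough
userspace model of how the bucket shift could be derived from the amount
of memory instead of being hard-coded, so that the eviction counter always
spans all of RAM. The 9 available bits and all of the names here are
assumptions of this sketch, not something from either patch set.

#include <stdio.h>

#define EVICTION_BITS	9	/* bits left for the counter on 32-bit (assumed) */

/*
 * Pick the smallest bucket shift such that the whole of memory still
 * fits into the number of buckets the eviction counter can express.
 */
static unsigned int bucket_shift(unsigned long long total_pages)
{
	unsigned int shift = 0;

	while ((total_pages >> shift) > (1ULL << EVICTION_BITS))
		shift++;
	return shift;
}

int main(void)
{
	unsigned long long mem_gb[] = { 1, 2, 8, 64 };

	for (int i = 0; i < 4; i++) {
		/* 4K pages assumed */
		unsigned long long pages = mem_gb[i] << (30 - 12);
		unsigned int shift = bucket_shift(pages);

		printf("%3lluG of RAM -> bucket size %6llu KB\n",
		       mem_gb[i], (4096ULL << shift) >> 10);
	}
	return 0;
}

With only 9 bits this comes out somewhat coarser than the 8M-at-8G example
above, but the principle is the same: the bucket size grows with memory,
so the relative error stays bounded instead of the distance being capped
at 2G.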
>
> > @@ -213,10 +282,16 @@ static void unpack_shadow(void *shadow,
> >  void *workingset_eviction(struct address_space *mapping, struct page *page)
> >  {
> >  	struct zone *zone = page_zone(page);
> > +	struct mem_cgroup *memcg = page_memcg(page);
> > +	struct lruvec *lruvec;
> >  	unsigned long eviction;
> >
> > -	eviction = atomic_long_inc_return(&zone->inactive_age);
> > -	return pack_shadow(eviction, zone);
> > +	if (!mem_cgroup_disabled())
> > +		mem_cgroup_get(memcg);
> > +
> > +	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +	eviction = atomic_long_inc_return(&lruvec->inactive_age);
> > +	return pack_shadow(eviction, zone, memcg);
>
> I don't think we need to hold a reference to the memcg here, it should
> be enough to verify whether the cssid is still existent upon refault.
>
> What could theoretically happen then is that the original memcg gets
> deleted, a new one will reuse the same id and then refault the same
> pages. However, there are two things that should make this acceptable:
> 1. a couple pages don't matter in this case, and sharing data between
> cgroups on a large scale already leads to weird accounting artifacts
> in other places; this wouldn't be much different. 2. from a system
> perspective, those pages were in fact recently evicted, so even if we
> activate them by accident in the new cgroup, it wouldn't be entirely
> unreasonable. The workload will shake out what the true active list
> frequency is on its own.
>
> So I think we can just do away with the reference counting for now,
> and reconsider it should the false sharing in this case create new
> problems that are worse than existing consequences of false sharing.

Memory cgroup creation/destruction are rare events, so I agree that we can
close our eyes on this and not clutter the code with mem_cgroup_get/put.

>
> > @@ -230,13 +305,22 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
> >   */
> >  bool workingset_refault(void *shadow)
> >  {
> > -	unsigned long refault_distance;
> > +	unsigned long refault_distance, nr_active;
> >  	struct zone *zone;
> > +	struct mem_cgroup *memcg;
> > +	struct lruvec *lruvec;
> >
> > -	unpack_shadow(shadow, &zone, &refault_distance);
> > +	unpack_shadow(shadow, &zone, &memcg, &refault_distance);
> >  	inc_zone_state(zone, WORKINGSET_REFAULT);
> >
> > -	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> > +	if (!mem_cgroup_disabled()) {
> > +		lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +		nr_active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_FILE);
> > +		mem_cgroup_put(memcg);
> > +	} else
> > +		nr_active = zone_page_state(zone, NR_ACTIVE_FILE);
>
> This is basically get_lru_size(), so I reused that instead.

Yeah. In the previous version I did the same.

From a quick glance, I think that, in general, your patch set looks better.

Thanks,
Vladimir
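P.S. For anyone reading along in the archive, here is a userspace toy of
the css id reuse scenario discussed above (take no reference at eviction,
just look the id up again on refault). The structures and names are
invented for this sketch; the kernel side would do the lookup with
mem_cgroup_from_id() under rcu, as in the quoted hunks.

#include <stdio.h>
#include <string.h>

#define NR_IDS	4

struct memcg {
	char name[16];
	int live;
};

/* Stand-in for the css id -> memcg lookup the kernel does under rcu. */
static struct memcg memcg_by_id[NR_IDS];

static struct memcg *memcg_from_id(int id)
{
	if (id < 0 || id >= NR_IDS || !memcg_by_id[id].live)
		return NULL;
	return &memcg_by_id[id];
}

int main(void)
{
	int shadow_id = 1;	/* id packed into the shadow entry at eviction */
	struct memcg *m;

	strcpy(memcg_by_id[1].name, "A");
	memcg_by_id[1].live = 1;

	m = memcg_from_id(shadow_id);
	printf("refault attributed to %s\n", m ? m->name : "nobody");

	/* cgroup A is deleted and its id is recycled by a new cgroup B */
	memcg_by_id[1].live = 0;
	strcpy(memcg_by_id[1].name, "B");
	memcg_by_id[1].live = 1;

	/* A few of A's old evictions now count against B. */
	m = memcg_from_id(shadow_id);
	printf("refault attributed to %s\n", m ? m->name : "nobody");
	return 0;
}

The second lookup attributing A's old refaults to B is exactly the kind of
false sharing the discussion above considers acceptable.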