On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <guroan@xxxxxxxxx> wrote:
>
> This commit makes several important changes in the lifecycle
> of a non-root kmem_cache, which also affect the lifecycle
> of a memory cgroup.
>
> Currently each charged slab page has a page->mem_cgroup pointer
> to the memory cgroup and holds a reference to it.
> Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> are freed, all other are freed on cgroup release.

No, they are not freed (i.e. destroyed) on offlining, only deactivated.
All memcg kmem_caches are freed/destroyed on the memcg's css_free.

>
> So the current scheme can be illustrated as:
> page->mem_cgroup->kmem_cache.
>
> To implement the slab memory reparenting we need to invert the scheme
> into: page->kmem_cache->mem_cgroup.
>
> Let's make every page to hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches to hold a single
> reference to the memory cgroup.

What about memcg_kmem_get_cache()? That function assumes that by taking
a reference on the memcg, its kmem_caches will stay. I think you need to
take a reference on the kmem_cache in memcg_kmem_get_cache() within the
rcu lock where you get the memcg through css_tryget_online (see the
sketch below, after the quoted changelog).

>
> To make this possible we need to introduce a new refcounter
> for non-root kmem_caches. It's atomic for now, but can be easily
> converted to a percpu counter, had we any performance penalty*.
> The initial value is set to 1, and it's decremented on deactivation,
> so we never shutdown an active cache.
>
> To shutdown non-active empty kmem_caches, let's reuse the
> infrastructure of the RCU-delayed work queue, used previously for
> the deactivation. After the generalization, it's perfectly suited
> for our needs.
>
> Since now we can release a kmem_cache at any moment after the
> deactivation, let's call sysfs_slab_remove() only from the shutdown
> path. It makes deactivation path simpler.
>
> Because we don't set the page->mem_cgroup pointer, we need to change
> the way how memcg-level stats is working for slab pages. We can't use
> mod_lruvec_page_state() helpers anymore, so switch over to
> mod_lruvec_state().
>
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
>
>     time find / -name fname-no-exist
>     echo 2 > /proc/sys/vm/drop_caches
>     repeat several times
>
> Results (I've chosen best results in several runs):
>
>         orig        patched
>
> real    0m0.712s    0m0.690s
> user    0m0.104s    0m0.101s
> sys     0m0.346s    0m0.340s
>
> real    0m0.728s    0m0.723s
> user    0m0.114s    0m0.115s
> sys     0m0.342s    0m0.338s
>
> real    0m0.685s    0m0.767s
> user    0m0.118s    0m0.114s
> sys     0m0.343s    0m0.336s
>
> So it looks like the difference is not noticeable in this test.
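Something roughly like this is what I mean (an untested sketch of the fast
path of memcg_kmem_get_cache(), with the surrounding code paraphrased from
memory, using the kmemcg_cache_get_many()/kmemcg_cache_put_many() helpers
this patch adds to mm/slab.h - they would need to be made visible to
mm/memcontrol.c or wrapped in an equivalent helper):

        memcg = get_mem_cgroup_from_current();  /* css ref via css_tryget_online() */
        kmemcg_id = READ_ONCE(memcg->kmemcg_id);
        if (kmemcg_id < 0)
                goto out;

        memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
        if (likely(memcg_cachep)) {
                /*
                 * Pin the child kmem_cache itself before dropping the
                 * memcg reference, so the cache can't be shut down while
                 * the caller is still allocating from it.
                 */
                kmemcg_cache_get_many(memcg_cachep, 1);
                css_put(&memcg->css);
                return memcg_cachep;
        }

        /* ... existing slow path: schedule memcg cache creation ... */
out:
        css_put(&memcg->css);
        return cachep;

memcg_kmem_put_cache() would then have to drop that reference through
kmemcg_cache_put_many(cachep, 1) instead of css_put() on the memcg.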
>
> Signed-off-by: Roman Gushchin <guro@xxxxxx>
> ---
>  include/linux/slab.h |  2 +-
>  mm/memcontrol.c      |  9 ++++----
>  mm/slab.c            | 15 +-----------
>  mm/slab.h            | 54 +++++++++++++++++++++++++++++++++++++++++---
>  mm/slab_common.c     | 51 +++++++++++++++++------------------
>  mm/slub.c            | 22 +-----------------
>  6 files changed, 79 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 47923c173f30..4daaade76c63 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -152,7 +152,6 @@ int kmem_cache_shrink(struct kmem_cache *);
>
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> -void memcg_destroy_kmem_caches(struct mem_cgroup *);
>
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -641,6 +640,7 @@ struct memcg_cache_params {
>                         struct mem_cgroup *memcg;
>                         struct list_head children_node;
>                         struct list_head kmem_caches_node;
> +                       atomic_long_t refcnt;
>
>                         void (*work_fn)(struct kmem_cache *);
>                         union {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b2c39f187cbb..87c06e342e05 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2719,9 +2719,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>                 cancel_charge(memcg, nr_pages);
>                 return -ENOMEM;
>         }
> -
> -       page->mem_cgroup = memcg;
> -
>         return 0;
>  }
>
> @@ -2744,8 +2741,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
>         memcg = get_mem_cgroup_from_current();
>         if (!mem_cgroup_is_root(memcg)) {
>                 ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
> -               if (!ret)
> +               if (!ret) {
> +                       page->mem_cgroup = memcg;
>                         __SetPageKmemcg(page);
> +               }
>         }
>         css_put(&memcg->css);
>         return ret;
> @@ -3238,7 +3237,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
>                 memcg_offline_kmem(memcg);
>
>         if (memcg->kmem_state == KMEM_ALLOCATED) {
> -               memcg_destroy_kmem_caches(memcg);
> +               WARN_ON(!list_empty(&memcg->kmem_caches));
>                 static_branch_dec(&memcg_kmem_enabled_key);
>                 WARN_ON(page_counter_read(&memcg->kmem));
>         }
> diff --git a/mm/slab.c b/mm/slab.c
> index 14466a73d057..171b21ca617f 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1389,7 +1389,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>                                                                 int nodeid)
>  {
>         struct page *page;
> -       int nr_pages;
>
>         flags |= cachep->allocflags;
>
> @@ -1404,12 +1403,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>                 return NULL;
>         }
>
> -       nr_pages = (1 << cachep->gfporder);
> -       if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
> -               mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
> -       else
> -               mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
> -
>         __SetPageSlab(page);
>         /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
>         if (sk_memalloc_socks() && page_is_pfmemalloc(page))
> @@ -1424,12 +1417,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>  static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
>  {
>         int order = cachep->gfporder;
> -       unsigned long nr_freed = (1 << order);
> -
> -       if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
> -               mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
> -       else
> -               mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
>
>         BUG_ON(!PageSlab(page));
>         __ClearPageSlabPfmemalloc(page);
> @@ -1438,7 +1425,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
>         page->mapping = NULL;
>
>         if (current->reclaim_state)
> -               current->reclaim_state->reclaimed_slab += nr_freed;
> +               current->reclaim_state->reclaimed_slab += 1 << order;
>         memcg_uncharge_slab(page, order, cachep);
>         __free_pages(page, order);
>  }
> diff --git a/mm/slab.h b/mm/slab.h
> index 4a261c97c138..1f49945f5c1d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -173,7 +173,9 @@ void __kmem_cache_release(struct kmem_cache *);
>  int __kmem_cache_shrink(struct kmem_cache *);
>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
>  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
> +void kmemcg_cache_shutdown(struct kmem_cache *s);
>  void slab_kmem_cache_release(struct kmem_cache *);
> +void kmemcg_queue_cache_shutdown(struct kmem_cache *s);
>
>  struct seq_file;
>  struct file;
> @@ -274,19 +276,65 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
>         return s->memcg_params.root_cache;
>  }
>
> +static __always_inline void kmemcg_cache_get_many(struct kmem_cache *s, long n)
> +{
> +       atomic_long_add(n, &s->memcg_params.refcnt);
> +}
> +
> +static __always_inline void kmemcg_cache_put_many(struct kmem_cache *s, long n)
> +{
> +       if (atomic_long_sub_and_test(n, &s->memcg_params.refcnt))
> +               kmemcg_queue_cache_shutdown(s);
> +}
> +
>  static __always_inline int memcg_charge_slab(struct page *page,
>                                              gfp_t gfp, int order,
>                                              struct kmem_cache *s)
>  {
> -       if (is_root_cache(s))
> +       int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> +               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> +       struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
> +       int ret;
> +
> +       if (is_root_cache(s)) {
> +               mod_node_page_state(page_pgdat(page), idx, 1 << order);
>                 return 0;
> -       return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
> +       }
> +
> +       memcg = s->memcg_params.memcg;
> +       ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
> +       if (!ret) {
> +               lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +               mod_lruvec_state(lruvec, idx, 1 << order);
> +
> +               /* transer try_charge() page references to kmem_cache */
> +               kmemcg_cache_get_many(s, 1 << order);
> +               css_put_many(&memcg->css, 1 << order);
> +       }
> +
> +       return 0;
>  }
>
>  static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>                                                 struct kmem_cache *s)
>  {
> -       memcg_kmem_uncharge(page, order);
> +       int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> +               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> +       struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
> +
> +       if (is_root_cache(s)) {
> +               mod_node_page_state(page_pgdat(page), idx, -(1 << order));
> +               return;
> +       }
> +
> +       memcg = s->memcg_params.memcg;
> +       lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +       mod_lruvec_state(lruvec, idx, -(1 << order));
> +       memcg_kmem_uncharge_memcg(page, order, memcg);
> +
> +       kmemcg_cache_put_many(s, 1 << order);
>  }
>
>  extern void slab_init_memcg_params(struct kmem_cache *);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4e5b4292a763..3fdd02979a1c 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -148,6 +148,7 @@ static int init_memcg_params(struct kmem_cache *s,
>         s->memcg_params.root_cache = root_cache;
>         INIT_LIST_HEAD(&s->memcg_params.children_node);
>         INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
> +       atomic_long_set(&s->memcg_params.refcnt, 1);
>         return 0;
>  }
>
> @@ -225,6 +226,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
>         if (is_root_cache(s)) {
>                 list_add(&s->root_caches_node, &slab_root_caches);
>         } else {
> +               css_get(&memcg->css);
>                 s->memcg_params.memcg = memcg;
>                 list_add(&s->memcg_params.children_node,
>                          &s->memcg_params.root_cache->memcg_params.children);
> @@ -240,6 +242,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>         } else {
>                 list_del(&s->memcg_params.children_node);
>                 list_del(&s->memcg_params.kmem_caches_node);
> +               css_put(&s->memcg_params.memcg->css);
>         }
>  }
>  #else
> @@ -703,21 +706,19 @@ static void kmemcg_after_rcu_workfn(struct work_struct *work)
>
>         s->memcg_params.work_fn(s);
>         s->memcg_params.work_fn = NULL;
> +       kmemcg_cache_put_many(s, 1);
>
>         mutex_unlock(&slab_mutex);
>
>         put_online_mems();
>         put_online_cpus();
> -
> -       /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
> -       css_put(&s->memcg_params.memcg->css);
>  }
>
>  /*
>   * We need to grab blocking locks. Bounce to ->work. The
>   * work item shares the space with the RCU head and can't be
> - * initialized eariler.
> -*/
> + * initialized earlier.
> + */
>  static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>  {
>         struct kmem_cache *s = container_of(head, struct kmem_cache,
> @@ -727,6 +728,21 @@ static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>         queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
>  }
>
> +static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
> +{
> +       WARN_ON(shutdown_cache(s));
> +}
> +
> +void kmemcg_queue_cache_shutdown(struct kmem_cache *s)
> +{
> +       if (s->memcg_params.root_cache->memcg_params.dying)
> +               return;
> +
> +       WARN_ON(s->memcg_params.work_fn);
> +       s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
> +       call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> +}
> +
>  static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>         __kmemcg_cache_deactivate_after_rcu(s);
> @@ -739,9 +755,6 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
>         if (s->memcg_params.root_cache->memcg_params.dying)
>                 return;
>
> -       /* pin memcg so that @s doesn't get destroyed in the middle */
> -       css_get(&s->memcg_params.memcg->css);
> -
>         WARN_ON_ONCE(s->memcg_params.work_fn);
>         s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
>         call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> @@ -775,28 +788,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
>         put_online_cpus();
>  }
>
> -void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
> -{
> -       struct kmem_cache *s, *s2;
> -
> -       get_online_cpus();
> -       get_online_mems();
> -
> -       mutex_lock(&slab_mutex);
> -       list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
> -                                memcg_params.kmem_caches_node) {
> -               /*
> -                * The cgroup is about to be freed and therefore has no charges
> -                * left. Hence, all its caches must be empty by now.
> -                */
> -               BUG_ON(shutdown_cache(s));
> -       }
> -       mutex_unlock(&slab_mutex);
> -
> -       put_online_mems();
> -       put_online_cpus();
> -}
> -
>  static int shutdown_memcg_caches(struct kmem_cache *s)
>  {
>         struct memcg_cache_array *arr;
> diff --git a/mm/slub.c b/mm/slub.c
> index 195f61785c7d..a28b3b3abf29 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1692,11 +1692,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>         if (!page)
>                 return NULL;
>
> -       mod_lruvec_page_state(page,
> -               (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> -               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
> -               1 << oo_order(oo));
> -
>         inc_slabs_node(s, page_to_nid(page), page->objects);
>
>         return page;
> @@ -1730,11 +1725,6 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
>                         check_object(s, page, p, SLUB_RED_INACTIVE);
>         }
>
> -       mod_lruvec_page_state(page,
> -               (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> -               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
> -               -pages);
> -
>         __ClearPageSlabPfmemalloc(page);
>         __ClearPageSlab(page);
>
> @@ -4037,18 +4027,8 @@ void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>         /*
>          * Called with all the locks held after a sched RCU grace period.
> -        * Even if @s becomes empty after shrinking, we can't know that @s
> -        * doesn't have allocations already in-flight and thus can't
> -        * destroy @s until the associated memcg is released.
> -        *
> -        * However, let's remove the sysfs files for empty caches here.
> -        * Each cache has a lot of interface files which aren't
> -        * particularly useful for empty draining caches; otherwise, we can
> -        * easily end up with millions of unnecessary sysfs files on
> -        * systems which have a lot of memory and transient cgroups.
>          */
> -       if (!__kmem_cache_shrink(s))
> -               sysfs_slab_remove(s);
> +       __kmem_cache_shrink(s);
>  }
>
>  void __kmemcg_cache_deactivate(struct kmem_cache *s)
> --
> 2.20.1
>