On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@xxxxxx> wrote:
>
> The obj_cgroup API provides the ability to account sub-page sized
> kernel objects, which potentially outlive the original memory cgroup.
>
> The top-level API consists of the following functions:
>   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
>   void obj_cgroup_get(struct obj_cgroup *objcg);
>   void obj_cgroup_put(struct obj_cgroup *objcg);
>
>   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
>   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
>
>   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
>   struct obj_cgroup *get_obj_cgroup_from_current(void);
>
> An object cgroup is basically a pointer to a memory cgroup with a
> per-cpu reference counter. It substitutes a memory cgroup in places
> where it's necessary to charge a custom number of bytes instead of
> pages.
>
> All charged memory, rounded down to pages, is charged to the
> corresponding memory cgroup using __memcg_kmem_charge().
>
> It implements reparenting: on memcg offlining the object cgroup is
> reattached to the parent memory cgroup. Each online memory cgroup has
> an associated active object cgroup to handle new allocations, plus a
> list of all attached object cgroups. On offlining of a cgroup this
> list is reparented, and for each object cgroup in the list the memcg
> pointer is swapped to the parent memory cgroup. This prevents
> long-lived objects from pinning the original memory cgroup in memory.
>
> The implementation is based on byte-sized per-cpu stocks. A sub-page
> sized leftover is stored in an atomic field, which is part of the
> obj_cgroup object, so on cgroup offlining the leftover is
> automatically reparented.
>
> memcg->objcg is RCU protected.
> objcg->memcg is a raw pointer, which always points at a memory
> cgroup, but can be atomically swapped to the parent memory cgroup. So
> the caller

What type of caller? The allocator?

> must ensure the lifetime of the cgroup, e.g. grab
> rcu_read_lock or css_set_lock.
>
> Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Signed-off-by: Roman Gushchin <guro@xxxxxx>
> ---
>  include/linux/memcontrol.h |  51 +++++++
>  mm/memcontrol.c            | 288 ++++++++++++++++++++++++++++++++++++-
>  2 files changed, 338 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 93dbc7f9d8b8..c69e66fe4f12 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -23,6 +23,7 @@
>  #include <linux/page-flags.h>
>
>  struct mem_cgroup;
> +struct obj_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> @@ -192,6 +193,22 @@ struct memcg_cgwb_frn {
>         struct wb_completion done;     /* tracks in-flight foreign writebacks */
>  };
>
> +/*
> + * Bucket for arbitrarily byte-sized objects charged to a memory
> + * cgroup. The bucket can be reparented in one piece when the cgroup
> + * is destroyed, without having to round up the individual references
> + * of all live memory objects in the wild.
> + */
> +struct obj_cgroup {
> +        struct percpu_ref refcnt;
> +        struct mem_cgroup *memcg;
> +        atomic_t nr_charged_bytes;

So, we still charge the memcg page counter in pages but keep the
remaining sub-page slack charge in nr_charged_bytes, right? (I sketch
my reading right after this hunk.)

> +        union {
> +                struct list_head list;
> +                struct rcu_head rcu;
> +        };
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
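As a sanity check on my reading above, here is a toy userspace model,
not patch code (model_charge and MODEL_PAGE_SIZE are invented names):
charged bytes are rounded down to whole pages for the memcg page
counter, and the sub-page remainder is what would sit in
objcg->nr_charged_bytes.

/*
 * Toy userspace model, not patch code: illustrates the split between
 * the page counter and the byte-sized remainder as I read it.
 */
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096UL

static void model_charge(unsigned long nr_bytes)
{
        /* Whole pages that would hit the memcg page counter. */
        unsigned long nr_pages = nr_bytes / MODEL_PAGE_SIZE;
        /* Sub-page slack that would stay in objcg->nr_charged_bytes. */
        unsigned long slack = nr_bytes % MODEL_PAGE_SIZE;

        printf("%5lu bytes -> %lu page(s) + %4lu byte(s) of slack\n",
               nr_bytes, nr_pages, slack);
}

int main(void)
{
        model_charge(100);   /* small object: no pages, 100 bytes slack */
        model_charge(4096);  /* exactly one page, no slack */
        model_charge(5000);  /* one page plus 904 bytes of slack */
        return 0;
}

If that reading is right, the slack per objcg should stay below one
page, modulo whatever the per-cpu stocks are caching, correct?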
> @@ -301,6 +318,8 @@ struct mem_cgroup {
>         int kmemcg_id;
>         enum memcg_kmem_state kmem_state;
>         struct list_head kmem_caches;
> +        struct obj_cgroup __rcu *objcg;
> +        struct list_head objcg_list;
>  #endif
>
[snip]
> +
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> +                                  struct mem_cgroup *parent)
> +{
> +        struct obj_cgroup *objcg, *iter;
> +
> +        objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> +
> +        spin_lock_irq(&css_set_lock);
> +
> +        /* Move active objcg to the parent's list */
> +        xchg(&objcg->memcg, parent);
> +        css_get(&parent->css);
> +        list_add(&objcg->list, &parent->objcg_list);

So, memcg->objcg_list will always contain only the objcgs of offlined
descendants. I would recommend renaming objcg_list to show that
clearly. Maybe offlined_objcg_list or descendants_objcg_list or
something else.

> +
> +        /* Move already reparented objcgs to the parent's list */
> +        list_for_each_entry(iter, &memcg->objcg_list, list) {
> +                css_get(&parent->css);
> +                xchg(&iter->memcg, parent);
> +                css_put(&memcg->css);
> +        }
> +        list_splice(&memcg->objcg_list, &parent->objcg_list);
> +
> +        spin_unlock_irq(&css_set_lock);
> +
> +        percpu_ref_kill(&objcg->refcnt);
> +}
> +
>  /*
[snip]
>
> +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> +{
> +        struct obj_cgroup *objcg = NULL;
> +        struct mem_cgroup *memcg;
> +
> +        if (unlikely(!current->mm))
> +                return NULL;

I have not seen the users of this function yet, but shouldn't the
above check be (!current->mm && !current->active_memcg)? Do we need a
mem_cgroup_disabled() check as well?

> +
> +        rcu_read_lock();
> +        if (unlikely(current->active_memcg))
> +                memcg = rcu_dereference(current->active_memcg);
> +        else
> +                memcg = mem_cgroup_from_task(current);
> +
> +        for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +                objcg = rcu_dereference(memcg->objcg);
> +                if (objcg && obj_cgroup_tryget(objcg))
> +                        break;
> +        }
> +        rcu_read_unlock();
> +
> +        return objcg;
> +}
> +
[...]
> +
> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +{
> +        struct memcg_stock_pcp *stock;
> +        unsigned long flags;
> +
> +        local_irq_save(flags);
> +
> +        stock = this_cpu_ptr(&memcg_stock);
> +        if (stock->cached_objcg != objcg) { /* reset if necessary */
> +                drain_obj_stock(stock);
> +                obj_cgroup_get(objcg);
> +                stock->cached_objcg = objcg;
> +                stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> +        }
> +        stock->nr_bytes += nr_bytes;
> +
> +        if (stock->nr_bytes > PAGE_SIZE)
> +                drain_obj_stock(stock);

The normal stock can go up to 32*nr_cpus*PAGE_SIZE. I am wondering if
just PAGE_SIZE is too small for the obj stock; see the toy model at
the end of this mail.

> +
> +        local_irq_restore(flags);
> +}
> +
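To put a rough number on the drain-threshold question, here is another
toy userspace model, not patch code (model_obj_stock and model_refill
are invented names, and the model just zeroes the stock on drain
rather than moving leftover bytes around). It counts how often one
cpu's stock would drain under a stream of 512-byte refills, comparing
a PAGE_SIZE threshold with a hypothetical 32-page one:

#include <stdio.h>

#define MODEL_PAGE_SIZE 4096UL

struct model_obj_stock {
        unsigned long nr_bytes; /* cached, not yet flushed bytes */
        unsigned long drains;   /* number of flushes so far */
};

static void model_refill(struct model_obj_stock *stock,
                         unsigned long nr_bytes, unsigned long threshold)
{
        stock->nr_bytes += nr_bytes;
        /* Mirrors the check in refill_obj_stock(): flush once above the threshold. */
        if (stock->nr_bytes > threshold) {
                stock->nr_bytes = 0;    /* flushed back to the objcg/memcg */
                stock->drains++;
        }
}

int main(void)
{
        unsigned long thresholds[] = { MODEL_PAGE_SIZE, 32 * MODEL_PAGE_SIZE };

        for (int t = 0; t < 2; t++) {
                struct model_obj_stock stock = { 0, 0 };

                /* 10000 sub-page (512-byte) uncharges on a single cpu. */
                for (int i = 0; i < 10000; i++)
                        model_refill(&stock, 512, thresholds[t]);

                printf("threshold %6lu bytes -> %lu drains\n",
                       thresholds[t], stock.drains);
        }
        return 0;
}

In the model the PAGE_SIZE threshold drains about every ninth refill,
versus roughly every 257th with the larger cap. So if
drain_obj_stock() shows up in profiles, it may be worth experimenting
with a bigger threshold, at the cost of more bytes cached per cpu.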