On Thu, May 07, 2020 at 05:03:14PM -0400, Johannes Weiner wrote: > On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote: > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr) > > } > > > > #ifdef CONFIG_MEMCG_KMEM > > +extern spinlock_t css_set_lock; > > + > > +static void obj_cgroup_release(struct percpu_ref *ref) > > +{ > > + struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt); > > + struct mem_cgroup *memcg; > > + unsigned int nr_bytes; > > + unsigned int nr_pages; > > + unsigned long flags; > > + > > + nr_bytes = atomic_read(&objcg->nr_charged_bytes); > > + WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)); > > + nr_pages = nr_bytes >> PAGE_SHIFT; > > What guarantees that we don't have a partial page in there at this > point? I guess any outstanding allocations would pin the objcg, so > when it's released all objects have been freed. Right, this is exactly the reason why there can't be a partial page at this point. > > But if that's true, how can we have full pages remaining in there now? Imagine the following sequence: 1) CPU0: objcg == stock->cached_objcg 2) CPU1: we do a small allocation (e.g. 92 bytes), page is charged 3) CPU1: a process from another memcg is allocating something, stock if flushed, objcg->nr_charged_bytes = PAGE_SIZE - 92 5) CPU0: we do release this object, 92 bytes are added to stock->nr_bytes 6) CPU0: stock is flushed, 92 bytes are added to objcg->nr_charged_bytes In the result, nr_charged_bytes == PAGE_SIZE. This PAGE will be uncharged in obj_cgroup_release(). I've double checked this, it's actually pretty easy to trigger in the real life. > > > @@ -2723,6 +2820,30 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p) > > return page->mem_cgroup; > > } > > > > +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) > > +{ > > + struct obj_cgroup *objcg = NULL; > > + struct mem_cgroup *memcg; > > + > > + if (unlikely(!current->mm)) > > + return NULL; > > + > > + rcu_read_lock(); > > + if (unlikely(current->active_memcg)) > > + memcg = rcu_dereference(current->active_memcg); > > + else > > + memcg = mem_cgroup_from_task(current); > > + > > + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) { > > + objcg = rcu_dereference(memcg->objcg); > > + if (objcg && obj_cgroup_tryget(objcg)) > > + break; > > + } > > + rcu_read_unlock(); > > + > > + return objcg; > > +} > > Thanks for moving this here from one of the later patches, it helps > understanding the life cycle of obj_cgroup better. > > > +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) > > +{ > > + struct mem_cgroup *memcg; > > + unsigned int nr_pages, nr_bytes; > > + int ret; > > + > > + if (consume_obj_stock(objcg, size)) > > + return 0; > > + > > + rcu_read_lock(); > > + memcg = obj_cgroup_memcg(objcg); > > + css_get(&memcg->css); > > + rcu_read_unlock(); > > + > > + nr_pages = size >> PAGE_SHIFT; > > + nr_bytes = size & (PAGE_SIZE - 1); > > + > > + if (nr_bytes) > > + nr_pages += 1; > > + > > + ret = __memcg_kmem_charge(memcg, gfp, nr_pages); > > If consume_obj_stock() fails because some other memcg is cached, > should this try to consume the partial page in objcg->nr_charged_bytes > before getting more pages? We can definitely do it, but I'm not sure if it's good for the performance. Dealing with nr_charged_bytes will require up to two atomic writes, so calling __memcg_kmem_charge() can be faster if memcg is cached on percpu stock. > > > + if (!ret && nr_bytes) > > + refill_obj_stock(objcg, PAGE_SIZE - nr_bytes); > > This will put the cgroup into the cache if the allocation resulted in > a partially free page. > > But if this was a page allocation, we may have objcg->nr_cache_bytes > from a previous subpage allocation that we should probably put back > into the stock. Yeah, we can do this, but I don't know if there will be any benefits. Actually we don't wanna to touch objcg->nr_cache_bytes too often, as it can become a contention point if there are many threads allocating in the memory cgroup. So maybe we want to do the opposite: relax it a bit and stop flushing it on every stock refill and flush only if it exceeds a certain value. > > It's not a common case, I'm just trying to understand what > objcg->nr_cache_bytes holds and when it does so. So it's actually a centralized leftover from the rounding of the actual charge to the page size. > > The rest of this patch looks good to me! Great! Thank you very much for the review!