On Thu, Oct 31, 2019 at 03:07:02PM +0000, Roman Gushchin wrote:
> On Thu, Oct 31, 2019 at 10:41:51AM -0400, Johannes Weiner wrote:
> > On Thu, Oct 31, 2019 at 01:52:44AM +0000, Roman Gushchin wrote:
> > > On Fri, Oct 25, 2019 at 03:41:18PM -0400, Johannes Weiner wrote:
> > > > @@ -3117,15 +3095,24 @@ void __memcg_kmem_uncharge(struct page *page, int order)
> > > >  	css_put_many(&memcg->css, nr_pages);
> > > >  }
> > > >  
> > > > -int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
> > > > -				gfp_t gfp)
> > > > +int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size, gfp_t gfp)
> > > >  {
> > > > -	return try_charge(memcg, gfp, size, true);
> > > > +	int ret;
> > > > +
> > > > +	if (consume_obj_stock(objcg, nr_bytes))
> > > > +		return 0;
> > > > +
> > > > +	ret = try_charge(objcg->memcg, gfp, 1);
> > > > +	if (ret)
> > > > +		return ret;
> > >
> > > The second problem is also here. If a task belonging to a different memcg
> > > is scheduled on this cpu, most likely we will need to refill both stocks,
> > > even if we need only a small temporary allocation.
> >
> > Yes, that's a good thing. The reason we have the per-cpu caches in the
> > first place is because most likely the same cgroup will perform
> > several allocations. Both the slab allocator and the page allocator
> > have per-cpu caches for the same reason. I don't really understand
> > what the argument is.
>
> I mean it seems strange (and most likely will show up in perf numbers)
> to move a page from one stock to another. Is there a reason why you want
> to ask try_charge() and stock only a single page?
>
> Can we do the following instead?
>
> 1) add a boolean argument to try_charge() to bypass the consume_stock() call
> at the beginning and just go slow path immediately
> 2) use try_charge() with this argument set to true to fill the objc/subpage
> stock with MEMCG_CHARGE_BATCH pages

No, think this through. If you have disjunct caches for the
page_counter, it means the cache work cannot be shared. A slab
allocation has to hit the page_counter, and a subsequent page
allocation has to hit it again; likewise, a slab allocation cannot
benefit from the caching of prior page allocations.

You're trading cheap, unlocked, cpu-local subtractions against costly
atomic RMW ops on shared cachelines. You also double the amount of
cached per-cpu memory and introduce a layering violation.

Hotpath (bytes cached)

stacked:                          disjunct:
consume_subpage_stock()           try_charge()
                                    consume_subpage_stock()

Warmpath (pages cached)

stacked:                          disjunct:
consume_subpage_stock()           try_charge()
try_charge()                        consume_subpage_stock()
  consume_stock()                   page_counter_charge()
refill_subpage_stock()              refill_subpage_stock()

Coldpath (nothing cached)

stacked:                          disjunct:
consume_subpage_stock()           try_charge()
try_charge()                        consume_subpage_stock()
  consume_stock()                   page_counter_charge()
  page_counter_charge()             refill_subpage_stock()
  refill_stock()
refill_subpage_stock()
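
To make the warmpath difference concrete, here is a rough user-space
model of the two schemes. It is only an illustrative sketch, assuming
the disjunct byte stock is refilled straight from the page_counter
while the stacked one refills through the regular page stock; the
helper names (obj_charge_stacked(), obj_charge_disjunct(), the hit
counter) are made up for this example and none of it is memcontrol.c
code:

/*
 * Illustrative sketch only: a user-space model of the "stacked" vs.
 * "disjunct" stock schemes.  The helper names echo the thread, but
 * this is not memcontrol.c code, and the batch sizes are made up.
 */
#include <stdatomic.h>
#include <stdio.h>

#define PAGE_SIZE	4096
#define CHARGE_BATCH	64	/* stand-in for MEMCG_CHARGE_BATCH */

static atomic_long page_counter;	/* shared counter, atomic RMW to update */
static atomic_long page_counter_hits;	/* how often the shared cacheline was touched */

static long page_stock;			/* per-cpu cache of pre-charged pages */
static long byte_stock;			/* per-cpu cache of pre-charged bytes */

static void page_counter_charge(long nr_pages)
{
	atomic_fetch_add(&page_counter, nr_pages);
	atomic_fetch_add(&page_counter_hits, 1);
}

/* try_charge(): cpu-local page stock first, one batched atomic op on a miss */
static void try_charge(long nr_pages)
{
	if (page_stock >= nr_pages) {		/* consume_stock() */
		page_stock -= nr_pages;
		return;
	}
	page_counter_charge(CHARGE_BATCH);
	page_stock += CHARGE_BATCH - nr_pages;	/* refill_stock() */
}

/* stacked: the byte stock is refilled through try_charge()/the page stock */
static void obj_charge_stacked(long nr_bytes)
{
	if (byte_stock >= nr_bytes) {		/* consume_subpage_stock() */
		byte_stock -= nr_bytes;
		return;
	}
	try_charge(1);
	byte_stock += PAGE_SIZE - nr_bytes;	/* refill_subpage_stock() */
}

/* disjunct: the byte stock is refilled straight from the page_counter */
static void obj_charge_disjunct(long nr_bytes)
{
	if (byte_stock >= nr_bytes) {
		byte_stock -= nr_bytes;
		return;
	}
	page_counter_charge(1);			/* atomic RMW even in the warmpath */
	byte_stock += PAGE_SIZE - nr_bytes;
}

int main(void)
{
	long before;

	/* warmpath: a prior page allocation has left pages in the page stock */
	try_charge(1);

	byte_stock = 0;
	before = atomic_load(&page_counter_hits);
	obj_charge_stacked(64);
	printf("stacked  warmpath: %ld page_counter hit(s)\n",
	       atomic_load(&page_counter_hits) - before);

	byte_stock = 0;
	before = atomic_load(&page_counter_hits);
	obj_charge_disjunct(64);
	printf("disjunct warmpath: %ld page_counter hit(s)\n",
	       atomic_load(&page_counter_hits) - before);

	return 0;
}

Run as written, the sketch reports zero page_counter hits for the
stacked warmpath and one for the disjunct warmpath; that atomic op on
the shared cacheline is what the tables above are pointing at.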