On Tue, Sep 17, 2019 at 10:50:04AM +0200, Johannes Weiner wrote:
> On Tue, Sep 17, 2019 at 02:27:19AM +0000, Roman Gushchin wrote:
> > On Mon, Sep 16, 2019 at 02:56:11PM +0200, Johannes Weiner wrote:
> > > On Thu, Sep 05, 2019 at 02:45:45PM -0700, Roman Gushchin wrote:
> > > > Introduce an API to charge subpage objects to the memory cgroup.
> > > > The API will be used by the new slab memory controller. Later it
> > > > can also be used to implement percpu memory accounting.
> > > > In both cases, a single page can be shared between multiple cgroups
> > > > (and in percpu case a single allocation is split over multiple pages),
> > > > so it's not possible to use page-based accounting.
> > > >
> > > > The implementation is based on percpu stocks. Memory cgroups are still
> > > > charged in pages, and the residue is stored in percpu stock, or on the
> > > > memcg itself, when it's necessary to flush the stock.
> > >
> > > Did you just implement a slab allocator for page_counter to track
> > > memory consumed by the slab allocator?
> >
> > :)
> >
> > >
> > > > @@ -2500,8 +2577,9 @@ void mem_cgroup_handle_over_high(void)
> > > >  }
> > > >
> > > >  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > -		      unsigned int nr_pages)
> > > > +		      unsigned int amount, bool subpage)
> > > >  {
> > > > +	unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
> > > >  	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> > > >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > > >  	struct mem_cgroup *mem_over_limit;
> > > > @@ -2514,7 +2592,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  	if (mem_cgroup_is_root(memcg))
> > > >  		return 0;
> > > >  retry:
> > > > -	if (consume_stock(memcg, nr_pages))
> > > > +	if (subpage && consume_subpage_stock(memcg, amount))
> > > > +		return 0;
> > > > +	else if (!subpage && consume_stock(memcg, nr_pages))
> > > >  		return 0;
> > >
> > > The layering here isn't clean.
> > > We have an existing per-cpu cache to
> > > batch-charge the page counter. Why does the new subpage allocator not
> > > sit on *top* of this, instead of wedged in between?
> > >
> > > I think what it should be is a try_charge_bytes() that simply gets one
> > > page from try_charge() and then does its byte tracking, regardless of
> > > how try_charge() chooses to implement its own page tracking.
> > >
> > > That would avoid the awkward @amount + @subpage multiplexing, as well
> > > as annotating all existing callsites of try_charge() with a
> > > non-descript "false" parameter.
> > >
> > > You can still reuse the stock data structures, use the lower bits of
> > > stock->nr_bytes for a different cgroup etc., but the charge API should
> > > really be separate.
> >
> > Hm, I kinda like the idea, however there is a complication: for the subpage
> > accounting the css reference management is done in a different way, so that
> > all existing code should avoid changing the css refcounter. So I'd need
> > to pass a boolean argument anyway.
>
> Can you elaborate on the refcounting scheme? I don't quite understand
> how there would be complications with that.
>
> Generally, references are held for each page that is allocated in the
> page_counter. try_charge() allocates a batch of css references,
> returns one and keeps the rest in stock.
>
> So couldn't the following work?
>
> When somebody allocates a subpage, the css reference returned by
> try_charge() is shared by the allocated subpage object and the
> remainder that is kept via stock->subpage_cache and stock->nr_bytes
> (or memcg->nr_stocked_bytes when the percpu cache is reset).

Because individual objects are subject to reparenting and can outlive
their original memory cgroup, they shouldn't hold a direct reference to
the memory cgroup. Instead they hold a reference to the mem_cgroup_ptr
object, and this object holds a single reference to the memory cgroup.
Underlying pages shouldn't hold a reference either.
Btw, this is already the case: kmem_cache plays the role of such an
intermediate object, and we do an explicit transfer of charge (look at
memcg_charge_slab()). So we initially associate a page with the memcg,
and almost immediately afterwards break this association and insert
kmem_cache in between. But with subpage accounting that's not possible,
as a page is shared between multiple cgroups, and it can't be attributed
to any specific cgroup at any time.

>
> When the subpage objects are freed, you'll eventually have a full page
> again in stock->nr_bytes, at which point you page_counter_uncharge()
> paired with css_put(_many) as per usual.
>
> A remainder left in old->nr_stocked_bytes would continue to hold on to
> one css reference. (I don't quite understand who is protecting this
> remainder in your current version, actually. A bug?)
>
> Instead of doing your own batched page_counter uncharging in
> refill_subpage_stock() -> drain_subpage_stock(), you should be able to
> call refill_stock() when stock->nr_bytes adds up to a whole page again.
>
> Again, IMO this would be much cleaner architecture if there was a
> try_charge_bytes() byte allocator that would sit on top of a cleanly
> abstracted try_charge() page allocator, just like the slab allocator
> is sitting on top of the page allocator - instead of breaking through
> the abstraction layer of the underlying page allocator.
>

As I said, I like the idea of putting it on top, but it can't be put on
top without changes in css refcounting (or I don't see how). I don't know
how to mix stocks which are holding css references and which are not, so
I might end up with two stocks, as in the current implementation. Then the
idea of having another layer of caching on top looks slightly less
appealing, but maybe still worth a try.

Thanks!
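
For reference, the layering discussed above can be modelled in user space
roughly as follows. This is only a sketch under simplifying assumptions,
not kernel code: struct memcg_model, uncharge() and uncharge_bytes() are
hypothetical names invented for illustration, try_charge() here is a toy
stand-in for the real function, and try_charge_bytes() is just the name
suggested in the quoted review. It ignores per-cpu stocks, reclaim, limits
and reparenting, and it follows the scheme in which the remainder shares
the page's reference, so it does not address the mem_cgroup_ptr concern
raised in the reply.

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/* Toy stand-in for a memory cgroup; NOT the kernel's struct mem_cgroup. */
struct memcg_model {
	unsigned int charged_pages;	/* pages charged to the page counter */
	unsigned int refs;		/* one css-like reference per charged page */
	unsigned int stock_bytes;	/* unused remainder of partially used pages */
};

/* Page layer: charge nr_pages and take one reference per page. */
static int try_charge(struct memcg_model *m, unsigned int nr_pages)
{
	m->charged_pages += nr_pages;
	m->refs += nr_pages;
	return 0;
}

/* Page layer (hypothetical helper): uncharge pages and drop their references. */
static void uncharge(struct memcg_model *m, unsigned int nr_pages)
{
	assert(m->charged_pages >= nr_pages && m->refs >= nr_pages);
	m->charged_pages -= nr_pages;
	m->refs -= nr_pages;
}

/*
 * Byte layer: serve the request from the remainder and only ask the page
 * layer for whole pages when the remainder is too small.
 */
static int try_charge_bytes(struct memcg_model *m, unsigned int bytes)
{
	if (m->stock_bytes < bytes) {
		unsigned int need = bytes - m->stock_bytes;
		unsigned int nr_pages = (need + PAGE_SIZE - 1) / PAGE_SIZE;

		if (try_charge(m, nr_pages))
			return -1;
		m->stock_bytes += nr_pages * PAGE_SIZE;
	}
	m->stock_bytes -= bytes;
	return 0;
}

/*
 * Byte layer (hypothetical helper): return bytes to the remainder and hand
 * whole pages back to the page layer once they add up again.
 */
static void uncharge_bytes(struct memcg_model *m, unsigned int bytes)
{
	m->stock_bytes += bytes;
	if (m->stock_bytes >= PAGE_SIZE) {
		uncharge(m, m->stock_bytes / PAGE_SIZE);
		m->stock_bytes %= PAGE_SIZE;
	}
}
```

In this model, charging 100 bytes and then 200 bytes takes a single page
and a single reference from the page layer; freeing both objects makes the
remainder add up to a whole page again, at which point the page is
uncharged and the reference is dropped. The page layer never sees a byte
count or a "subpage" flag.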