On Thu, Sep 24, 2020 at 03:45:08PM -0400, Johannes Weiner wrote:
> On Tue, Sep 22, 2020 at 01:36:57PM -0700, Roman Gushchin wrote:
> > Currently there are many open-coded reads and writes of the
> > page->mem_cgroup pointer, as well as a couple of read helpers,
> > which are barely used.
> > 
> > It creates an obstacle on the way to reusing some bits of the pointer
> > for storing additional bits of information. In fact, we already do
> > this for slab pages, where the last bit indicates that a pointer has
> > an attached vector of objcg pointers instead of a regular memcg
> > pointer.
> > 
> > This commit introduces 4 new helper functions and converts all
> > raw accesses to page->mem_cgroup to calls of these helpers:
> > struct mem_cgroup *page_mem_cgroup(struct page *page);
> > struct mem_cgroup *page_mem_cgroup_check(struct page *page);
> > void set_page_mem_cgroup(struct page *page, struct mem_cgroup *memcg);
> > void clear_page_mem_cgroup(struct page *page);
> 
> Sounds reasonable to me!
> 
> > page_mem_cgroup_check() is intended to be used in cases when the page
> > can be a slab page and have a memcg pointer pointing at an objcg vector.
> > It does check the lowest bit, and if set, returns NULL.
> > page_mem_cgroup() contains a VM_BUG_ON_PAGE() check for the page not
> > being a slab page. So do set_page_mem_cgroup() and clear_page_mem_cgroup().
> > 
> > To make sure nobody uses a direct access, struct page's
> > mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
> > Only the new helpers and a couple of slab-accounting related functions
> > access this field directly.
> > 
> > The page_memcg() and page_memcg_rcu() helpers defined in mm.h are removed.
> > The new page_mem_cgroup() is a direct analog of page_memcg(), while
> > page_memcg_rcu() has a single call site in a small rcu-read-lock
> > section, so it's just not worth it to have a separate helper. So
> > it's replaced with page_mem_cgroup() too.
> 
> page_memcg_rcu() does READ_ONCE().
> We need to keep that for lockless
> accesses.

Ok, how about page_memcg() and page_objcgs(), which always do READ_ONCE()?
Because page_memcg_rcu() has only a single call site, I would prefer to
have one helper instead of two.

> 
> > @@ -343,6 +343,72 @@ struct mem_cgroup {
> >  
> >  extern struct mem_cgroup *root_mem_cgroup;
> >  
> > +/*
> > + * page_mem_cgroup - get the memory cgroup associated with a page
> > + * @page: a pointer to the page struct
> > + *
> > + * Returns a pointer to the memory cgroup associated with the page,
> > + * or NULL. This function assumes that the page is known to have a
> > + * proper memory cgroup pointer. It's not safe to call this function
> > + * against some types of pages, e.g. slab pages or ex-slab pages.
> > + */
> > +static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(PageSlab(page), page);
> > +	return (struct mem_cgroup *)page->memcg_data;
> > +}
> 
> This would also be a good place to mention what's required for the
> function to be called safely, or in a way that produces a stable
> result - i.e. the list of conditions in commit_charge().

Makes sense.

> 
> > + * page_mem_cgroup_check - get the memory cgroup associated with a page
> > + * @page: a pointer to the page struct
> > + *
> > + * Returns a pointer to the memory cgroup associated with the page,
> > + * or NULL. This function, unlike page_mem_cgroup(), can take any page
> > + * as an argument. It has to be used in cases when it's not known if a page
> > + * has an associated memory cgroup pointer or an object cgroups vector.
> > + */
> > +static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
> > +{
> > +	unsigned long memcg_data = page->memcg_data;
> > +
> > +	/*
> > +	 * The lowest bit set means that memcg isn't a valid
> > +	 * memcg pointer, but an obj_cgroups pointer.
> > +	 * In this case the page is shared and doesn't belong
> > +	 * to any specific memory cgroup.
> > +	 */
> > +	if (memcg_data & 0x1UL)
> > +		return NULL;
> > +
> > +	return (struct mem_cgroup *)memcg_data;
> > +}
> 
> Here as well.
> 
> > +
> > +/*
> > + * set_page_mem_cgroup - associate a page with a memory cgroup
> > + * @page: a pointer to the page struct
> > + * @memcg: a pointer to the memory cgroup
> > + *
> > + * Associates a page with a memory cgroup.
> > + */
> > +static inline void set_page_mem_cgroup(struct page *page,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	VM_BUG_ON_PAGE(PageSlab(page), page);
> > +	page->memcg_data = (unsigned long)memcg;
> > +}
> > +
> > +/*
> > + * clear_page_mem_cgroup - clear an association of a page with a memory cgroup
> > + * @page: a pointer to the page struct
> > + *
> > + * Clears an association of a page with a memory cgroup.
> > + */
> > +static inline void clear_page_mem_cgroup(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(PageSlab(page), page);
> > +	page->memcg_data = 0;
> > +}
> > +
> >  static __always_inline bool memcg_stat_item_in_bytes(int idx)
> >  {
> >  	if (idx == MEMCG_PERCPU_B)
> > @@ -743,15 +809,15 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
> >  static inline void __mod_memcg_page_state(struct page *page,
> >  					  int idx, int val)
> >  {
> > -	if (page->mem_cgroup)
> > -		__mod_memcg_state(page->mem_cgroup, idx, val);
> > +	if (page_mem_cgroup(page))
> > +		__mod_memcg_state(page_mem_cgroup(page), idx, val);
> >  }
> >  
> >  static inline void mod_memcg_page_state(struct page *page,
> >  					int idx, int val)
> >  {
> > -	if (page->mem_cgroup)
> > -		mod_memcg_state(page->mem_cgroup, idx, val);
> > +	if (page_mem_cgroup(page))
> > +		mod_memcg_state(page_mem_cgroup(page), idx, val);
> >  }
> >  
> >  static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
> > @@ -838,12 +904,12 @@ static inline void __mod_lruvec_page_state(struct page *page,
> >  	struct lruvec *lruvec;
> >  
> >  	/* Untracked pages have no memcg, no lruvec.
> >  	   Update only the node */
> > -	if (!head->mem_cgroup) {
> > +	if (!page_mem_cgroup(head)) {
> >  		__mod_node_page_state(pgdat, idx, val);
> >  		return;
> >  	}
> >  
> > -	lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat);
> > +	lruvec = mem_cgroup_lruvec(page_mem_cgroup(head), pgdat);
> >  	__mod_lruvec_state(lruvec, idx, val);
> 
> The repetition of the function call is a bit jarring, especially in
> configs with VM_BUG_ON() enabled (some distros use it for their beta
> release kernels, so it's not just kernel developer test machines that
> pay this cost). Can you please use a local variable when the function
> needs the memcg more than once?

Sure.

> 
> > @@ -878,8 +944,8 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
> >  static inline void count_memcg_page_event(struct page *page,
> >  					  enum vm_event_item idx)
> >  {
> > -	if (page->mem_cgroup)
> > -		count_memcg_events(page->mem_cgroup, idx, 1);
> > +	if (page_mem_cgroup(page))
> > +		count_memcg_events(page_mem_cgroup(page), idx, 1);
> >  }
> >  
> >  static inline void count_memcg_event_mm(struct mm_struct *mm,
> > @@ -941,6 +1007,25 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> >  
> >  struct mem_cgroup;
> >  
> > +static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
> > +{
> > +	return NULL;
> > +}
> > +
> > +static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
> > +{
> > +	return NULL;
> > +}
> > +
> > +static inline void set_page_mem_cgroup(struct page *page,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +}
> > +
> > +static inline void clear_page_mem_cgroup(struct page *page)
> > +{
> > +}
> > +
> >  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> >  {
> >  	return true;
> > @@ -1430,7 +1515,7 @@ static inline void mem_cgroup_track_foreign_dirty(struct page *page,
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  
> > -	if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
> > +	if (unlikely(&page_mem_cgroup(page)->css != wb->memcg_css))
> >  		mem_cgroup_track_foreign_dirty_slowpath(page, wb);
> >  }
> >  
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 17e712207d74..5e24ff2ffec9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1476,28 +1476,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> >  #endif
> >  }
> >  
> > -#ifdef CONFIG_MEMCG
> > -static inline struct mem_cgroup *page_memcg(struct page *page)
> > -{
> > -	return page->mem_cgroup;
> > -}
> > -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > -{
> > -	WARN_ON_ONCE(!rcu_read_lock_held());
> > -	return READ_ONCE(page->mem_cgroup);
> > -}
> > -#else
> > -static inline struct mem_cgroup *page_memcg(struct page *page)
> > -{
> > -	return NULL;
> > -}
> > -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > -{
> > -	WARN_ON_ONCE(!rcu_read_lock_held());
> > -	return NULL;
> > -}
> > -#endif
> 
> You essentially renamed these existing helpers, but I don't think
> that's justified. Especially with the proliferation of callsites, the
> original names are nicer. I'd prefer we keep them.
> 
> > @@ -560,16 +560,7 @@ ino_t page_cgroup_ino(struct page *page)
> >  	unsigned long ino = 0;
> >  
> >  	rcu_read_lock();
> > -	memcg = page->mem_cgroup;
> > -
> > -	/*
> > -	 * The lowest bit set means that memcg isn't a valid
> > -	 * memcg pointer, but a obj_cgroups pointer.
> > -	 * In this case the page is shared and doesn't belong
> > -	 * to any specific memory cgroup.
> > -	 */
> > -	if ((unsigned long) memcg & 0x1UL)
> > -		memcg = NULL;
> > +	memcg = page_mem_cgroup_check(page);
> 
> This should actually have been using READ_ONCE() all along. Otherwise
> the compiler can issue multiple loads to page->mem_cgroup here and you
> can end up with a pointer with the lowest bit set leaking out.
> 
> > @@ -2928,17 +2918,6 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> >  
> >  	page = virt_to_head_page(p);
> >  
> > -	/*
> > -	 * If page->mem_cgroup is set, it's either a simple mem_cgroup pointer
> > -	 * or a pointer to obj_cgroup vector. In the latter case the lowest
> > -	 * bit of the pointer is set.
> > -	 * The page->mem_cgroup pointer can be asynchronously changed
> > -	 * from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed
> > -	 * from a valid memcg pointer to objcg vector or back.
> > -	 */
> > -	if (!page->mem_cgroup)
> > -		return NULL;
> > -
> >  	/*
> >  	 * Slab objects are accounted individually, not per-page.
> >  	 * Memcg membership data for each individual object is saved in
> > @@ -2956,8 +2935,14 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> >  		return NULL;
> >  	}
> >  
> > -	/* All other pages use page->mem_cgroup */
> > -	return page->mem_cgroup;
> > +	/*
> > +	 * page_mem_cgroup_check() is used here, because the
> > +	 * page_has_obj_cgroups() check above could fail because the object
> > +	 * cgroups vector wasn't set at that moment, but it can be set
> > +	 * concurrently.
> > +	 * page_mem_cgroup_check(page) will guarantee that a proper memory
> > +	 * cgroup pointer or NULL will be returned.
> > +	 */
> > +	return page_mem_cgroup_check(page);
> 
> The code right now doesn't look quite safe. As per above, without the
> READ_ONCE the compiler might issue multiple loads and we may get a
> pointer with the low bit set.
> 
> Maybe slightly off-topic, but what are "all other pages" in general?
> I don't see any callsites that ask for ownership on objects whose
> backing pages may belong to a single memcg. That wouldn't seem to make
> too much sense. Unless I'm missing something, this function should
> probably tighten up its scope a bit and only work on stuff that is
> actually following the obj_cgroup protocol.

Kernel stacks can be slabs or generic pages/vmallocs. Also, large
kmallocs use the page allocator, so they don't follow the objcg
protocol.

Thanks!