On Tue, Oct 04, 2022 at 09:18:26AM -0700, Roman Gushchin wrote:
> On Mon, Oct 03, 2022 at 06:01:35PM +0300, Alexander Fedorov wrote:
> > On 03.10.2022 17:27, Michal Hocko wrote:
> > > On Mon 03-10-22 17:09:15, Alexander Fedorov wrote:
> > >> On 03.10.2022 16:32, Michal Hocko wrote:
> > >>> On Mon 03-10-22 15:47:10, Alexander Fedorov wrote:
> > >>>> @@ -3197,17 +3197,30 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> > >>>>  		stock->nr_bytes = 0;
> > >>>>  	}
> > >>>>
> > >>>> -	obj_cgroup_put(old);
> > >>>> +	/*
> > >>>> +	 * Clear pointer before freeing memory so that
> > >>>> +	 * drain_all_stock() -> obj_stock_flush_required()
> > >>>> +	 * does not see a freed pointer.
> > >>>> +	 */
> > >>>>  	stock->cached_objcg = NULL;
> > >>>> +	obj_cgroup_put(old);
> > >>>
> > >>> Do we need barrier() or something else to ensure there is no reordering?
> > >>> I am not really sure what kind of barriers are implied by the pcp ref
> > >>> counting.
> > >>
> > >> obj_cgroup_put() -> kfree_rcu() -> synchronize_rcu() should take care
> > >> of this:
> > >
> > > This is a very subtle guarantee. Also it would only apply if this is the
> > > last reference, right?
> >
> > Hmm, yes, for the last reference only, also not sure about pcp ref
> > counter ordering rules for previous references.
> >
> > > Is there any reason to not use
> > > 	WRITE_ONCE(stock->cached_objcg, NULL);
> > > 	obj_cgroup_put(old);
> > >
> > > IIRC this should prevent any reordering.
> >
> > Now that I think about it we actually must use WRITE_ONCE everywhere
> > when writing cached_objcg because otherwise compiler might split the
> > pointer-sized store into several smaller-sized ones (store tearing),
> > and obj_stock_flush_required() would read garbage instead of pointer.
> >
> > And thinking about memory barriers, maybe we need them too alongside
> > WRITE_ONCE when setting pointer to non-null value?  Otherwise
> > drain_all_stock() -> obj_stock_flush_required() might read old data.
> > Since that's exactly what rcu_assign_pointer() does, it seems
> > that we are going back to using rcu_*() primitives everywhere?
>
> Hm, Idk, I'm still somewhat resistant to the idea of putting rcu primitives,
> but maybe it's the right thing. Maybe instead we should always schedule draining
> on all cpus instead and perform a cpu-local check and bail out if a flush is not
> required? Michal, Johannes, what do you think?

I agree it's overkill.

This is a speculative check, and we don't need any state coherency,
just basic lifetime. READ_ONCE should fully address this problem. That
said, I think the code could be a bit clearer and better documented.

How about the below?

(Never mind the ifdef, I'm working on removing CONFIG_MEMCG_KMEM
altogether, as it's a really strange way to say !SLOB at this point)

---

From 22855af38b116ec030286975ed2aa06851680296 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Wed, 12 Oct 2022 12:59:07 -0400
Subject: [PATCH] mm: memcontrol: fix NULL deref race condition during cgroup deletion

Alexander Fedorov reports a race condition between two concurrent
stock draining operations, where the first one clears the stock's obj
pointer between the pointer test and deref of the second.

Analysis:

1) First CPU:
   css_killed_work_fn() -> mem_cgroup_css_offline() ->
   drain_all_stock() -> obj_stock_flush_required()
	if (stock->cached_objcg) {

   This check sees a non-NULL pointer for *another* CPU's `memcg_stock`
   instance.

2) Second CPU:
   css_free_rwork_fn() -> __mem_cgroup_free() -> free_percpu() ->
   obj_cgroup_uncharge() -> drain_obj_stock()

   It frees the `cached_objcg` pointer in its own `memcg_stock` instance:
	struct obj_cgroup *old = stock->cached_objcg;
	< ... >
	obj_cgroup_put(old);
	stock->cached_objcg = NULL;

3) First CPU continues after the 'if' check and re-reads the pointer;
   it is now NULL, and dereferencing it leads to a kernel panic:
	static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
					     struct mem_cgroup *root_memcg)
	{
		< ... >
		if (stock->cached_objcg) {
			memcg = obj_cgroup_memcg(stock->cached_objcg);

There is already RCU protection in place to ensure lifetime. Add the
missing READ_ONCE to the cgroup pointers to fix the TOCTOU, and
consolidate and document the speculative code.

Reported-by: Alexander Fedorov <halcien@xxxxxxxxx>
Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 mm/memcontrol.c | 44 ++++++++++++++++++++------------------------
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..09ac2f8991ee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2190,8 +2190,6 @@ static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
 static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg);
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
 
 #else
@@ -2199,11 +2197,6 @@ static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	return NULL;
 }
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg)
-{
-	return false;
-}
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
 {
 }
@@ -2339,13 +2332,30 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		struct mem_cgroup *memcg;
 		bool flush = false;
 
+		/*
+		 * Speculatively check up front if this CPU has any
+		 * cached charges that belong to the specified
+		 * root_memcg. The state may change from under us -
+		 * which is okay, because the draining itself is a
+		 * best-effort operation. Just ensure lifetime of
+		 * whatever we end up looking at.
+		 */
 		rcu_read_lock();
-		memcg = stock->cached;
+		memcg = READ_ONCE(stock->cached);
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
-		else if (obj_stock_flush_required(stock, root_memcg))
-			flush = true;
+#ifdef CONFIG_MEMCG_KMEM
+		else {
+			struct obj_cgroup *objcg;
+
+			objcg = READ_ONCE(stock->cached_objcg);
+			if (objcg && stock->nr_bytes &&
+			    mem_cgroup_is_descendant(obj_cgroup_memcg(objcg),
+						     root_memcg))
+				flush = true;
+		}
+#endif
 		rcu_read_unlock();
 
 		if (flush &&
@@ -3297,20 +3307,6 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 	return old;
 }
 
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg)
-{
-	struct mem_cgroup *memcg;
-
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
-		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
-			return true;
-	}
-
-	return false;
-}
-
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
-- 
2.37.3
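
For anyone following along outside the kernel tree, the test-then-reload
pattern from the analysis, and the single-snapshot form the patch switches
to, can be sketched in plain user-space C. This is only an illustration
under assumptions: READ_ONCE_SKETCH, struct owner, struct obj, struct stock
and both helper functions are hypothetical stand-ins, not the kernel's
definitions, and the sketch does nothing about object lifetime (in the patch
that part is covered by the existing rcu_read_lock() section).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for READ_ONCE(): one volatile load. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-ins for mem_cgroup, obj_cgroup and memcg_stock_pcp. */
struct owner { int id; };
struct obj { struct owner *owner; };

struct stock {
	struct obj *cached_obj;	/* may be cleared concurrently by its owner */
	unsigned int nr_bytes;
};

/* Racy shape: the field is loaded once for the test and again for the deref. */
static bool flush_required_racy(struct stock *s, const struct owner *root)
{
	if (s->cached_obj)				/* load 1: test */
		return s->cached_obj->owner == root;	/* load 2: may now be NULL */
	return false;
}

/* Fixed shape: snapshot the pointer once, then test and use the snapshot. */
static bool flush_required(struct stock *s, const struct owner *root)
{
	struct obj *obj = READ_ONCE_SKETCH(s->cached_obj);

	return obj && s->nr_bytes && obj->owner == root;
}
```

Single-threaded, both helpers agree; the difference only matters when
another thread clears cached_obj between the two loads of the racy version.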