On Wed, Oct 12, 2022 at 11:49:20AM -0700, Roman Gushchin wrote:
> On Wed, Oct 12, 2022 at 01:23:11PM -0400, Johannes Weiner wrote:
> > On Tue, Oct 04, 2022 at 09:18:26AM -0700, Roman Gushchin wrote:
> > > On Mon, Oct 03, 2022 at 06:01:35PM +0300, Alexander Fedorov wrote:
> > > > On 03.10.2022 17:27, Michal Hocko wrote:
> > > > > On Mon 03-10-22 17:09:15, Alexander Fedorov wrote:
> > > > >> On 03.10.2022 16:32, Michal Hocko wrote:
> > > > >>> On Mon 03-10-22 15:47:10, Alexander Fedorov wrote:
> > > > >>>> @@ -3197,17 +3197,30 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> > > > >>>>  		stock->nr_bytes = 0;
> > > > >>>>  	}
> > > > >>>>
> > > > >>>> -	obj_cgroup_put(old);
> > > > >>>> +	/*
> > > > >>>> +	 * Clear pointer before freeing memory so that
> > > > >>>> +	 * drain_all_stock() -> obj_stock_flush_required()
> > > > >>>> +	 * does not see a freed pointer.
> > > > >>>> +	 */
> > > > >>>>  	stock->cached_objcg = NULL;
> > > > >>>> +	obj_cgroup_put(old);
> > > > >>>
> > > > >>> Do we need barrier() or something else to ensure there is no reordering?
> > > > >>> I am not really sure what kind of barriers are implied by the pcp ref
> > > > >>> counting.
> > > > >>
> > > > >> obj_cgroup_put() -> kfree_rcu() -> synchronize_rcu() should take care
> > > > >> of this:
> > > > >
> > > > > This is a very subtle guarantee. Also it would only apply if this is the
> > > > > last reference, right?
> > > >
> > > > Hmm, yes, for the last reference only, also not sure about pcp ref
> > > > counter ordering rules for previous references.
> > > >
> > > > > Is there any reason to not use
> > > > > 	WRITE_ONCE(stock->cached_objcg, NULL);
> > > > > 	obj_cgroup_put(old);
> > > > >
> > > > > IIRC this should prevent any reordering.
> > > >
> > > > Now that I think about it we actually must use WRITE_ONCE everywhere
> > > > when writing cached_objcg because otherwise compiler might split the
> > > > pointer-sized store into several smaller-sized ones (store tearing),
> > > > and obj_stock_flush_required() would read garbage instead of pointer.
> > > >
> > > > And thinking about memory barriers, maybe we need them too alongside
> > > > WRITE_ONCE when setting pointer to non-null value? Otherwise
> > > > drain_all_stock() -> obj_stock_flush_required() might read old data.
> > > > Since that's exactly what rcu_assign_pointer() does, it seems
> > > > that we are going back to using rcu_*() primitives everywhere?
> > >
> > > Hm, Idk, I'm still somewhat resistant to the idea of putting rcu primitives,
> > > but maybe it's the right thing. Maybe instead we should always schedule draining
> > > on all cpus instead and perform a cpu-local check and bail out if a flush is not
> > > required? Michal, Johannes, what do you think?
> >
> > I agree it's overkill.
> >
> > This is a speculative check, and we don't need any state coherency,
> > just basic lifetime. READ_ONCE should fully address this problem. That
> > said, I think the code could be a bit clearer and better documented.
> >
> > How about the below?
>
> I'm fine with using READ_ONCE() to fix this immediate issue (I suggested it
> in the thread above), please feel free to add my ack:
> Acked-by: Roman Gushchin <roman.gushchin@xxxxxxxxx> .

Thanks!

> We might need a barrier() between zeroing stock->cached and dropping the last
> reference, as discussed above, however I don't think this issue can be
> realistically triggered in the real life.

Hm, plus the load tearing.

We can do WRITE_ONCE() just for ->cached and ->cached_objcg. That will
take care of both: load tearing, as well as the compile-time order
with the RCU free call. RCU will then handle the SMP effects.
I still prefer it over rcuifying the pointers completely just for that
one (questionable) optimization. Updated patch below.

> However I think our overall approach to flushing is questionable:
> 1) we often don't flush when it's necessary: if there is a concurrent flushing
> we just bail out, even if that flushing is related to a completely different
> part of the cgroup tree (e.g. a leaf node belonging to a distant branch).

Right.

> 2) we can race and flush when it's not necessary: if another cpu is busy,
> likely by the time when work will be executed there will be already another
> memcg cached. So IMO we need to move this check into the flushing thread.

We might just be able to remove all the speculative checks.
drain_all_stock() is slowpath after all...

> I'm working on a different approach, but it will take time and also likely be
> too invasive for @stable, so fixing the crash discovered by Alexander with
> READ_ONCE() is a good idea.

Sounds good, I'm looking forward to those changes.

---

From c9b940db5f75160b5e80c4ae83ea760ad29e8ef9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Wed, 12 Oct 2022 12:59:07 -0400
Subject: [PATCH] mm: memcontrol: fix NULL deref race condition during cgroup deletion

Alexander Fedorov reports a race condition between two concurrent
stock draining operations, where the first one clears the stock's obj
pointer between the pointer test and deref of the second. Analysis:

1) First CPU:
   css_killed_work_fn() -> mem_cgroup_css_offline() ->
   drain_all_stock() -> obj_stock_flush_required()
	if (stock->cached_objcg) {

   This check sees a non-NULL pointer for *another* CPU's `memcg_stock`
   instance.

2) Second CPU:
   css_free_rwork_fn() -> __mem_cgroup_free() -> free_percpu() ->
   obj_cgroup_uncharge() -> drain_obj_stock()
   It frees `cached_objcg` pointer in its own `memcg_stock` instance:
	struct obj_cgroup *old = stock->cached_objcg;
	< ... >
	obj_cgroup_put(old);
	stock->cached_objcg = NULL;

3) First CPU continues after the 'if' check and re-reads the pointer
   again, now it is NULL and dereferencing it leads to kernel panic:
	static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
					     struct mem_cgroup *root_memcg)
	{
		< ... >
		if (stock->cached_objcg) {
			memcg = obj_cgroup_memcg(stock->cached_objcg);

There is already RCU protection in place to ensure lifetime. Add the
missing READ_ONCE to the cgroup pointers to fix the TOCTOU, and
consolidate and document the speculative code.

Reported-by: Alexander Fedorov <halcien@xxxxxxxxx>
Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Acked-by: Roman Gushchin <roman.gushchin@xxxxxxxxx>
---
 mm/memcontrol.c | 54 +++++++++++++++++++++++--------------------------
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..4357dadae95d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2190,8 +2190,6 @@ static DEFINE_MUTEX(percpu_charge_mutex);

 #ifdef CONFIG_MEMCG_KMEM
 static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg);
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);

 #else
@@ -2199,11 +2197,6 @@ static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	return NULL;
 }
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg)
-{
-	return false;
-}
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
 {
 }
@@ -2259,8 +2252,8 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 		stock->nr_pages = 0;
 	}

+	WRITE_ONCE(stock->cached, NULL);
 	css_put(&old->css);
-	stock->cached = NULL;
 }

 static void drain_local_stock(struct work_struct *dummy)
@@ -2298,7 +2291,7 @@ static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
 		css_get(&memcg->css);
-		stock->cached = memcg;
+		WRITE_ONCE(stock->cached, memcg);
 	}
 	stock->nr_pages += nr_pages;

@@ -2339,13 +2332,30 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		struct mem_cgroup *memcg;
 		bool flush = false;

+		/*
+		 * Speculatively check up front if this CPU has any
+		 * cached charges that belong to the specified
+		 * root_memcg. The state may change from under us -
+		 * which is okay, because the draining itself is a
+		 * best-effort operation. Just ensure lifetime of
+		 * whatever we end up looking at.
+		 */
 		rcu_read_lock();
-		memcg = stock->cached;
+		memcg = READ_ONCE(stock->cached);
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
-		else if (obj_stock_flush_required(stock, root_memcg))
-			flush = true;
+#ifdef CONFIG_MEMCG_KMEM
+		else {
+			struct obj_cgroup *objcg;
+
+			objcg = READ_ONCE(stock->cached_objcg);
+			if (objcg && stock->nr_bytes &&
+			    mem_cgroup_is_descendant(obj_cgroup_memcg(objcg),
+						     root_memcg))
+				flush = true;
+		}
+#endif
 		rcu_read_unlock();

 		if (flush &&
@@ -3170,7 +3180,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
-		stock->cached_objcg = objcg;
+		WRITE_ONCE(stock->cached_objcg, objcg);
 		stock->cached_pgdat = pgdat;
 	} else if (stock->cached_pgdat != pgdat) {
 		/* Flush the existing cached vmstat data */
@@ -3289,7 +3299,7 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->cached_pgdat = NULL;
 	}

-	stock->cached_objcg = NULL;
+	WRITE_ONCE(stock->cached_objcg, NULL);
 	/*
 	 * The `old' objects needs to be released by the caller via
 	 * obj_cgroup_put() outside of memcg_stock_pcp::stock_lock.
@@ -3297,20 +3307,6 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 	return old;
 }

-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
-				     struct mem_cgroup *root_memcg)
-{
-	struct mem_cgroup *memcg;
-
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
-		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
-			return true;
-	}
-
-	return false;
-}
-
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
@@ -3325,7 +3321,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		old = drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
-		stock->cached_objcg = objcg;
+		WRITE_ONCE(stock->cached_objcg, objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
 		allow_uncharge = true;	/* Allow uncharge when objcg changes */
-- 
2.37.3
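
For readers following the analysis in the commit message, the
single-snapshot pattern the patch applies can be sketched in userspace C.
This is a simplified illustration under stated assumptions, not kernel
code: the demo_* types and functions are hypothetical, READ_ONCE() and
WRITE_ONCE() are re-created here as plain volatile accesses (assuming GCC
or Clang for __typeof__), and the RCU lifetime protection the patch relies
on is not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-ins for the kernel macros (assumption: GCC/Clang).
 * A volatile access asks the compiler to perform the load or store
 * once, as written, instead of re-reading or splitting it.
 */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical, heavily simplified types; not the kernel definitions. */
struct demo_memcg { int id; };
struct demo_objcg { struct demo_memcg *memcg; };
struct demo_stock {
	struct demo_objcg *cached_objcg;
	unsigned int nr_bytes;
};

/*
 * Reader side: take one snapshot of the pointer, then test and
 * dereference only the snapshot. The racy shape tested
 * stock->cached_objcg and then loaded it a second time.
 */
static bool demo_flush_required(struct demo_stock *stock,
				struct demo_memcg *root)
{
	struct demo_objcg *objcg = READ_ONCE(stock->cached_objcg);

	return objcg && stock->nr_bytes && objcg->memcg == root;
}

/*
 * Writer side: clear the cached pointer with a single store and hand
 * the old object back to the caller, which releases it afterwards.
 */
static struct demo_objcg *demo_drain(struct demo_stock *stock)
{
	struct demo_objcg *old;

	assert(stock != NULL);
	old = stock->cached_objcg;
	stock->nr_bytes = 0;
	WRITE_ONCE(stock->cached_objcg, NULL);
	return old;
}
```

With a stock whose cached_objcg belongs to the root group and whose
nr_bytes is non-zero, demo_flush_required() returns true. After
demo_drain() it returns false, and a concurrent reader that already took
its snapshot never dereferences the cleared field, because it only uses
its local copy.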