Re: [PATCH] mm/slub: add missing TID updates on slab deactivation

Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx> · Mon, 13 Jun 2022 21:49:07 +0900

On Wed, Jun 08, 2022 at 08:22:05PM +0200, Jann Horn wrote:
> The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
> the TID stays the same. However, two places in __slab_alloc() currently
> don't update the TID when deactivating the CPU slab.
> 
> If multiple operations race the right way, this could lead to an object
> getting lost; or, in an even more unlikely situation, it could even lead to
> an object being freed onto the wrong slab's freelist, messing up the
> `inuse` counter and eventually causing a page to be freed to the page
> allocator while it still contains slab objects.
> 
> (I haven't actually tested these cases though, this is just based on
> looking at the code. Writing testcases for this stuff seems like it'd be
> a pain...)
> 
> The race leading to state inconsistency is (all operations on the same CPU
> and kmem_cache):
> 
>  - task A: begin do_slab_free():
>     - read TID
>     - read pcpu freelist (==NULL)
>     - check `slab == c->slab` (true)
>  - [PREEMPT A->B]
>  - task B: begin slab_alloc_node():
>     - fastpath fails (`c->freelist` is NULL)
>     - enter __slab_alloc()
>     - slub_get_cpu_ptr() (disables preemption)
>     - enter ___slab_alloc()
>     - take local_lock_irqsave()
>     - read c->freelist as NULL
>     - get_freelist() returns NULL
>     - write `c->slab = NULL`
>     - drop local_unlock_irqrestore()
>     - goto new_slab
>     - slub_percpu_partial() is NULL
>     - get_partial() returns NULL
>     - slub_put_cpu_ptr() (enables preemption)
>  - [PREEMPT B->A]
>  - task A: finish do_slab_free():
>     - this_cpu_cmpxchg_double() succeeds()
>     - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]

I can see this happening (!c->slab && c->freelist becoming true)
when I synthetically add scheduling points in the code:

diff --git a/mm/slub.c b/mm/slub.c
index b97fa5e21046..b8012fdf2607 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3001,6 +3001,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto check_new_slab;

 	slub_put_cpu_ptr(s->cpu_slab);
+
+	if (!in_atomic())
+		schedule();
+
 	slab = new_slab(s, gfpflags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);

@@ -3456,9 +3460,13 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	if (likely(slab == c->slab)) {
 #ifndef CONFIG_PREEMPT_RT
 		void **freelist = READ_ONCE(c->freelist);
+		unsigned long flags;

 		set_freepointer(s, tail_obj, freelist);

+		if (!in_atomic())
+			schedule();
+
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				freelist, tid,
@@ -3467,6 +3475,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
+
+		local_irq_save(flags);
+		WARN_ON(!READ_ONCE(c->slab) && READ_ONCE(c->freelist));
+		local_irq_restore(flags);
 #else /* CONFIG_PREEMPT_RT */
 		/*
 		 * We cannot use the lockless fastpath on PREEMPT_RT because if


> From there, the object on c->freelist will get lost if task B is allowed to
> continue from here: It will proceed to the retry_load_slab label,
> set c->slab, then jump to load_freelist, which clobbers c->freelist.
>
> But if we instead continue as follows, we get worse corruption:
> 
>  - task A: run __slab_free() on object from other struct slab:
>     - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
>  - task A: run slab_alloc_node() with NUMA node constraint:
>     - fastpath fails (c->slab is NULL)
>     - call __slab_alloc()
>     - slub_get_cpu_ptr() (disables preemption)
>     - enter ___slab_alloc()
>     - c->slab is NULL: goto new_slab
>     - slub_percpu_partial() is non-NULL
>     - set c->slab to slub_percpu_partial(c)
>     - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
>       from slab-2]
>     - goto redo
>     - node_match() fails
>     - goto deactivate_slab
>     - existing c->freelist is passed into deactivate_slab()
>     - inuse count of slab-1 is decremented to account for object from
>       slab-2

I didn't try to reproduce this -- but I agree SLUB can be fooled
by the condition (!c->slab && c->freelist).

> At this point, the inuse count of slab-1 is 1 lower than it should be.
> This means that if we free all allocated objects in slab-1 except for one,
> SLUB will think that slab-1 is completely unused, and may free its page,
> leading to use-after-free.
> 
> Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
> Fixes: 03e404af26dc2 ("slub: fast release on full slab")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
> ---
>  mm/slub.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index e5535020e0fdf..b97fa5e210469 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2936,6 +2936,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  
>  	if (!freelist) {
>  		c->slab = NULL;
> +		c->tid = next_tid(c->tid);
>  		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>  		stat(s, DEACTIVATE_BYPASS);
>  		goto new_slab;
> @@ -2968,6 +2969,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	freelist = c->freelist;
>  	c->slab = NULL;
>  	c->freelist = NULL;
> +	c->tid = next_tid(c->tid);
>  	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>  	deactivate_slab(s, slab, freelist);
>  
> 
> base-commit: 9886142c7a2226439c1e3f7d9b69f9c7094c3ef6
> -- 
> 2.36.1.476.g0c4daa206d-goog

With this patch I couldn't reproduce it.
This work is really nice. Thanks!

Tested-by: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>

BTW I wonder how much this race will affect machines in the real world.
Maybe just rare and undetectable memory leak?

-- 
Thanks,
Hyeonggon