On Wed, Jul 27, 2022 at 08:29:09PM +0200, Vlastimil Babka wrote:
> Rongwei Wang reports [1] that cache validation triggered by writing to
> /sys/kernel/slab/<cache>/validate is racy against normal cache
> operations (e.g. freeing) in a way that can cause false positive
> inconsistency reports for caches with debugging enabled. The problem is
> that debugging actions that mark object free or active and actual
> freelist operations are not atomic, and the validation can see an
> inconsistent state.
> 
> For caches that do or don't have debugging enabled, additional races
> regarding n->nr_slabs are possible that result in false reports of wrong
> slab counts.
> 
> This patch attempts to solve these issues while not adding overhead to
> normal (especially fastpath) operations for caches that do not have
> debugging enabled, just to make possible userspace-triggered validation
> safe. Instead, disable the validation for caches that don't have
> debugging enabled and make the sysfs handler return -EINVAL.
> 
> For caches that do have debugging enabled, we can instead extend the
> existing approach of not using percpu freelists to force all operations
> to the slow paths where debugging is checked for and processed.
> 
> The processing on free in free_debug_processing() already happens under
> n->list_lock and slab_lock() so we can extend it to actually do the
> freeing as well and thus make it atomic against concurrent validation.
> 
> The processing on alloc in alloc_debug_processing() currently doesn't
> take any locks, but we have to first allocate the object from a slab on
> the partial list (as percpu slabs are always non-existent) and thus take
> the n->list_lock anyway. Add a function alloc_single_from_partial() that
> additionally takes slab_lock() for the debug processing and then grabs
> just the allocated object instead of the whole freelist. This again
> makes it atomic against validation and it is also ultimately more
> efficient than the current grabbing of freelist immediately followed by
> slab deactivation.
> 
> To prevent races on n->nr_slabs, make sure that for caches with
> debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
> n->list_lock. When allocating a new slab for a debug cache, handle the
> allocation by a new function alloc_single_from_new_slab() instead of the
> current forced deactivation path.
> 
> Neither of these changes affect the fast paths.
> 
> The function free_debug_processing() is moved so that it is placed
> later than the definitions of add_partial(), remove_partial() and
> discard_slab(), to avoid a need for forward declarations.
> 
> [1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@xxxxxxxxxxxxxxxxx/
> 
> Reported-by: Rongwei Wang <rongwei.wang@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> ---
> Hi, this extends the pre-RFC from [1] to cover also racy n->nr_slab updates
> and hopefully thus addresses everything that Rongwei's series did, and
> testing will show that?
> Thanks, Vlastimil
> 

I don't care whose patch to ACK. Maybe Rongwei will post his own patch?
Anyway, this patch overall looks good.

Also, all issues (as far as I know) related to the validate attribute are
gone after this patch.

Silly question: do we want to apply this on stable trees?
I doubt anyone would use the validate attribute when not debugging.
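
To restate my reading of the free path change described above: for debug
caches, both the consistency checks and the actual freelist update now happen
inside a single n->list_lock + slab_lock() section, so validate_slab_cache()
(which takes the same locks) can no longer observe a half-updated slab. Very
roughly, the slow free path ends up shaped like the sketch below -- the
function name and the simplified body are mine, not the patch's code:

/*
 * Illustration only, not the patch code. The point is that the checks and
 * the freelist/counter update share one locked section.
 */
static void debug_cache_free(struct kmem_cache *s, struct slab *slab,
			     void *head, void *tail, int cnt,
			     unsigned long addr)
{
	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
	unsigned long flags, flags2;

	spin_lock_irqsave(&n->list_lock, flags);
	slab_lock(slab, &flags2);

	/* the real code checks every object in the head..tail list */
	if (free_consistency_checks(s, slab, head, addr)) {
		/* mark the objects free and put them back on the slab */
		set_freepointer(s, tail, slab->freelist);
		slab->freelist = head;
		slab->inuse -= cnt;
		/* full <-> partial <-> empty list moves also happen here */
	}

	slab_unlock(slab, &flags2);
	spin_unlock_irqrestore(&n->list_lock, flags);
}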
> [1] https://lore.kernel.org/all/69462916-2d1c-dd50-2e64-b31c2b61690e@xxxxxxx/
> 
>  mm/slub.c | 322 +++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 231 insertions(+), 91 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b1281b8654bd..01e5228809d7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
>  	}

[...]

> +/*
> + * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
> + * slab from the n->partial list. Removes only a single object from the slab
> + * under slab_lock(), does the alloc_debug_processing() checks and leaves the
> + * slab on the list, or moves it to full list if it was the last object.
> + */
> +static void *alloc_single_from_partial(struct kmem_cache *s,
> +		struct kmem_cache_node *n, struct slab *slab)
> +{
> +	void *object;
> +	unsigned long flags;
> +
> +	lockdep_assert_held(&n->list_lock);
> +
> +	slab_lock(slab, &flags);
> +
> +	object = slab->freelist;
> +	slab->freelist = get_freepointer(s, object);
> +	slab->inuse++;
> +
> +	if (!alloc_debug_processing(s, slab, object)) {
> +		remove_partial(n, slab);
> +		slab_unlock(slab, &flags);
> +		return NULL;
> +	}
> +
> +	if (slab->inuse == slab->objects) {
> +		remove_partial(n, slab);
> +		add_full(s, n, slab);
> +	}
> +
> +	slab_unlock(slab, &flags);

AFAIK add_full/remove_full/add_partial/remove_partial can be called outside
slab_lock but inside list_lock (see the sketch after the quoted code below).

> +	return object;
> +}
> +
> +/*
> + * Called only for kmem_cache_debug() caches to allocate from a freshly
> + * allocated slab. Allocates a single object instead of whole freelist
> + * and puts the slab to the partial (or full) list.
> + */
> +static void *alloc_single_from_new_slab(struct kmem_cache *s,
> +					struct slab *slab)
> +{
> +	int nid = slab_nid(slab);
> +	struct kmem_cache_node *n = get_node(s, nid);
> +	unsigned long flags, flags2;
> +	void *object;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	slab_lock(slab, &flags2);
> +
> +	object = slab->freelist;
> +	slab->freelist = get_freepointer(s, object);
> +	/* Undo what allocate_slab() did */
> +	slab->frozen = 0;
> +	slab->inuse = 1;

Maybe do it in allocate_slab()?

> +	if (!alloc_debug_processing(s, slab, object)) {
> +		/*
> +		 * It's not really expected that this would fail on a
> +		 * freshly allocated slab, but a concurrent memory
> +		 * corruption in theory could cause that.
> +		 */
> +		slab_unlock(slab, &flags2);
> +		spin_unlock_irqrestore(&n->list_lock, flags);
> +		return NULL;
> +	}
> +
> +	if (slab->inuse == slab->objects)
> +		add_full(s, n, slab);
> +	else
> +		add_partial(n, slab, DEACTIVATE_TO_HEAD);
> +
> +	slab_unlock(slab, &flags2);
> +	inc_slabs_node(s, nid, slab->objects);
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +	return object;
> +}

[...]
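
Expanding on the add_full()/remove_partial() comment above, the sketch below
is roughly what I had in mind for the quoted alloc_single_from_partial(): the
list manipulation moved out of the slab_lock() section, relying only on the
caller's n->list_lock. Completely untested, just to illustrate the idea:

static void *alloc_single_from_partial(struct kmem_cache *s,
		struct kmem_cache_node *n, struct slab *slab)
{
	void *object;
	unsigned long flags;

	lockdep_assert_held(&n->list_lock);

	slab_lock(slab, &flags);

	object = slab->freelist;
	slab->freelist = get_freepointer(s, object);
	slab->inuse++;

	if (!alloc_debug_processing(s, slab, object)) {
		slab_unlock(slab, &flags);
		/* partial list is protected by n->list_lock alone */
		remove_partial(n, slab);
		return NULL;
	}

	slab_unlock(slab, &flags);

	/*
	 * With this patch every alloc/free for a debug cache takes
	 * n->list_lock, which we hold, so slab->inuse should be stable
	 * here even without slab_lock().
	 */
	if (slab->inuse == slab->objects) {
		remove_partial(n, slab);
		add_full(s, n, slab);
	}

	return object;
}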
>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
> @@ -3036,6 +3165,20 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  		return NULL;
>  	}
>  
> +	stat(s, ALLOC_SLAB);
> +
> +	if (kmem_cache_debug(s)) {
> +		freelist = alloc_single_from_new_slab(s, slab);
> +
> +		if (unlikely(!freelist))
> +			goto new_objects;
> +
> +		if (s->flags & SLAB_STORE_USER)
> +			set_track(s, freelist, TRACK_ALLOC, addr);
> +
> +		return freelist;
> +	}
> +
>  	/*
>  	 * No other reference to the slab yet so we can
>  	 * muck around with it freely without cmpxchg
> @@ -3043,29 +3186,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	freelist = slab->freelist;
>  	slab->freelist = NULL;
>  
> -	stat(s, ALLOC_SLAB);
> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
>  
>  check_new_slab:
>  
>  	if (kmem_cache_debug(s)) {
> -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
> -			/* Slab failed checks. Next slab needed */
> -			goto new_slab;
> -		} else {
> -			/*
> -			 * For debug case, we don't load freelist so that all
> -			 * allocations go through alloc_debug_processing()
> -			 */
> -			goto return_single;
> -		}
> +		/*
> +		 * For debug caches here we had to go through
> +		 * alloc_single_from_partial() so just store the tracking info
> +		 * and return the object
> +		 */
> +		if (s->flags & SLAB_STORE_USER)
> +			set_track(s, freelist, TRACK_ALLOC, addr);
> +		return freelist;
>  	}
>  
> -	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
> +	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
>  		/*
>  		 * For !pfmemalloc_match() case we don't load freelist so that
>  		 * we don't make further mismatched allocations easier.
>  		 */
> -		goto return_single;
> +		deactivate_slab(s, slab, get_freepointer(s, freelist));
> +		return freelist;
> +	}
>  
>  retry_load_slab:
>  
> @@ -3089,11 +3232,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	c->slab = slab;
>  
>  	goto load_freelist;
> -
> -return_single:
> -
> -	deactivate_slab(s, slab, get_freepointer(s, freelist));
> -	return freelist;
>  }
>  
>  /*
> @@ -3341,9 +3479,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	if (kfence_free(head))
>  		return;
>  
> -	if (kmem_cache_debug(s) &&
> -	    !free_debug_processing(s, slab, head, tail, cnt, addr))
> +	if (kmem_cache_debug(s)) {
> +		free_debug_processing(s, slab, head, tail, cnt, addr);
>  		return;
> +	}

Oh, now debugging caches do not share the free path with non-debugging
caches. Now free_debug_processing()'s return type can be void?

> 
>  	do {
>  		if (unlikely(n)) {
> @@ -3958,6 +4097,7 @@ static void early_kmem_cache_node_alloc(int node)
>  	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
>  
>  	BUG_ON(!slab);
> +	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
>  	if (slab_nid(slab) != node) {
>  		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
>  		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
> @@ -5625,7 +5765,7 @@ static ssize_t validate_store(struct kmem_cache *s,
>  {
>  	int ret = -EINVAL;
>  
> -	if (buf[0] == '1') {
> +	if (buf[0] == '1' && kmem_cache_debug(s)) {
>  		ret = validate_slab_cache(s);
>  		if (ret >= 0)
>  			ret = length;

Yeah, definitely this is what it should be, instead of serializing
inc_slabs_node()/dec_slabs_node() for non-debugging caches.

> -- 
> 2.37.1
> 

-- 
Thanks,
Hyeonggon