On Thu, 2012-04-26 at 14:57 -0400, Waiman Long wrote:
> The SLUB memory allocator was changed substantially from 3.0 to 3.1 by
> replacing some of the page locking code for updating the free object list
> of the slab with a double-quadword atomic exchange (cmpxchg_double_slab),
> or a pseudo one using a page lock when debugging is turned on. In the
> normal case, that should be enough to make sure that the slab is in a
> consistent state. However, when CONFIG_SLUB_DEBUG is turned on and the
> Redzone debugging flag is set, the Redzone bytes are also used to mark
> whether an object is free or allocated. The extra state information in
> those Redzone bytes is not protected by cmpxchg_double_slab(). As a
> result, validate_slab() may report a Redzone error if the validation is
> performed while racing with a free to a debugged slab.
>
> The problem was reported in
>
> https://bugzilla.kernel.org/show_bug.cgi?id=42312
>
> It is fairly easy to reproduce by passing in the kernel parameter
> "slub_debug=FZPU". After booting, run the following command (as root):
>
>	while true ; do ./slabinfo -v ; sleep 3 ; done
>
> The slabinfo test code can be found in tools/vm/slabinfo.c.
>
> At the same time, load the system with heavy I/O activity by, for
> example, building the Linux kernel. dmesg messages like the following
> will then be reported:
>
>	BUG names_cache: Redzone overwritten
>	SLUB: names_cache 3 slabs counted but counter=4
>
> This patch fixes the BUG message by acquiring the node-level lock for
> slabs flagged for debugging to avoid this possible race condition.
> The locking is done on the node-level lock instead of the more granular
> page lock because the new code may speculatively acquire the node-level
> lock later on. Acquiring the page lock and then the node lock may lead
> to a potential deadlock.
>
> As the increment of the slab node count and the insertion of the new slab
> into the partial or full slab list is not an atomic operation, there is a
> small time window where the two may not match. This patch temporarily
> works around the problem by allowing the node count to be one larger
> than the number of slabs present in the lists. This workaround may not
> work if more than one CPU is actively adding slabs to the same node,
> but it should be good enough to work around the problem in most cases.
>
> To really fix the issue, the overall synchronization between debug slub
> operations and slub validation needs to be revisited.
>
> This patch also fixes a number of "code indent should use tabs where
> possible" errors reported by checkpatch.pl in the __slab_free() function
> by replacing groups of 8 spaces with real tabs.
>
> After applying the patch, the slub errors and warnings are all gone on
> the 4-CPU x86-64 test machine.
>
> Signed-off-by: Waiman Long <waiman.long@xxxxxx>
> Reviewed-by: Don Morris <don.morris@xxxxxx>
> ---
>  mm/slub.c |   46 +++++++++++++++++++++++++++++++++-------------
>  1 files changed, 33 insertions(+), 13 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index ffe13fd..4ca3140 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2445,8 +2445,18 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>
>  	stat(s, FREE_SLOWPATH);
>
> -	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
> -		return;
> +	if (kmem_cache_debug(s)) {
> +		/*
> +		 * We need to acquire the node lock to prevent spurious error
> +		 * with validate_slab().
> +		 */
> +		n = get_node(s, page_to_nid(page));
> +		spin_lock_irqsave(&n->list_lock, flags);
> +		if (!free_debug_processing(s, page, x, addr)) {
> +			spin_unlock_irqrestore(&n->list_lock, flags);
> +			return;
> +		}

Missing a spin_unlock_irqrestore(&n->list_lock, flags); here?
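If the intent is not to keep holding n->list_lock into the cmpxchg loop
below (the "Needs to be taken off a list" path re-takes that same lock via
get_node() + spin_lock_irqsave(), and the was_frozen path returns without
touching it), I would expect the debug branch to end up looking roughly
like the sketch below. This is only an illustration of where the unlock
would go, reusing the identifiers from the hunk above; I have not tested
it:

	if (kmem_cache_debug(s)) {
		/*
		 * Take the node lock so validate_slab() cannot see the
		 * object while its debug/Redzone state is being updated.
		 */
		n = get_node(s, page_to_nid(page));
		spin_lock_irqsave(&n->list_lock, flags);
		if (!free_debug_processing(s, page, x, addr)) {
			/* Debug checks failed: unlock and bail out. */
			spin_unlock_irqrestore(&n->list_lock, flags);
			return;
		}
		/* Debug checks passed: drop the lock before the slow path. */
		spin_unlock_irqrestore(&n->list_lock, flags);
	}

If holding the lock all the way through __slab_free() is intentional
instead, then the later lock acquisition and the early returns below would
need to be reworked to match.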
> +	}
>
>  	do {
>  		prior = page->freelist;
> @@ -2467,7 +2477,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>
>  		else { /* Needs to be taken off a list */
>
> -                        n = get_node(s, page_to_nid(page));
> +			n = get_node(s, page_to_nid(page));
>  			/*
>  			 * Speculatively acquire the list_lock.
>  			 * If the cmpxchg does not succeed then we may
> @@ -2501,10 +2511,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  		 * The list lock was not taken therefore no list
>  		 * activity can be necessary.
>  		 */
> -                if (was_frozen)
> -                        stat(s, FREE_FROZEN);
> -                return;
> -        }
> +		if (was_frozen)
> +			stat(s, FREE_FROZEN);
> +		return;
> +	}
>
>  	/*
>  	 * was_frozen may have been set after we acquired the list_lock in
> @@ -2514,7 +2524,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  		stat(s, FREE_FROZEN);
>  	else {
>  		if (unlikely(!inuse && n->nr_partial > s->min_partial))
> -                        goto slab_empty;
> +			goto slab_empty;
>
>  		/*
>  		 * Objects left in the slab. If it was not on the partial list before
> @@ -4122,7 +4132,7 @@ static void validate_slab_slab(struct kmem_cache *s, struct page *page,
>  static int validate_slab_node(struct kmem_cache *s,
>  		struct kmem_cache_node *n, unsigned long *map)
>  {
> -	unsigned long count = 0;
> +	unsigned long count = 0, n_count;
>  	struct page *page;
>  	unsigned long flags;
>
> @@ -4143,10 +4153,20 @@ static int validate_slab_node(struct kmem_cache *s,
>  		validate_slab_slab(s, page, map);
>  		count++;
>  	}
> -	if (count != atomic_long_read(&n->nr_slabs))
> -		printk(KERN_ERR "SLUB: %s %ld slabs counted but "
> -			"counter=%ld\n", s->name, count,
> -			atomic_long_read(&n->nr_slabs));
> +	n_count = atomic_long_read(&n->nr_slabs);
> +	/*
> +	 * The following workaround is to greatly reduce the chance of counter
> +	 * mismatch messages due to the fact that inc_slabs_node() and the
> +	 * subsequent insertion into the partial or full slab list is not
> +	 * atomic. Consequently, there is a small timing window when the two
> +	 * are not in the same state. A possible fix is to take the node lock
> +	 * while doing inc_slabs_node() and slab insertion, but that may
> +	 * require substantial changes to existing slow path slab allocation
> +	 * logic.
> +	 */
> +	if ((count != n_count) && (count + 1 != n_count))
> +		printk(KERN_ERR "SLUB: %s %ld slabs counted but counter=%ld\n",
> +			s->name, count, n_count);
>
>  out:
>  	spin_unlock_irqrestore(&n->list_lock, flags);
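One more note on the validate_slab_node() change: as I read the changelog,
the "count + 1" tolerance only papers over a window like the one sketched
below on the allocation side. This is a simplified illustration of the
ordering the comment describes, not the literal slow-path code;
inc_slabs_node(), add_partial() and validate_slab_node() are the existing
mm/slub.c helpers, the rest is illustrative:

	/* Step 1: bump the per-node counter when the new slab is set up. */
	inc_slabs_node(s, page_to_nid(page), page->objects);

	/*
	 * Window: validate_slab_node() can run here. It only walks
	 * n->partial (and n->full for SLAB_STORE_USER caches), so its
	 * count is one less than n->nr_slabs until step 2 happens.
	 */

	/* Step 2: the slab becomes visible to the list walk only here. */
	n = get_node(s, page_to_nid(page));
	spin_lock_irqsave(&n->list_lock, flags);
	add_partial(n, page, DEACTIVATE_TO_HEAD);
	spin_unlock_irqrestore(&n->list_lock, flags);

So the workaround hides the common case of a single CPU sitting in that
window, but as the changelog already says, more than one CPU racing there
can still trigger the mismatch message.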