On Fri, Mar 08, 2019 at 03:14:20PM +1100, Tobin C. Harding wrote:
> If many objects are allocated with the slab allocator and freed in an
> arbitrary order then the slab caches can become internally fragmented.
> Now that the slab allocator supports movable objects we can defragment
> any cache that has this feature enabled.
>
> Slab defragmentation may occur:
>
> 1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
>    by the kernel calling kmem_cache_shrink().
>
> 2. Unconditionally through the use of the slabinfo command.
>
>    slabinfo <cache> -s
>
> 3. Conditionally via the use of kmem_cache_defrag()
>
> Use SMO when shrinking cache. Currently when the kernel calls
> kmem_cache_shrink() we curate the partial slabs list. If object
> migration is not enabled for the cache we still do this, if however SMO
> is enabled, we attempt to move objects in partially full slabs in order
> to defragment the cache. Shrink attempts to move all objects in order
> to reduce the cache to a single partial slab for each node.
>
> kmem_cache_defrag() differs from shrink in that it operates dependent on
> the defrag_used_ratio and only attempts to move objects if the number of
> partial slabs exceeds MAX_PARTIAL (for each node).
>
> Add function kmem_cache_defrag(int node).
>
> kmem_cache_defrag() only performs defragmentation if the usage ratio
> of the slab is lower than the configured percentage (sysfs file added
> in previous patch). Fragmentation ratios are measured by calculating
> the percentage of objects in use compared to the total number of
> objects that the slab page can accommodate.
>
> The scanning of slab caches is optimized because the defragmentable
> slabs come first on the list. Thus we can terminate scans on the
> first slab encountered that does not support defragmentation.
>
> kmem_cache_defrag() takes a node parameter. This can either be -1 if
> defragmentation should be performed on all nodes, or a node number.
>
> Defragmentation may be disabled by setting defrag ratio to 0
>
>    echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio
>
> In order for a cache to be defragmentable the cache must support object
> migration (SMO). Enabling SMO for a cache is done via a call to the
> recently added function:
>
>    void kmem_cache_setup_mobility(struct kmem_cache *,
>                                   kmem_cache_isolate_func,
>                                   kmem_cache_migrate_func);
>
> Co-developed-by: Christoph Lameter <cl@xxxxxxxxx>
> Signed-off-by: Tobin C. Harding <tobin@xxxxxxxxxx>
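
Just to check my understanding of the opt-in side before digging into the
diff (none of the code below is in this patch): a cache that wants to be
defragmentable would wire up SMO roughly like the sketch here. The callback
signatures are my assumption -- the migrate side is inferred from the
s->migrate(s, vector, count, node, private) call in __move() further down,
the isolate side is a guess -- so please read this as a sketch of intended
usage, not the actual API. The "foo" names are invented for the example.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

static struct kmem_cache *foo_cache;	/* hypothetical cache */

static void *foo_isolate(struct kmem_cache *s, void **objs, int nr)
{
	/*
	 * Pin objs[0..nr-1] so they cannot be freed while migration is in
	 * flight; whatever is returned here would be handed back to
	 * foo_migrate() as 'private'.
	 */
	return NULL;
}

static void foo_migrate(struct kmem_cache *s, void **objs, int nr,
			int node, void *private)
{
	/*
	 * Allocate replacement objects on @node, copy the contents,
	 * repoint any users, then free the old objects.
	 */
}

static int __init foo_init(void)
{
	foo_cache = kmem_cache_create("foo", 64, 0, 0, NULL);
	if (!foo_cache)
		return -ENOMEM;

	/* Opt in: the cache is now defragmentable. */
	kmem_cache_setup_mobility(foo_cache, foo_isolate, foo_migrate);
	return 0;
}

If that is the intended shape, it might be worth spelling it out in the
changelog or in a Documentation file.
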
> ---
>  include/linux/slab.h |   1 +
>  mm/slub.c            | 266 +++++++++++++++++++++++++++++++------------
>  2 files changed, 194 insertions(+), 73 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 22e87c41b8a4..b9b46bc9937e 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -147,6 +147,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
>  			void (*ctor)(void *));
>  void kmem_cache_destroy(struct kmem_cache *);
>  int kmem_cache_shrink(struct kmem_cache *);
> +int kmem_cache_defrag(int node);
>
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> diff --git a/mm/slub.c b/mm/slub.c
> index 515db0f36c55..53dd4cb5b5a4 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -354,6 +354,12 @@ static __always_inline void slab_lock(struct page *page)
>  	bit_spin_lock(PG_locked, &page->flags);
>  }
>
> +static __always_inline int slab_trylock(struct page *page)
> +{
> +	VM_BUG_ON_PAGE(PageTail(page), page);
> +	return bit_spin_trylock(PG_locked, &page->flags);
> +}
> +
>  static __always_inline void slab_unlock(struct page *page)
>  {
>  	VM_BUG_ON_PAGE(PageTail(page), page);
> @@ -3959,79 +3965,6 @@ void kfree(const void *x)
>  }
>  EXPORT_SYMBOL(kfree);
>
> -#define SHRINK_PROMOTE_MAX 32
> -
> -/*
> - * kmem_cache_shrink discards empty slabs and promotes the slabs filled
> - * up most to the head of the partial lists. New allocations will then
> - * fill those up and thus they can be removed from the partial lists.
> - *
> - * The slabs with the least items are placed last. This results in them
> - * being allocated from last increasing the chance that the last objects
> - * are freed in them.
> - */
> -int __kmem_cache_shrink(struct kmem_cache *s)
> -{
> -	int node;
> -	int i;
> -	struct kmem_cache_node *n;
> -	struct page *page;
> -	struct page *t;
> -	struct list_head discard;
> -	struct list_head promote[SHRINK_PROMOTE_MAX];
> -	unsigned long flags;
> -	int ret = 0;
> -
> -	flush_all(s);
> -	for_each_kmem_cache_node(s, node, n) {
> -		INIT_LIST_HEAD(&discard);
> -		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
> -			INIT_LIST_HEAD(promote + i);
> -
> -		spin_lock_irqsave(&n->list_lock, flags);
> -
> -		/*
> -		 * Build lists of slabs to discard or promote.
> -		 *
> -		 * Note that concurrent frees may occur while we hold the
> -		 * list_lock. page->inuse here is the upper limit.
> -		 */
> -		list_for_each_entry_safe(page, t, &n->partial, lru) {
> -			int free = page->objects - page->inuse;
> -
> -			/* Do not reread page->inuse */
> -			barrier();
> -
> -			/* We do not keep full slabs on the list */
> -			BUG_ON(free <= 0);
> -
> -			if (free == page->objects) {
> -				list_move(&page->lru, &discard);
> -				n->nr_partial--;
> -			} else if (free <= SHRINK_PROMOTE_MAX)
> -				list_move(&page->lru, promote + free - 1);
> -		}
> -
> -		/*
> -		 * Promote the slabs filled up most to the head of the
> -		 * partial list.
> -		 */
> -		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
> -			list_splice(promote + i, &n->partial);
> -
> -		spin_unlock_irqrestore(&n->list_lock, flags);
> -
> -		/* Release empty slabs */
> -		list_for_each_entry_safe(page, t, &discard, lru)
> -			discard_slab(s, page);
> -
> -		if (slabs_node(s, node))
> -			ret = 1;
> -	}
> -
> -	return ret;
> -}
> -
>  #ifdef CONFIG_MEMCG
>  static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
>  {
> @@ -4411,6 +4344,193 @@ static void __move(struct page *page, void *scratch, int node)
>  	s->migrate(s, vector, count, node, private);
>  }
>
> +/*
> + * __defrag() - Defragment node.
> + * @s: cache we are working on.
> + * @node: The node to move objects from.
> + * @target_node: The node to move objects to.
> + * @ratio: The defrag ratio (percentage, between 0 and 100).
> + *
> + * Release slabs with zero objects and try to call the migration function
> + * for slabs with less than the 'ratio' percentage of objects allocated.
> + *
> + * Moved objects are allocated on @target_node.
> + *
> + * Return: The number of partial slabs left on the node after the operation.
> + */
> +static unsigned long __defrag(struct kmem_cache *s, int node, int target_node,
> +			      int ratio)

Maybe kmem_cache_defrag_node()?

> +{
> +	struct kmem_cache_node *n = get_node(s, node);
> +	struct page *page, *page2;
> +	LIST_HEAD(move_list);
> +	unsigned long flags;
> +
> +	if (node == target_node && n->nr_partial <= 1) {
> +		/*
> +		 * Trying to reduce fragmentation on a node but there is
> +		 * only a single or no partial slab page. This is already
> +		 * the optimal object density that we can reach.
> +		 */
> +		return n->nr_partial;
> +	}
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	list_for_each_entry_safe(page, page2, &n->partial, lru) {
> +		if (!slab_trylock(page))
> +			/* Busy slab. Get out of the way */
> +			continue;
> +
> +		if (page->inuse) {
> +			if (page->inuse > ratio * page->objects / 100) {
> +				slab_unlock(page);
> +				/*
> +				 * Skip slab because the object density
> +				 * in the slab page is high enough.
> +				 */
> +				continue;
> +			}
> +
> +			list_move(&page->lru, &move_list);
> +			if (s->migrate) {
> +				/* Stop page being considered for allocations */
> +				n->nr_partial--;
> +				page->frozen = 1;
> +			}
> +			slab_unlock(page);
> +		} else { /* Empty slab page */
> +			list_del(&page->lru);
> +			n->nr_partial--;
> +			slab_unlock(page);
> +			discard_slab(s, page);
> +		}
> +	}
> +
> +	if (!s->migrate) {
> +		/*
> +		 * No defrag method. By simply putting the zaplist at the
> +		 * end of the partial list we can let them simmer longer
> +		 * and thus increase the chance of all objects being
> +		 * reclaimed.
> +		 *
> +		 */
> +		list_splice(&move_list, n->partial.prev);
> +	}
> +
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +	if (s->migrate && !list_empty(&move_list)) {
> +		void **scratch = alloc_scratch(s);
> +		struct page *page, *page2;
> +
> +		if (scratch) {
> +			/* Try to remove / move the objects left */
> +			list_for_each_entry(page, &move_list, lru) {
> +				if (page->inuse)
> +					__move(page, scratch, target_node);
> +			}
> +			kfree(scratch);
> +		}
> +
> +		/* Inspect results and dispose of pages */
> +		spin_lock_irqsave(&n->list_lock, flags);
> +		list_for_each_entry_safe(page, page2, &move_list, lru) {
> +			list_del(&page->lru);
> +			slab_lock(page);
> +			page->frozen = 0;
> +
> +			if (page->inuse) {
> +				/*
> +				 * Objects left in slab page, move it to the
> +				 * tail of the partial list to increase the
> +				 * chance that the freeing of the remaining
> +				 * objects will free the slab page.
> +				 */
> +				n->nr_partial++;
> +				list_add_tail(&page->lru, &n->partial);
> +				slab_unlock(page);
> +			} else {
> +				slab_unlock(page);
> +				discard_slab(s, page);
> +			}
> +		}
> +		spin_unlock_irqrestore(&n->list_lock, flags);
> +	}
> +
> +	return n->nr_partial;
> +}
> +
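
To make sure I'm reading the ratio check right, a concrete example: with
defrag_used_ratio = 30 and a cache whose slab pages hold 32 objects,
ratio * page->objects / 100 = 9, so

  - pages with 10 or more objects in use are skipped (dense enough),
  - pages with 1-9 objects in use are queued on move_list for migration,
  - completely empty pages are discarded on the spot.

So 100 means every partial slab qualifies for migration, and 0 means no
occupied slab ever qualifies (only empty ones get freed), which matches the
"disable" semantics in the changelog. If that reading is correct, one such
worked example in the defrag_used_ratio documentation would help.
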
> +/**
> + * kmem_cache_defrag() - Defrag slab caches.
> + * @node: The node to defrag or -1 for all nodes.
> + *
> + * Defrag slabs conditional on the amount of fragmentation in a page.
> + */
> +int kmem_cache_defrag(int node)
> +{
> +	struct kmem_cache *s;
> +	unsigned long left = 0;
> +
> +	/*
> +	 * kmem_cache_defrag may be called from the reclaim path which may be
> +	 * called for any page allocator alloc. So there is the danger that we
> +	 * get called in a situation where slub already acquired the slub_lock
> +	 * for other purposes.
> +	 */
> +	if (!mutex_trylock(&slab_mutex))
> +		return 0;
> +
> +	list_for_each_entry(s, &slab_caches, list) {
> +		/*
> +		 * Defragmentable caches come first. If the slab cache is not
> +		 * defragmentable then we can stop traversing the list.
> +		 */
> +		if (!s->migrate)
> +			break;
> +
> +		if (node == -1) {
> +			int nid;
> +
> +			for_each_node_state(nid, N_NORMAL_MEMORY)
> +				if (s->node[nid]->nr_partial > MAX_PARTIAL)
> +					left += __defrag(s, nid, nid, s->defrag_used_ratio);
> +		} else {
> +			if (s->node[node]->nr_partial > MAX_PARTIAL)
> +				left += __defrag(s, node, node, s->defrag_used_ratio);
> +		}
> +	}
> +	mutex_unlock(&slab_mutex);
> +	return left;
> +}
> +EXPORT_SYMBOL(kmem_cache_defrag);
> +
> +/**
> + * __kmem_cache_shrink() - Shrink a cache.
> + * @s: The cache to shrink.
> + *
> + * Reduces the memory footprint of a slab cache by as much as possible.
> + *
> + * This works by:
> + * 1. Removing empty slabs from the partial list.
> + * 2. Migrating slab objects to denser slab pages if the slab cache
> + *    supports migration. If not, reorganizing the partial list so that
> + *    more densely allocated slab pages come first.
> + *
> + * Not called directly, called by kmem_cache_shrink().
> + */
> +int __kmem_cache_shrink(struct kmem_cache *s)
> +{
> +	int node;
> +	int left = 0;

s/int/unsigned long? Or s/unsigned long/int in __defrag()?

Thanks!
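
One more thought, mostly to confirm the intended calling convention for the
conditional path (item 3 in the changelog). I imagine a caller in the
reclaim path would look something like the sketch below; the hook name is
made up, only kmem_cache_defrag() itself is from this patch. Note that a
return of 0 is ambiguous between "nothing left on the partial lists" and
"slab_mutex was contended so the pass was skipped" -- that may be fine, but
it is probably worth a line in the kernel-doc.

#include <linux/printk.h>
#include <linux/slab.h>

/* Hypothetical call site, not part of this series. */
static void example_defrag_pass(int nid)
{
	int left;	/* int to match the slab.h prototype, see above */

	/* Defrag caches on this node only; -1 would cover every node. */
	left = kmem_cache_defrag(nid);
	if (left)
		pr_debug("slab defrag: %d partial slabs remain on node %d\n",
			 left, nid);
}
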