The patch titled Subject: mm/zsmalloc: don't hold locks of all pages when free_zspage() has been added to the -mm mm-unstable branch. Its filename is mm-zsmalloc-dont-hold-locks-of-all-pages-when-free_zspage.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-zsmalloc-dont-hold-locks-of-all-pages-when-free_zspage.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx> Subject: mm/zsmalloc: don't hold locks of all pages when free_zspage() Date: Tue, 27 Feb 2024 03:02:54 +0000 Patch series "mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage()". free_zspage() has to hold locks of all pages, since zs_page_migrate() path rely on this page lock to protect the race between zs_free() and it, so it can safely get zspage from page->private. But this way is not good and simple enough: 1. Since zs_free() couldn't be sleepable, it can only trylock pages, or has to kick_deferred_free() to defer that to a work. 2. Even in the worker context, async_free_zspage() can't simply lock all pages in lock_zspage(), it's still trylock because of the race between zs_free() and zs_page_migrate(). Please see the commit 2505a981114d ("zsmalloc: fix races between asynchronous zspage free and page migration") for details. Actually, all free_zspage() needs is to get zspage from page safely, we can use RCU to achieve it easily. Then free_zspage() don't need to hold locks of all pages, so don't need the deferred free mechanism at all. This patchset implements it and remove all of deferred free related code. This patch (of 2): free_zspage() has to hold locks of all pages, since zs_page_migrate() path rely on this page lock to protect the race between zs_free() and it, so it can safely get zspage from page->private. But this way is not good and simple enough: 1. Since zs_free() couldn't be sleepable, it can only trylock pages, or has to kick_deferred_free() to defer that to a work. 2. Even in the worker context, async_free_zspage() can't simply lock all pages in lock_zspage(), it's still trylock because of the race between zs_free() and zs_page_migrate(). Please see the commit 2505a981114d ("zsmalloc: fix races between asynchronous zspage free and page migration") for details. Actually, all free_zspage() needs is to get zspage from page safely, we can use RCU to achieve it easily. Then free_zspage() don't need to hold locks of all pages, so don't need the deferred free mechanism at all. The updated zs_page_migrate() now has two more cases to consider: 1. get_zspage_lockless() return NULL: it means free_zspage() has used reset_page() on this page and its reference of page. 2. get_zspage_lockless() return zspage but it's not on pool list: it means zspage has been removed from list and in process of free. I'm not sure what value should be returned in these cases? -EINVAL or -EAGAIN or other value? If the migration caller can find that page has no extra referenced and can just free it, I think we should return -EAGAIN to let the migration caller retry this page later to free it. Now I choose to use -EINVAL to skip migration of this page, it seems not a big deal to fail migration of some pages? Link: https://lkml.kernel.org/r/20240226-zsmalloc-zspage-rcu-v1-0-456b0ef1a89d@xxxxxxxxxxxxx Link: https://lkml.kernel.org/r/20240226-zsmalloc-zspage-rcu-v1-1-456b0ef1a89d@xxxxxxxxxxxxx Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx> Cc: Johannes Weiner <hannes@xxxxxxxxxxx> Cc: Minchan Kim <minchan@xxxxxxxxxx> Cc: Nhat Pham <nphamcs@xxxxxxxxx> Cc: Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- mm/zsmalloc.c | 97 +++++++++++++++++++++++++++--------------------- 1 file changed, 56 insertions(+), 41 deletions(-) --- a/mm/zsmalloc.c~mm-zsmalloc-dont-hold-locks-of-all-pages-when-free_zspage +++ a/mm/zsmalloc.c @@ -253,6 +253,7 @@ struct zspage { struct list_head list; /* fullness list */ struct zs_pool *pool; rwlock_t lock; + struct rcu_head rcu_head; }; struct mapping_area { @@ -310,6 +311,8 @@ static int create_cache(struct zs_pool * static void destroy_cache(struct zs_pool *pool) { kmem_cache_destroy(pool->handle_cachep); + /* Synchronize RCU to free zspage. */ + synchronize_rcu(); kmem_cache_destroy(pool->zspage_cachep); } @@ -335,6 +338,14 @@ static void cache_free_zspage(struct zs_ kmem_cache_free(pool->zspage_cachep, zspage); } +static void rcu_free_zspage(struct rcu_head *h) +{ + struct zspage *zspage = container_of(h, struct zspage, rcu_head); + struct zs_pool *pool = zspage->pool; + + kmem_cache_free(pool->zspage_cachep, zspage); +} + /* pool->lock(which owns the handle) synchronizes races */ static void record_obj(unsigned long handle, unsigned long obj) { @@ -710,14 +721,31 @@ out: return newfg; } +static void set_zspage(struct page *page, struct zspage *zspage) +{ + struct zspage __rcu **private = (struct zspage __rcu **)&page->private; + + rcu_assign_pointer(*private, zspage); +} + static struct zspage *get_zspage(struct page *page) { - struct zspage *zspage = (struct zspage *)page_private(page); + struct zspage __rcu **private = (struct zspage __rcu **)&page->private; + struct zspage *zspage; + zspage = rcu_dereference_protected(*private, true); BUG_ON(zspage->magic != ZSPAGE_MAGIC); return zspage; } +/* Only used in zs_page_migrate() to get zspage locklessly. */ +static struct zspage *get_zspage_lockless(struct page *page) +{ + struct zspage __rcu **private = (struct zspage __rcu **)&page->private; + + return rcu_dereference(*private); +} + static struct page *get_next_page(struct page *page) { struct zspage *zspage = get_zspage(page); @@ -793,32 +821,11 @@ static void reset_page(struct page *page { __ClearPageMovable(page); ClearPagePrivate(page); - set_page_private(page, 0); + set_zspage(page, NULL); page_mapcount_reset(page); page->index = 0; } -static int trylock_zspage(struct zspage *zspage) -{ - struct page *cursor, *fail; - - for (cursor = get_first_page(zspage); cursor != NULL; cursor = - get_next_page(cursor)) { - if (!trylock_page(cursor)) { - fail = cursor; - goto unlock; - } - } - - return 1; -unlock: - for (cursor = get_first_page(zspage); cursor != fail; cursor = - get_next_page(cursor)) - unlock_page(cursor); - - return 0; -} - static void __free_zspage(struct zs_pool *pool, struct size_class *class, struct zspage *zspage) { @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool VM_BUG_ON_PAGE(!PageLocked(page), page); next = get_next_page(page); reset_page(page); - unlock_page(page); dec_zone_page_state(page, NR_ZSPAGES); put_page(page); page = next; } while (page != NULL); - cache_free_zspage(pool, zspage); + call_rcu(&zspage->rcu_head, rcu_free_zspage); class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage); atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated); @@ -852,16 +858,6 @@ static void free_zspage(struct zs_pool * VM_BUG_ON(get_zspage_inuse(zspage)); VM_BUG_ON(list_empty(&zspage->list)); - /* - * Since zs_free couldn't be sleepable, this function cannot call - * lock_page. The page locks trylock_zspage got will be released - * by __free_zspage. - */ - if (!trylock_zspage(zspage)) { - kick_deferred_free(pool); - return; - } - remove_zspage(class, zspage); __free_zspage(pool, class, zspage); } @@ -929,7 +925,7 @@ static void create_page_chain(struct siz */ for (i = 0; i < nr_pages; i++) { page = pages[i]; - set_page_private(page, (unsigned long)zspage); + set_zspage(page, zspage); page->index = 0; if (i == 0) { zspage->first_page = page; @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struc pages[i] = page; } - create_page_chain(class, zspage, pages); init_zspage(class, zspage); zspage->pool = pool; zspage->class = class->index; + /* RCU set_zspage() after zspage initialized. */ + create_page_chain(class, zspage, pages); return zspage; } @@ -1765,17 +1762,35 @@ static int zs_page_migrate(struct page * VM_BUG_ON_PAGE(!PageIsolated(page), page); - /* The page is locked, so this pointer must remain valid */ - zspage = get_zspage(page); - pool = zspage->pool; + rcu_read_lock(); + zspage = get_zspage_lockless(page); + if (!zspage) { + rcu_read_unlock(); + return -EINVAL; + } /* * The pool's lock protects the race between zpage migration - * and zs_free. + * and zs_free. We check if the zspage is still in pool with + * pool->lock protection. If the zspage isn't in pool anymore, + * it should be freed by RCU soon. */ + pool = zspage->pool; spin_lock(&pool->lock); class = zspage_class(pool, zspage); + if (list_empty(&zspage->list)) { + spin_unlock(&pool->lock); + rcu_read_unlock(); + return -EINVAL; + } + + /* + * Now the zspage is still on pool, and we held pool->lock, + * it can't be freed in the meantime. + */ + rcu_read_unlock(); + /* the migrate_write_lock protects zpage access via zs_map_object */ migrate_write_lock(zspage); _ Patches currently in -mm which might be from zhouchengming@xxxxxxxxxxxxx are mm-zswap-global-lru-and-shrinker-shared-by-all-zswap_pools.patch mm-zswap-change-zswap_pool-kref-to-percpu_ref.patch mm-zsmalloc-dont-hold-locks-of-all-pages-when-free_zspage.patch mm-zsmalloc-remove-the-deferred-free-mechanism.patch