Hi Hyeonggon,

> -----Original Message-----
> From: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
> Sent: Tuesday, January 28, 2025 10:55 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; Johannes Weiner
> <hannes@xxxxxxxxxxx>; Yosry Ahmed <yosryahmed@xxxxxxxxxx>; Nhat Pham
> <nphamcs@xxxxxxxxx>; Chengming Zhou <chengming.zhou@xxxxxxxxx>;
> Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@xxxxxxxxx; Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>;
> stable@xxxxxxxxxxxxxxx
> Subject: [PATCH v2 mm-hotfixes] mm/zswap: fix inconsistent charging when
> zswap_store_page() fails
>
> Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
> skips charging any zswapped base pages when it fails to zswap the entire
> folio.
>
> However, when some base pages are zswapped but zswapping the entire folio
> fails, the zswap operation is rolled back. When freeing the zswap entries
> for those pages, zswap_entry_free() uncharges pages that were never
> charged, causing zswap charging to become inconsistent.
>
> This inconsistency triggers two warnings with the following steps:
> # On a machine with 64GiB of RAM and 36GiB of zswap
> $ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng
> $ sudo reboot
>
> The two warnings are:
> in mm/memcontrol.c:163, function obj_cgroup_release():
> 	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
>
> in mm/page_counter.c:60, function page_counter_cancel():
> 	if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
> 		      new, nr_pages))
>
> While objcg events should only be accounted for when the entire folio is
> zswapped, objcg charging should be performed regardless. Fix accordingly.
>
> After resolving the inconsistency, these warnings disappear.
>
> Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
> ---
>
> v1->v2:
>
> Fixed objcg events being accounted for on zswap failure.
> Fixed the incorrect description. I had misunderstood that the base pages
> would remain stored in zswap; in fact their zswap entries are freed
> immediately.
>
> Added a comment on why it charges pages that are going to be removed
> from zswap.
>
>  mm/zswap.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6504174fbc6a..10b30ac46deb 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1568,20 +1568,26 @@ bool zswap_store(struct folio *folio)
>
>  		bytes = zswap_store_page(page, objcg, pool);
>  		if (bytes < 0)
> -			goto put_pool;
> +			goto charge_zswap;
>  		compressed_bytes += bytes;
>  	}
>
> -	if (objcg) {
> -		obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +	if (objcg)
>  		count_objcg_events(objcg, ZSWPOUT, nr_pages);
> -	}
>
>  	atomic_long_add(nr_pages, &zswap_stored_pages);
>  	count_vm_events(ZSWPOUT, nr_pages);
>
>  	ret = true;
>
> +charge_zswap:
> +	/*
> +	 * Charge zswapped pages even when it failed to zswap the entire
> +	 * folio, because zswap_entry_free() will uncharge them anyway.
> +	 * Otherwise zswap charging will become inconsistent.
> +	 */
> +	if (objcg)
> +		obj_cgroup_charge_zswap(objcg, compressed_bytes);

Thanks for finding this bug! I am thinking it might make sense to charge
the objcg and increment the zswap_stored_pages counter in
zswap_store_page().
Something like:

diff --git a/mm/zswap.c b/mm/zswap.c
index b84c20d889b1..fd2a72598a8a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1504,11 +1504,14 @@ static ssize_t zswap_store_page(struct page *page,
 	entry->pool = pool;
 	entry->swpentry = page_swpentry;
 	entry->objcg = objcg;
+	if (objcg)
+		obj_cgroup_charge_zswap(objcg, entry->length);
 	entry->referenced = true;
 	if (entry->length) {
 		INIT_LIST_HEAD(&entry->lru);
 		zswap_lru_add(&zswap_list_lru, entry);
 	}
+	atomic_long_inc(&zswap_stored_pages);

 	return entry->length;

@@ -1526,7 +1529,6 @@ bool zswap_store(struct folio *folio)
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
 	struct zswap_pool *pool;
-	size_t compressed_bytes = 0;
 	bool ret = false;
 	long index;

@@ -1569,15 +1571,11 @@ bool zswap_store(struct folio *folio)

 		bytes = zswap_store_page(page, objcg, pool);
 		if (bytes < 0)
 			goto put_pool;
-		compressed_bytes += bytes;
 	}

-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, compressed_bytes);
+	if (objcg)
 		count_objcg_events(objcg, ZSWPOUT, nr_pages);
-	}

-	atomic_long_add(nr_pages, &zswap_stored_pages);
 	count_vm_events(ZSWPOUT, nr_pages);

 	ret = true;

What do you think? Yosry, Nhat, Johannes, please let me know if this would
be a cleaner approach. If so, I don't think we would lose much performance
by not doing the one-time charge per folio, but please share your thoughts
as well.

Thanks,
Kanchana

> put_pool:
> 	zswap_pool_put(pool);
> put_objcg:
> --
> 2.47.1