On Tue, Sep 12, 2023 at 10:17:25AM +0800, Vern Hao wrote:
> From: Xin Hao <vernhao@xxxxxxxxxxx>
> 
> At present, we support per-memcg reclaim, but we do not know how many
> transparent huge pages are being reclaimed. Transparent huge pages
> need to be split before they can be reclaimed, which introduces a
> performance bottleneck. For example, when two memcgs (A & B) are
> reclaiming anonymous pages at the same time and memcg 'A' is
> reclaiming a large number of transparent huge pages, we can better
> determine that the performance bottleneck is caused by memcg 'A'.
> To make such problems easier to analyze, add THP swap-out info
> per memcg.
> 
> Signed-off-by: Xin Hao <vernhao@xxxxxxxxxxx>
> ---
> v1 -> v2
> - Fix things up per Johannes Weiner's suggestions.
> v1:
> https://lore.kernel.org/linux-mm/20230911160824.GB103342@xxxxxxxxxxx/T/
> 
>  mm/memcontrol.c | 2 ++
>  mm/page_io.c    | 4 +++-
>  mm/vmscan.c     | 1 +
>  3 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ecc07b47e813..32d50db9ea0d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -752,6 +752,8 @@ static const unsigned int memcg_vm_event_stat[] = {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	THP_FAULT_ALLOC,
>  	THP_COLLAPSE_ALLOC,
> +	THP_SWPOUT,
> +	THP_SWPOUT_FALLBACK,

Can you please add documentation to
Documentation/admin-guide/cgroup-v2.rst? Sorry, I missed this in v1.

> @@ -208,8 +208,10 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  static inline void count_swpout_vm_event(struct folio *folio)
>  {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if (unlikely(folio_test_pmd_mappable(folio)))
> +	if (unlikely(folio_test_pmd_mappable(folio))) {
> +		count_memcg_folio_events(folio, THP_SWPOUT, 1);
>  		count_vm_event(THP_SWPOUT);
> +	}
>  #endif
>  	count_vm_events(PSWPOUT, folio_nr_pages(folio));
>  }

Looking through the callers, they seem mostly fine except this one:

static void sio_write_complete(struct kiocb *iocb, long ret)
{
	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
	struct page *page = sio->bvec[0].bv_page;
	int p;

	if (ret != sio->len) {
		/*
		 * In the case of swap-over-nfs, this can be a
		 * temporary failure if the system has limited
		 * memory for allocating transmit buffers.
		 * Mark the page dirty and avoid
		 * folio_rotate_reclaimable but rate-limit the
		 * messages but do not flag PageError like
		 * the normal direct-to-bio case as it could
		 * be temporary.
		 */
		pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
				   ret, page_file_offset(page));
		for (p = 0; p < sio->pages; p++) {
			page = sio->bvec[p].bv_page;
			set_page_dirty(page);
			ClearPageReclaim(page);
		}
	} else {
		for (p = 0; p < sio->pages; p++)
			count_swpout_vm_event(page_folio(sio->bvec[p].bv_page));

This is called at the end of IO, where the page is no longer locked.
Since it's not locked, page->memcg is not stable and might get freed
(charge moving is deprecated but still possible).

The fix is simple, though. Every other IO path bumps THP_SWPOUT before
starting the IO, while the page is still locked. We don't really care
if we get SWPOUT events even for failed IOs.
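To make the locking requirement explicit, you could even imagine a
debug wrapper along these lines (purely hypothetical, nothing like
this is in the tree; the helper name is made up):

/*
 * Illustration only: folio_memcg() is only stable against charge
 * moving while the folio is locked, so warn if a memcg event is
 * accounted on an unlocked folio.
 */
static inline void count_memcg_folio_events_locked(struct folio *folio,
						   enum vm_event_item idx,
						   unsigned long nr)
{
	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
	count_memcg_folio_events(folio, idx, nr);
}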
So we can just adjust this caller to fit the others, and count while
still locked:

diff --git a/mm/page_io.c b/mm/page_io.c
index fe4c21af23f2..7925e19aeedd 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -278,9 +278,6 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
 			set_page_dirty(page);
 			ClearPageReclaim(page);
 		}
-	} else {
-		for (p = 0; p < sio->pages; p++)
-			count_swpout_vm_event(page_folio(sio->bvec[p].bv_page));
 	}
 
 	for (p = 0; p < sio->pages; p++)
@@ -296,6 +293,7 @@ static void swap_writepage_fs(struct page *page, struct writeback_control *wbc)
 	struct file *swap_file = sis->swap_file;
 	loff_t pos = page_file_offset(page);
 
+	count_swpout_vm_event(page_folio(page));
 	set_page_writeback(page);
 	unlock_page(page);
 	if (wbc->swap_plug)
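And for the cgroup-v2.rst ask above, something along these lines would
do, following the format of the existing thp_fault_alloc /
thp_collapse_alloc entries (wording is just a suggestion):

	  thp_swpout (npn)
		Number of transparent hugepages which were swapped out in one
		piece without splitting.

	  thp_swpout_fallback (npn)
		Number of transparent hugepages which were split before being
		swapped out, usually because contiguous swap space could not
		be allocated for the huge page.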