On Thu, Nov 07, 2024 at 03:56:10PM -0800, Joanne Koong wrote:
> Currently in shrink_folio_list(), reclaim for folios under writeback
> falls into 3 different cases:
> 1) Reclaim is encountering an excessive number of folios under
>    writeback and this folio has both the writeback and reclaim flags
>    set
> 2) Dirty throttling is enabled (this happens if reclaim through cgroup
>    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
>    if reclaim is on the root cgroup), or if the folio is not marked for
>    immediate reclaim, or if the caller does not have __GFP_FS (or
>    __GFP_IO if it's going to swap) set
> 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
>    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
>
> In cases 1) and 2), we activate the folio and skip reclaiming it, while
> in case 3), we wait for writeback to finish on the folio and then try
> to reclaim the folio again. In case 3), we wait on writeback because
> cgroupv1 does not have dirty folio throttling; as such, this is a
> mitigation against the case where there are too many folios in writeback
> with nothing else to reclaim.
>
> The issue is that for filesystems where writeback may block, sub-optimal
> workarounds may need to be put in place to avoid a potential deadlock
> that may arise from reclaim waiting on writeback. (Even though case 3
> above is rare given that legacy cgroupv1 is on its way to being
> deprecated, this case still needs to be accounted for.)
> For example, for
> FUSE filesystems, a temp page gets allocated per dirty page and the
> contents of the dirty page are copied over to the temp page so that
> writeback can be immediately cleared on the dirty page in order to avoid
> the following deadlock:
> * single-threaded FUSE server is in the middle of handling a request that
>   needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback (eg falls into case 3
>   above) that needs to be written back to the FUSE server
> * the FUSE server can't write back the folio since it's stuck in direct
>   reclaim
>
> In this commit, if legacy memcg encounters a folio with the reclaim flag
> set (eg case 3) and the folio belongs to a mapping that has the
> AS_WRITEBACK_MAY_BLOCK flag set, the folio will be activated and skip
> reclaim (eg default to behavior in case 2) instead.
>
> This allows for the suboptimal workarounds added to address the
> "reclaim wait on writeback" deadlock scenario to be removed.
>
> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>

This looks good, just one nit below.

Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>

> ---
>  mm/vmscan.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..e9755cb7211b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	if (writeback && folio_test_reclaim(folio))
>  		stat->nr_congested += nr_pages;
>
> +	mapping = folio_mapping(folio);

Move the above line within the folio_test_writeback() check block.

> +
>  	/*
>  	 * If a folio at the tail of the LRU is under writeback, there
>  	 * are three cases to consider.
> @@ -1129,8 +1131,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	 * 2) Global or new memcg reclaim encounters a folio that is
>  	 *    not marked for immediate reclaim, or the caller does not
>  	 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
> -	 *    not to fs). In this case mark the folio for immediate
> -	 *    reclaim and continue scanning.
> +	 *    not to fs), or writebacks in the mapping may block.
> +	 *    In this case mark the folio for immediate reclaim and
> +	 *    continue scanning.
>  	 *
>  	 * Require may_enter_fs() because we would wait on fs, which
>  	 * may not have submitted I/O yet. And the loop driver might
> @@ -1165,7 +1168,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		/* Case 2 above */
>  	} else if (writeback_throttling_sane(sc) ||
>  		   !folio_test_reclaim(folio) ||
> -		   !may_enter_fs(folio, sc->gfp_mask)) {
> +		   !may_enter_fs(folio, sc->gfp_mask) ||
> +		   (mapping && mapping_writeback_may_block(mapping))) {
>  		/*
>  		 * This is slightly racy -
>  		 * folio_end_writeback() might have
> --
> 2.43.5
>