On Sun, Nov 10, 2024 at 08:28:00AM -0700, Jens Axboe wrote:
> Add RWF_UNCACHED as a read operation flag, which means that any data
> read will be removed from the page cache upon completion. Uses the page
> cache to synchronize, and simply prunes folios that were instantiated
> when the operation completes. While it would be possible to use private
> pages for this, using the page cache as synchronization is handy for a
> variety of reasons:
>
> 1) No special truncate magic is needed
> 2) Async buffered reads need some place to serialize, using the page
>    cache is a lot easier than writing extra code for this
> 3) The pruning cost is pretty reasonable
>
> and the code to support this is much simpler as a result.
>
> You can think of uncached buffered IO as being the much more attractive
> cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
> it will copy the data, but unlike regular buffered IO, it doesn't run
> into the unpredictability of the page cache in terms of reclaim. As an
> example, on a test box with 32 drives, reading them with buffered IO
> looks as follows:
>
> Reading bs 65536, uncached 0
> 1s: 145945MB/sec
> 2s: 158067MB/sec
> 3s: 157007MB/sec
> 4s: 148622MB/sec
> 5s: 118824MB/sec
> 6s: 70494MB/sec
> 7s: 41754MB/sec
> 8s: 90811MB/sec
> 9s: 92204MB/sec
> 10s: 95178MB/sec
> 11s: 95488MB/sec
> 12s: 95552MB/sec
> 13s: 96275MB/sec
>
> where it's quite easy to see where the page cache filled up, and
> performance went from good to erratic, then finally settled at a much
> lower rate. Looking at top while this is ongoing, we see:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
>    3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
>    3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
>    3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
>    3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
>    3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
>    3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
>    3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
>    3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
>    3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
>    3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
>    3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
>    3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
>    3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
>    3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
>    3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
>    3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
>    3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
>    3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
>    3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
>    3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
> [...]
>
> which is just showing a partial list of the 32 kswapd threads that are
> running mostly full tilt, burning ~28 full CPU cores.
>
> If the same test case is run with RWF_UNCACHED set for the buffered read,
> the output looks as follows:
>
> Reading bs 65536, uncached 1
> 1s: 153144MB/sec
> 2s: 156760MB/sec
> 3s: 158110MB/sec
> 4s: 158009MB/sec
> 5s: 158043MB/sec
> 6s: 157638MB/sec
> 7s: 157999MB/sec
> 8s: 158024MB/sec
> 9s: 157764MB/sec
> 10s: 157477MB/sec
> 11s: 157417MB/sec
> 12s: 157455MB/sec
> 13s: 157233MB/sec
> 14s: 156692MB/sec
>
> which is just chugging along at ~155GB/sec of read performance. Looking
> at top, we see:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>    8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>
> where just the test app is using CPU; no reclaim is taking place outside
> of the main thread. Not only is performance 65% better, it's also using
> half the CPU to do it.
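For reference, reads like the above can be driven from userspace with
preadv2(2). Below is a minimal sketch, assuming the RWF_UNCACHED uapi
definition this series adds; the fallback value here is an assumption
and should come from <linux/fs.h> once the series lands:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED    0x00000080      /* assumed value from this series */
#endif

int main(int argc, char **argv)
{
        static char buf[65536];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        off_t off = 0;
        ssize_t ret;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /*
         * Regular buffered reads, except the kernel prunes the folios
         * that were instantiated for each read once it completes.
         */
        while ((ret = preadv2(fd, &iov, 1, off, RWF_UNCACHED)) > 0)
                off += ret;
        if (ret < 0)
                perror("preadv2");
        close(fd);
        return ret < 0;
}

A kernel that doesn't know the flag rejects it with -EOPNOTSUPP, so the
fallback define is only useful for testing against a patched kernel.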
>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> ---
>  mm/filemap.c | 18 ++++++++++++++++--
>  mm/swap.c    |  2 ++
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 38dc94b761b7..bd698340ef24 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2474,6 +2474,8 @@ static int filemap_create_folio(struct kiocb *iocb,
>  	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
>  	if (!folio)
>  		return -ENOMEM;
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		__folio_set_uncached(folio);
>
>  	/*
>  	 * Protect against truncate / hole punch. Grabbing invalidate_lock
> @@ -2519,6 +2521,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
>
>  	if (iocb->ki_flags & IOCB_NOIO)
>  		return -EAGAIN;
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		ractl.uncached = 1;
>  	page_cache_async_ra(&ractl, folio, last_index - folio->index);
>  	return 0;
>  }
> @@ -2548,6 +2552,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  			return -EAGAIN;
>  		if (iocb->ki_flags & IOCB_NOWAIT)
>  			flags = memalloc_noio_save();
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			ractl.uncached = 1;
>  		page_cache_sync_ra(&ractl, last_index - index);
>  		if (iocb->ki_flags & IOCB_NOWAIT)
>  			memalloc_noio_restore(flags);
> @@ -2706,8 +2712,16 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
>  			}
>  		}
>  put_folios:
> -		for (i = 0; i < folio_batch_count(&fbatch); i++)
> -			folio_put(fbatch.folios[i]);
> +		for (i = 0; i < folio_batch_count(&fbatch); i++) {
> +			struct folio *folio = fbatch.folios[i];
> +
> +			if (folio_test_uncached(folio)) {
> +				folio_lock(folio);
> +				invalidate_complete_folio2(mapping, folio, 0);
> +				folio_unlock(folio);

I am not sure this is safe. What happens if it races with a page fault?

The only current caller of invalidate_complete_folio2() unmaps the folio
explicitly before calling it, and the folio lock prevents re-faulting.

I think we need to give up PG_uncached if we see folio_mapped(), and
maybe also mark the page accessed.

> +			}
> +			folio_put(folio);
> +		}
>  		folio_batch_init(&fbatch);
>  	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 835bdf324b76..f2457acae383 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -472,6 +472,8 @@ static void folio_inc_refs(struct folio *folio)
>   */
>  void folio_mark_accessed(struct folio *folio)
>  {
> +	if (folio_test_uncached(folio))
> +		return;

	if (folio_test_uncached(folio)) {
		if (folio_mapped(folio))
			folio_clear_uncached(folio);
		else
			return;
	}

>  	if (lru_gen_enabled()) {
>  		folio_inc_refs(folio);
>  		return;
> -- 
> 2.45.2
>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov