Re: [PATCH 4/4] mm/filemap: optimize filemap folio adding

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Tue, 19 Mar 2024 22:19:57 +0000

On Tue, Mar 19, 2024 at 05:27:33PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@xxxxxxxxxxx>
> 
> Instead of doing multiple tree walks, do one optimism range check
> with lock hold, and exit if raced with another insertion. If a shadow
> exists, check it with a new xas_get_order helper before releasing the
> lock to avoid redundant tree walks for getting its order.
> 
> Drop the lock and do the allocation only if a split is needed.
> 
> In the best case, it only need to walk the tree once. If it needs
> to alloc and split, 3 walks are issued (One for first ranced
> conflict check and order retrieving, one for the second check after
> allocation, one for the insert after split).
> 
> Testing with 4k pages, in an 8G cgroup, with 20G brd as block device:
> 
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
>   --buffered=1 --ioengine=mmap --rw=randread --time_based \
>   --ramp_time=30s --runtime=5m --group_reporting
> 
> Before:
> bw (  MiB/s): min=  790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
> iops        : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698
> 
> After (+4%):
> bw (  MiB/s): min=  451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
> iops        : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653
> 
> Test result with THP (do a THP randread then switch to 4K page in hope it
> issues a lot of splitting):
> 
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
>   --buffered=1 --ioengine mmap -thp=1 --readonly \
>   --rw=randread --random_distribution=random \
>   --time_based --runtime=5m --group_reporting
> 
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
>   --buffered=1 --ioengine mmap --readonly \
>   --rw=randread --random_distribution=random \
>   --time_based --runtime=5s --group_reporting
> 
> Before:
> bw (  KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
> iops        : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
> bw (  MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
> iops        : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144
> 
> After (+-0.0%):
> bw (  KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
> iops        : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
> bw (  MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
> iops        : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144
> 
> The performance is better (+4%) for 4K cached read and unchanged for THP.
> 
> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> ---
>  mm/filemap.c | 127 ++++++++++++++++++++++++++++++---------------------
>  1 file changed, 76 insertions(+), 51 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6bbec8783793..c1484bcdbddb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -848,12 +848,77 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_folio);
>  
> +static int __split_add_folio_locked(struct xa_state *xas, struct folio *folio,
> +				    pgoff_t index, gfp_t gfp, void **shadowp)

I don't love the name of this function.  Splitting is a rare thing that
it does.  I'd suggest it's more filemap_store().

> +{
> +	void *entry, *shadow, *alloced_shadow = NULL;
> +	int order, alloced_order = 0;
> +
> +	gfp &= GFP_RECLAIM_MASK;
> +	for (;;) {
> +		shadow = NULL;
> +		order = 0;
> +
> +		xas_for_each_conflict(xas, entry) {
> +			if (!xa_is_value(entry))
> +				return -EEXIST;
> +			shadow = entry;
> +		}
> +
> +		if (shadow) {
> +			if (shadow == xas_reload(xas)) {

Why do you need the xas_reload here?

> +				order = xas_get_order(xas);
> +				if (order && order > folio_order(folio)) {
> +					/* entry may have been split before we acquired lock */
> +					if (shadow != alloced_shadow || order != alloced_order)
> +						goto unlock;
> +					xas_split(xas, shadow, order);
> +					xas_reset(xas);
> +				}
> +				order = 0;
> +			}

I don't think this is right.  I think we can end up skipping a split
and storing a folio into a slot which is of greater order than the folio
we're storing.

> +			if (shadowp)
> +				*shadowp = shadow;
> +		}
> +
> +		xas_store(xas, folio);
> +		/* Success, return with mapping locked */
> +		if (!xas_error(xas))
> +			return 0;
> +unlock:
> +		/*
> +		 * Unlock path, if errored, return unlocked.
> +		 * If allocation needed, alloc and retry.
> +		 */
> +		xas_unlock_irq(xas);
> +		if (order) {
> +			if (unlikely(alloced_order))
> +				xas_destroy(xas);
> +			xas_split_alloc(xas, shadow, order, gfp);
> +			if (!xas_error(xas)) {
> +				alloced_shadow = shadow;
> +				alloced_order = order;
> +			}
> +			goto next;
> +		}
> +		/* xas_nomem result checked by xas_error below */
> +		xas_nomem(xas, gfp);
> +next:
> +		xas_lock_irq(xas);
> +		if (xas_error(xas))
> +			return xas_error(xas);
> +
> +		xas_reset(xas);
> +	}
> +}

Splitting this out into a different function while changing the logic
really makes this hard to review ;-(

I don't object to splitting the function, but maybe two patches; one
to move the logic and a second to change it?