From: Kairui Song <kasong@xxxxxxxxxxx>

Currently, at least 3 tree walks are needed when filemap adds a
previously evicted folio: one to get the order, one for the ranged
conflict check, and one more to retrieve the order again. If a split
is needed, even more walks are required.

This series merges these walks to speed up filemap_add_folio.

Instead of doing multiple tree walks, do one optimistic range check
with the lock held, and exit if raced with another insertion. If a
shadow exists, check it with a new xas_get_order helper before
releasing the lock, to avoid redundant tree walks for getting its
order. Drop the lock and do the allocation only if a split is needed.

In the best case, the tree only needs to be walked once. If an
allocation and split are needed, 3 walks are issued (one for the first
ranged conflict check and order retrieval, one for the second check
after allocation, and one for the insert after the split).

Testing with 4k pages, in an 8G cgroup, with 20G brd as block device:

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine=mmap --rw=randread --time_based \
  --ramp_time=30s --runtime=5m --group_reporting

Before:
bw (  MiB/s): min=  790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
iops        : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698

After (+4%):
bw (  MiB/s): min=  451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
iops        : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653

Test result with THP (run a THP test first, then switch to 4K pages, in
the hope that this triggers a lot of splitting):

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine mmap -thp=1 --readonly \
  --rw=randread --random_distribution=random \
  --time_based --runtime=5m --group_reporting

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine mmap --readonly \
  --rw=randread --random_distribution=random \
  --time_based --runtime=5s --group_reporting

Before:
bw (  KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
iops        : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
bw (  MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
iops        : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144

After (+-0.0%):
bw (  KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
iops        : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
bw (  MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
iops        : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144

Performance is better (+4%) for 4K cached reads and unchanged for THP.

Kairui Song (4):
  mm/filemap: return early if failed to allocate memory for split
  mm/filemap: clean up hugetlb exclusion code
  lib/xarray: introduce a new helper xas_get_order
  mm/filemap: optimize filemap folio adding

 include/linux/xarray.h |   6 ++
 lib/xarray.c           |  49 +++++++++-----
 mm/filemap.c           | 145 ++++++++++++++++++++++++-----------------
 3 files changed, 121 insertions(+), 79 deletions(-)

-- 
2.43.0