Re: [PATCH v8 0/5] block: add larger order folio instead of pages

Luis Chamberlain <mcgrof@xxxxxxxxxx> · Thu, 8 Aug 2024 16:04:03 -0700

On Thu, Jul 11, 2024 at 10:37:45AM +0530, Kundan Kumar wrote:
> User space memory is mapped in kernel in form of pages array. These pages
> are iterated and added to BIO. In process, pages are also checked for
> contiguity and merged.
> 
> When mTHP is enabled the pages generally belong to larger order folio. This
> patch series enables adding large folio to bio. It fetches folio for
> page in the page array. The page might start from an offset in the folio
> which could be multiples of PAGE_SIZE. Subsequent pages in page array
> might belong to same folio. Using the length of folio, folio_offset and
> remaining size, determine length in folio which can be added to the bio.
> Check if pages are contiguous and belong to same folio. If yes then skip
> further processing for the contiguous pages.
> 
> This complete scheme reduces the overhead of iterating through pages.
> 
> perf diff before and after this change(with mTHP enabled):
> 
> Perf diff for write I/O with 128K block size:
>     1.24%     -0.20%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
>     1.71%             [kernel.kallsyms]  [k] bvec_try_merge_page
> Perf diff for read I/O with 128K block size:
>     4.03%     -1.59%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
>     5.14%             [kernel.kallsyms]  [k] bvec_try_merge_page

This is not just about mTHP use, though. It can also affect buffered IO and
direct IO patterns, and that needs to be considered and tested as well.

I've given this a spin on top of the LBS patches [0], using the LBS
patches as a baseline. The good news is I see a considerable amount of
larger IOs for buffered IO and direct IO. However, for buffered IO there
is an increase in IOs that are not aligned to the target filesystem block
size, and that can affect performance.

You can test this with Daniel Gomez's blkalgn tool for IO introspection:

wget https://raw.githubusercontent.com/dkruces/bcc/lbs/tools/blkalgn.py
mv blkalgn.py /usr/local/bin/
apt-get install python3-bpfcc

And so let's try to make things "bad" by forcing a million small 4k files
onto a 64k block size filesystem, where we see an increase in alignment by
a factor of about 2133:

fio -name=1k-files-per-thread --nrfiles=1000 -direct=0 -bs=512 \
	-ioengine=io_uring --group_reporting=1 \
	--alloc-size=2097152 --filesize=4KiB --readwrite=randwrite \
	--fallocate=none --numjobs=1000 --create_on_open=1 --directory=$DIR

# Force any pending IO from the page cache
umount /xfs-64k/

You can use blkalgn with something like this:

mkfs.xfs -f -b size=64k /dev/nvme0n1
blkalgn -d nvme0n1 --ops Write --json-output 64k-next-20240723.json

# Hit CTRL-C after you umount above.

In the JSON output below, the left-hand side is the order (log2 of the
size in bytes), so for example we see only six 4k IOs aligned to 4k with
the baseline of just LBS on top of next-20240723. However, with these
patches that increases to 11 4k IOs, but 23,468 IOs are aligned to 4k.

cat 64k-next-20240723.json
{
    "Block size": {
        "13": 1,
        "12": 6,
        "18": 244899,
        "16": 5236751,
        "17": 13088
    },
    "Algn size": {
        "18": 244899,
        "12": 6,
        "17": 9793,
        "13": 1,
        "16": 5240047
    }
}

And with this series, saved as, say, 64k-next-20240723-block-folios.json:

{
    "Block size": {
        "16": 1018244,
        "9": 7,
        "17": 507163,
        "13": 16,
        "10": 4,
        "15": 51671,
        "12": 11,
        "14": 43,
        "11": 5
    },
    "Algn size": {
        "15": 6651,
        "16": 1018244,
        "13": 17620,
        "12": 23468,
        "17": 507163,
        "14": 4018
    }
}
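
If it helps when comparing runs, here is a minimal sketch for summarizing
the "Algn size" histograms above. It assumes the keys are orders (log2 of
the size in bytes, so "12" is 4k and "16" is 64k), and the helper names
order_to_bytes() and summarize() are made up for illustration; they are
not part of blkalgn. The counts are copied from the two outputs above; in
practice you would json.load() the files instead.

```python
# Hypothetical helper, not part of blkalgn: count IOs whose alignment is
# smaller than the filesystem block size.

FS_BLOCK_ORDER = 16  # 64k filesystem block size == 2**16 bytes

# "Algn size" histograms copied from the JSON outputs in this email.
BASELINE_ALGN = {"18": 244899, "12": 6, "17": 9793, "13": 1, "16": 5240047}
PATCHED_ALGN = {"15": 6651, "16": 1018244, "13": 17620, "12": 23468,
                "17": 507163, "14": 4018}


def order_to_bytes(order):
    """Convert an order key such as "12" to a size in bytes (4096)."""
    return 1 << int(order)


def summarize(histogram, fs_block_order=FS_BLOCK_ORDER):
    """Return (IOs aligned below the fs block size, total IOs)."""
    total = sum(histogram.values())
    below = sum(count for order, count in histogram.items()
                if int(order) < fs_block_order)
    return below, total


if __name__ == "__main__":
    for name, hist in (("baseline", BASELINE_ALGN), ("patched", PATCHED_ALGN)):
        below, total = summarize(hist)
        print(f"{name}: {below} of {total} IOs aligned below "
              f"{order_to_bytes(FS_BLOCK_ORDER)} bytes "
              f"({100.0 * below / total:.4f}%)")
```

With the numbers above this gives 7 of 5,494,746 IOs aligned below 64k
for the baseline and 51,757 of 1,577,164 with this series.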

When using direct IO, since applications typically do the right thing,
I see only improvements. And so this needs a bit more testing and
evaluation for impact on alignment for buffered IO.

[0] https://github.com/linux-kdevops/linux/tree/large-block-folio-for-next

  Luis