On Tue, Sep 12, 2023 at 12:28:11PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@xxxxxxxxxx>
>
> Feel free to give comments and ask questions.

How about testing? I'm looking with an eye towards creating a
pathological situation which can be automated for fragmentation, to
see how things go. Mel Gorman's original artificial fragmentation
workload, taken from his first patches to help with fragmentation
avoidance from 2018, suggests he tried [0]:

------ From 2018

a) Create an XFS filesystem

b) Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% physical memory so that the slabs and page
   cache pages get mixed.

c) Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen. While step 3 is still running, start a process that tries
   to allocate 75% of memory as huge pages with a number of threads.
   The number of threads is based on (NR_CPUS_SOCKET -
   NR_FIO_THREADS)/4 to avoid THP threads contending with fio, any
   other threads or forcing cross-NUMA scheduling. Note that the test
   has not been used on a machine with less than 8 cores. The
   benchmark records whether huge pages were allocated and what the
   fault latency was in microseconds.

d) Measure the number of events potentially causing external
   fragmentation, the fault latency and the huge page allocation
   success rate.

------- end of extract

These days we can probably do a bit more damage. There have been
concerns that LBS support (block size > page size) could make
fragmentation worse. One of the reasons is that any file created,
regardless of its size, will require at least the block size, so if
you're using a 64k block size that means a 64k allocation for each new
file on that 64k block size filesystem; clearly you may run out of
lower order allocations pretty quickly. You can also create different
large block filesystems too, one with 64k and another with 32k.

Although LBS is new and we're still ironing out the kinks, if you
wanna give it a go we've rebased the patches onto Linus' tree [1], and
if you wanted to ramp up fast you could use kdevops [2], which lets
you pick that branch and also a series of NVMe drives (by enabling
CONFIG_LIBVIRT_EXTRA_STORAGE_DRIVE_NVME) for large IO experimentation
(by enabling CONFIG_VAGRANT_ENABLE_LARGEIO).

Creating different filesystems with large block sizes (64k, 32k, 16k)
on a 4k sector size drive (mkfs.xfs -f -b size=64k -s size=4k) should
let you easily do tons of crazy pathological things.

Are there other known recipes to help test this stuff? How do we
measure success in your patches for fragmentation exactly?

[0] https://lwn.net/Articles/770235/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus-nobdev
[2] https://github.com/linux-kdevops/kdevops

  Luis
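
A rough, untested sketch of how steps a)-c) above could be wired up
with fio on a 64k block size XFS filesystem. The device path, mount
point and thread count are placeholders; the fio options named in the
2018 extract (create_on_open, fallocate) are used as-is and the rest
(nrfiles, filesize, ioengine, runtime) are my own assumptions about a
reasonable setup, not Mel's harness:

#!/bin/bash
# Assumed device and mount point, adjust to taste.
DEV=/dev/nvme0n1
MNT=/mnt/xfs-64k
NR_FIO_THREADS=4

# Large block size filesystem on a 4k sector drive, as above.
mkfs.xfs -f -b size=64k -s size=4k "$DEV"
mkdir -p "$MNT"
mount "$DEV" "$MNT"

# Total file data is 150% of physical memory, spread over 64K files
# split across the fio threads.
MEM_KB=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
TOTAL_KB=$(( MEM_KB * 3 / 2 ))
NR_FILES=$(( TOTAL_KB / 64 / NR_FIO_THREADS ))

# Step b): write the files inefficiently, created on first open and
# without fallocate, to interleave slab and page cache allocations.
WRITE_START=$(date +%s)
fio --name=frag --directory="$MNT" --numjobs=$NR_FIO_THREADS \
    --nrfiles=$NR_FILES --filesize=64k --bs=64k --rw=write \
    --ioengine=psync --create_on_open=1 --fallocate=none \
    --group_reporting
WRITE_SECS=$(( $(date +%s) - WRITE_START ))

# Step c): read-only warm-up over the same files, for as long as the
# write phase took, faulting old data back in while memory is low.
fio --name=frag --directory="$MNT" --numjobs=$NR_FIO_THREADS \
    --nrfiles=$NR_FILES --filesize=64k --bs=64k --rw=read \
    --ioengine=psync --time_based --runtime=${WRITE_SECS}s \
    --group_reporting

# The concurrent THP allocation of 75% of memory and the measurements
# in step d) were done with Mel's own benchmark and are not reproduced
# here.

The same script could be repeated against 32k and 16k block size
filesystems (mkfs.xfs -b size=32k / -b size=16k) to mix allocation
sizes further.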