Re: Folios for anonymous memory

On Wed, Feb 15, 2023 at 12:38:13PM +0000, Ryan Roberts wrote:
> Kernel Compilation:
> Speed up due to SW overhead reduction: 6.5%
> Speed up due to HW overhead reduction: 5.0%
> Total speed up: 11.5%
> 
> Speedometer 2.0:
> Speed up due to SW overhead reduction: 5.3%
> Speed up due to HW overhead reduction: 5.1%
> Total speed up: 10.4%
> 
> Digging into the reasons for the SW-side speedup, it boils down to less
> book-keeping - 4x fewer page faults, 4x fewer pages to manage locks/refcounts/…
> for, which leads to faster abort and syscall handling. I think these phenomena
> are well understood in the Folio context? Although for these workloads, the
> memory is primarily anonymous.

All of that tracks pretty well with what I've found.  I haven't been
running exactly the same experiments, and different hardware will have
different properties, but it all seems about right.

> I’d like to figure out how to realise some of these benefits in a kernel that
> still maintains a 4K page user ABI. Reading over old threads, LWN and watching
> Matthew’s talk at OSS last summer, it sounds like this is exactly what Folios
> intend to solve?

Yes, it's exactly what folios are supposed to achieve -- opportunistic use
of larger memory allocations & TLB sizes when the stars align.

> So a few questions:
> 
> - I’ve seen folios for anon memory listed as future work; what’s the current
> status? Is anyone looking at this? It’s something that I would be interested to
> take a look at if not (although don’t take that as an actual commitment yet!).

There are definitely people _looking_ at it.  I don't think anyone's
committed to it, and I don't think there's anyone 50 patches into a 100
patch series to make it work ;-)  I think there are a lot of unanswered
questions about how best to do it.

> - My understanding is that as of v6.0, at least, XFS was the only FS supporting
> large folios? Has that picture changed? Is there any likelihood of seeing ext4
> and f2fs support anytime soon?

We have some progress on that front.  In addition to XFS, AFS, EROFS
and tmpfs currently enable support for large folios.  I've heard tell
of NFS support coming soon.  I'm pretty sure CIFS is looking into it.
The OCFS2 maintainers are interested.  You can find the current state
of fs support by grepping for mapping_set_large_folios().

People are working on it from the f2fs side:
https://lore.kernel.org/linux-fsdevel/Y5D8wYGpp%2F95ShTV@xxxxxxxxxxxxxxxxxxxxxx/

ext4 is being more conservative.  I posted a patch series to convert
ext4 to use order-0 folios instead of pages (enabling large folios
will be more work), but it hasn't received any significant responses
yet:
https://lore.kernel.org/linux-fsdevel/20230126202415.1682629-1-willy@xxxxxxxxxxxxx/

> - Matthew mentioned in the talk that he had data showing memory fragmentation
> becoming less of an issue as more users were allocating large folios. Is that
> data or the experimental approach public?

I'm not sure I have data on that front; more of an argument from first
principles -- page cache is the easiest form of memory to reclaim
since it's usually clean.  If the filesystems using the page cache are
allocating large folios, it's easier to find larger chunks of memory.
Also every time a fs tries to allocate large folios and fails, it'll
poke the compaction code to try to create larger chunks of memory.

There are also memory allocation patterns to consider.  At some point, all
our low-order pools will be empty and we'll have to break up an order-10
page.  If we're allocating individual pages for the filesystem, we'll
happily allocate the first few, but then the radix tree in which we store
the pages will have to allocate a new node from slab.  Slab allocates
28 nodes from an order-2 page allocation, so you'll almost instantly get
a case where this order-10 page will never be reassembled.  Unless your
system is configured with a movable memory zone (which segregates slab
allocations from page cache allocations) -- and my laptop certainly isn't.

I don't want you to get the impression that all the work going on is
targeted at filesystem folios.  There's a lot of infrastructure being
converted from pages to folios and reexamined at the same time to make
sure it handles arbitrary-order folios correctly.  Right
now, I'm working on the architecture support for inserting multiple
consecutive PTEs at the same time:
https://lore.kernel.org/linux-arch/20230211033948.891959-1-willy@xxxxxxxxxxxxx/

Thanks for reaching out.  We have a Zoom call on alternate Fridays,
so if you're free at 5pm UK time (yes, I know ... trying to fit in both
California and central Europe leads to awkward times for phone calls),
I can send you the meeting details.



