Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior

On 27.02.25 23:12, Matthew Wilcox wrote:
On Tue, Feb 25, 2025 at 10:56:21AM +1100, Dave Chinner wrote:
 From the previous discussions that Matthew shared [7], it seems like
Dave proposed an alternative to moving the extents to the VFS layer to
invert the IO read path operations [8]. Maybe this is a more
approachable solution since there is precedent for the same in the
write path?

[7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@xxxxxxxxxxxxxxxxxxxx/
[8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@xxxxxxxxxxxxxxxxxxx/

Yes, if we are going to optimise away redundant zeros being stored
in the page cache over holes, we need to know where the holes in the
file are before the page cache is populated.

Well, you shot that down when I started trying to flesh it out:
https://lore.kernel.org/linux-fsdevel/Zs+2u3%2FUsoaUHuid@xxxxxxxxxxxxxxxxxxx/

As for efficient hole tracking in the mapping tree, I suspect that
we should be looking at using exceptional entries in the mapping
tree for holes, not inserting multiple references to the zero folio.
i.e. the important information for data storage optimisation is that
the region covers a hole, not that it contains zeros.

The xarray is very much optimised for storing power-of-two sized &
aligned objects.  It makes no sense to try to track extents using the
mapping tree.  Now, if we abandon the radix tree for the maple tree, we
could talk about storing zero extents in the same data structure.
But that's a big change with potentially significant downsides.
It's something I want to play with, but I'm a little busy right now.

For buffered reads, all that is required when such an exceptional
entry is returned is a memset of the user buffer. For buffered
writes, we simply treat it like a normal folio allocating write and
replace the exceptional entry with the allocated (and zeroed) folio.

... and unmap the zero page from any mappings.

For read page faults, the zero page gets mapped (and maybe
accounted) via the vma rather than the mapping tree entry. For write
faults, a folio gets allocated and the exception entry replaced
before we call into ->page_mkwrite().

Invalidation simply removes the exceptional entries.

... and unmap the zero page from any mappings.


I'll add one detail for future reference; not sure about the priority this should have, but it's one of these nasty corner cases that are not obvious to spot when using the shared zeropage in MAP_SHARED mappings:

Currently, only FS-DAX makes use of the shared zeropage in "ordinary MAP_SHARED" mappings. It doesn't use it for "holes" but for "logically zero" pages, to avoid allocating disk blocks (-> translating to actual DAX memory) on read-only access.

There is one issue between gup(FOLL_LONGTERM | FOLL_PIN) and the shared zeropage in MAP_SHARED mappings. It so far does not apply to fsdax,
because ... we don't support FOLL_LONGTERM for fsdax at all.

I spelled out part of the issue in fce831c92092 ("mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()").

In general, the problem is that gup(FOLL_LONGTERM | FOLL_PIN) will have to decide if it is okay to longterm-pin the shared zeropage in a MAP_SHARED mapping (which might just be fine with a R/O file in some cases?), and if not, it would have to trigger FAULT_FLAG_UNSHARE similar to how we break COW in MAP_PRIVATE mappings (shared zeropage -> anonymous folio).

If gup(FOLL_LONGTERM | FOLL_PIN) would just always longterm-pin the shared zeropage, and somebody else would end up triggering replacement of the shared zeropage in the pagecache (e.g., write() to the file offset, write access to the VMA that triggers a write fault etc.), you'd get a disconnect between what the GUP user sees and what the pagecache actually contains.

The file system fault logic will have to be taught about FAULT_FLAG_UNSHARE and handle it accordingly (e.g., fill the file hole, allocate disk space, allocate an actual folio ...).

Things like memfd_pin_folios() might require similar care -- that one in particular should likely never return the shared zeropage.

Likely gup(FOLL_LONGTERM | FOLL_PIN) users like RDMA or VFIO will be able to trigger it.


Not using the shared zeropage but instead some "hole" PTE marker could avoid this problem. Of course, that would not allow mapping the shared zeropage for reads there, but maybe that's not strictly required?

--
Cheers,

David / dhildenb
