Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior

On 27.02.25 23:12, Matthew Wilcox wrote:
On Tue, Feb 25, 2025 at 10:56:21AM +1100, Dave Chinner wrote:
 From the previous discussions that Matthew shared [7], it seems like
Dave proposed an alternative to moving the extents to the VFS layer to
invert the IO read path operations [8]. Maybe this is a more
approachable solution since there is precedent for the same in the
write path?

[7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@xxxxxxxxxxxxxxxxxxxx/
[8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@xxxxxxxxxxxxxxxxxxx/

Yes, if we are going to optimise away redundant zeros being stored
in the page cache over holes, we need to know where the holes in the
file are before the page cache is populated.

Well, you shot that down when I started trying to flesh it out:
https://lore.kernel.org/linux-fsdevel/Zs+2u3%2FUsoaUHuid@xxxxxxxxxxxxxxxxxxx/

As for efficient hole tracking in the mapping tree, I suspect that
we should be looking at using exceptional entries in the mapping
tree for holes, not inserting multiple references to the zero folio.
i.e. the important information for data storage optimisation is that
the region covers a hole, not that it contains zeros.

The xarray is very much optimised for storing power-of-two sized &
aligned objects.  It makes no sense to try to track extents using the
mapping tree.  Now, if we abandon the radix tree for the maple tree, we
could talk about storing zero extents in the same data structure.
But that's a big change with potentially significant downsides.
It's something I want to play with, but I'm a little busy right now.

For buffered reads, all that is required when such an exceptional
entry is returned is a memset of the user buffer. For buffered
writes, we simply treat it like a normal folio allocating write and
replace the exceptional entry with the allocated (and zeroed) folio.

... and unmap the zero page from any mappings.

For read page faults, the zero page gets mapped (and maybe
accounted) via the vma rather than the mapping tree entry. For write
faults, a folio gets allocated and the exception entry replaced
before we call into ->page_mkwrite().

Invalidation simply removes the exceptional entries.

... and unmap the zero page from any mappings.


I'll add one detail for future reference; not sure about the priority this should have, but it's one of these nasty corner cases that are not obvious to spot when using the shared zeropage in MAP_SHARED mappings:

Currently, only FS-DAX makes use of the shared zeropage in "ordinary MAP_SHARED" mappings. It doesn't use it for "holes" but for "logically zero" pages, to avoid allocating disk blocks (-> translating to actual DAX memory) on read-only access.

There is one issue between gup(FOLL_LONGTERM | FOLL_PIN) and the shared zeropage in MAP_SHARED mappings. It so far does not apply to fsdax,
because ... we don't support FOLL_LONGTERM for fsdax at all.

I spelled out part of the issue in fce831c92092 ("mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()").

In general, the problem is that gup(FOLL_LONGTERM | FOLL_PIN) will have to decide if it is okay to longterm-pin the shared zeropage in a MAP_SHARED mapping (which might just be fine with a R/O file in some cases?), and if not, it would have to trigger FAULT_FLAG_UNSHARE similar to how we break COW in MAP_PRIVATE mappings (shared zeropage -> anonymous folio).

If gup(FOLL_LONGTERM | FOLL_PIN) would just always longterm-pin the shared zeropage, and somebody else would end up triggering replacement of the shared zeropage in the pagecache (e.g., write() to the file offset, write access to the VMA that triggers a write fault etc.), you'd get a disconnect between what the GUP user sees and what the pagecache actually contains.

The file system fault logic will have to be taught about FAULT_FLAG_UNSHARE and handle it accordingly (e.g., fill the file hole, allocate disk space, allocate an actual folio ...).

Things like memfd_pin_folios() might require similar care -- that one in particular should likely never return the shared zeropage.

Likely gup(FOLL_LONGTERM | FOLL_PIN) users like RDMA or VFIO will be able to trigger it.


Not using the shared zeropage but instead some "hole" PTE marker could avoid this problem. Of course, that would not allow mapping the shared zeropage for reads there, but maybe that's not strictly required?

--
Cheers,

David / dhildenb
