Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 25 Feb 2025 10:56:21 +1100

On Mon, Feb 24, 2025 at 01:36:50PM -0800, Kalesh Singh wrote:
> Another possible way we can look at this: in the regressions shared
> above by the ELF padding regions, we are able to make these regions
> sparse (for *almost* all cases) -- solving the shared-zero page
> problem for file mappings, would also eliminate much of this overhead.
> So perhaps we should tackle this angle? If that's a more tangible
> solution ?
> 
> From the previous discussions that Matthew shared [7], it seems like
> Dave proposed an alternative to moving the extents to the VFS layer to
> invert the IO read path operations [8]. Maybe this is a more
> approachable solution since there is precedent for the same in the
> write path?
> 
> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@xxxxxxxxxxxxxxxxxxxx/
> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@xxxxxxxxxxxxxxxxxxx/

Yes, if we are going to optimise away redundant zeros being stored
in the page cache over holes, we need to know where the holes in the
file are before the page cache is populated.

As for efficient hole tracking in the mapping tree, I suspect that
we should be looking at using exceptional entries in the mapping
tree for holes, not inserting multiple references to the zero folio.
i.e. the important information for data storage optimisation is that
the region covers a hole, not that it contains zeros.

For buffered reads, all that is required when such an exceptional
entry is returned is a memset of the user buffer. For buffered
writes, we simply treat it like a normal folio allocating write and
replace the exceptional entry with the allocated (and zeroed) folio.
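
To make the buffered IO case concrete, here is a toy userspace model
in Python (not kernel code). A dict stands in for the mapping tree;
HOLE, Mapping and the fs_is_hole callback are names made up for this
sketch, and it assumes whole-page granularity:

```python
# Toy userspace model, not kernel code. All names are hypothetical.
PAGE_SIZE = 4096
HOLE = object()          # stands in for an exceptional entry


class Mapping:
    def __init__(self):
        self.tree = {}   # page index -> HOLE or bytearray "folio"

    def read(self, index, fs_is_hole):
        """Buffered read of one page; returns PAGE_SIZE bytes."""
        entry = self.tree.get(index)
        if entry is None:
            if fs_is_hole(index):
                # Record that this index covers a hole; no folio
                # is allocated.
                entry = self.tree[index] = HOLE
            else:
                # Pretend the data was read from disk.
                entry = self.tree[index] = bytearray(b"D" * PAGE_SIZE)
        if entry is HOLE:
            # The "memset of the user buffer" case.
            return bytes(PAGE_SIZE)
        return bytes(entry)

    def write(self, index, offset, data):
        """Buffered write that stays within one page."""
        assert offset + len(data) <= PAGE_SIZE
        entry = self.tree.get(index)
        if entry is None or entry is HOLE:
            # Handled like a normal allocating write: a zeroed folio
            # replaces the exceptional entry. (Simplification: a
            # missing entry is treated as a hole, so read-modify-write
            # of on-disk data is not modelled.)
            entry = self.tree[index] = bytearray(PAGE_SIZE)
        entry[offset:offset + len(data)] = data
```

Reading index 0 of a sparse file leaves only HOLE in the tree; a
later write(0, 10, b"abc") swaps it for a zeroed folio containing
the new bytes.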

For read page faults, the zero page gets mapped (and maybe
accounted) via the vma rather than the mapping tree entry. For write
faults, a folio gets allocated and the exceptional entry replaced
before we call into ->page_mkwrite().

Invalidation simply removes the exceptional entries.
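
The fault and invalidation rules can be sketched the same way. This
is again a toy Python model, not kernel code: Vma, ptes, ZERO_PAGE
and the page_mkwrite callback are stand-ins invented for the example,
and it assumes the mapping tree entry for the faulting index has
already been populated:

```python
# Toy userspace model, not kernel code. All names are hypothetical.
PAGE_SIZE = 4096
HOLE = object()                  # stands in for an exceptional entry
ZERO_PAGE = bytes(PAGE_SIZE)     # the single shared zero page


class Mapping:
    def __init__(self, hole_indexes):
        self.tree = {i: HOLE for i in hole_indexes}

    def invalidate(self, start, end):
        """Drop exceptional entries in [start, end] (inclusive).

        Folios are left alone; their invalidation is not modelled.
        """
        for i in [i for i, e in self.tree.items()
                  if start <= i <= end and e is HOLE]:
            del self.tree[i]


class Vma:
    def __init__(self, mapping, page_mkwrite=lambda folio: None):
        self.mapping = mapping
        self.ptes = {}           # page index -> what the PTE maps
        self.page_mkwrite = page_mkwrite

    def fault(self, index, write):
        # KeyError here means "entry not populated", which this
        # sketch does not model.
        entry = self.mapping.tree[index]
        if entry is HOLE and not write:
            # Read fault over a hole: map the zero page via the vma,
            # the mapping tree entry stays exceptional.
            self.ptes[index] = ZERO_PAGE
            return
        if entry is HOLE:
            # Write fault over a hole: allocate a zeroed folio and
            # replace the exceptional entry first...
            entry = self.mapping.tree[index] = bytearray(PAGE_SIZE)
        if write:
            # ...then call into the filesystem.
            self.page_mkwrite(entry)
        self.ptes[index] = entry
```

The zero page only ever shows up in ptes, never in Mapping.tree,
which is the point of the scheme.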

This largely gets rid of needing to care about the zero page outside
of mmap() context where something needs to be mapped into the
userspace mm context. Let the page fault/mm context substitute the
zero page in the PTE mappings where necessary, but we don't need to
use and/or track the zero page in the page cache itself....

FWIW, this also lends itself to storing unwritten extent information
in exceptional entries. One of the problems we have is unwritten
extents can contain either zeros (been read) or data (been
overwritten in memory, but not flushed to disk). This is the problem
that SEEK_DATA has to navigate - it has to walk the page cache over
unwritten extents to determine if there is data over the unwritten
extent or not.

In this case, an exceptional entry gets added on read, which is then
replaced with an actual folio on write. Now SEEK_DATA can easily and
safely determine where the data actually lies over the unwritten
extent with a mapping tree walk instead of having to load and lock
each folio to check whether it is dirty or not....
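
A rough illustration of that walk, as a toy Python model rather than
kernel code. UNWRITTEN and seek_data() are names made up for the
sketch, a dict stands in for the mapping tree over one unwritten
extent, and indexes with no entry are assumed to have never been
touched in memory:

```python
# Toy userspace model, not kernel code. All names are hypothetical.
UNWRITTEN = object()     # exceptional entry added on read


def seek_data(tree, start, end):
    """Return the first page index in [start, end) that holds a
    folio, or None if the range holds no in-memory data.

    Only entry types are inspected: no folio is loaded or locked,
    and no dirty state is checked.
    """
    for index in sorted(i for i in tree if start <= i < end):
        if tree[index] is not UNWRITTEN:
            return index
    return None


# Pages 0 and 1 were read (zeros), page 3 was written in memory.
tree = {0: UNWRITTEN, 1: UNWRITTEN, 3: bytearray(4096)}
first_data = seek_data(tree, 0, 8)
```

Here first_data is 3, because pages 0 and 1 are known to be zeros
from their entry type alone.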

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx