On Mon, Feb 24, 2025 at 01:36:50PM -0800, Kalesh Singh wrote:
> Another possible way we can look at this: in the regressions shared
> above by the ELF padding regions, we are able to make these regions
> sparse (for *almost* all cases) -- solving the shared-zero page
> problem for file mappings, would also eliminate much of this overhead.
> So perhaps we should tackle this angle? If that's a more tangible
> solution?
>
> From the previous discussions that Matthew shared [7], it seems like
> Dave proposed an alternative to moving the extents to the VFS layer to
> invert the IO read path operations [8]. Maybe this is a more
> approachable solution since there is precedent for the same in the
> write path?
>
> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@xxxxxxxxxxxxxxxxxxxx/
> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@xxxxxxxxxxxxxxxxxxx/

Yes, if we are going to optimise away redundant zeros being stored
in the page cache over holes, we need to know where the holes in the
file are before the page cache is populated.

As for efficient hole tracking in the mapping tree, I suspect that
we should be looking at using exceptional entries in the mapping
tree for holes, not inserting multiple references to the zero folio.
i.e. the important information for data storage optimisation is that
the region covers a hole, not that it contains zeros.

For buffered reads, all that is required when such an exceptional
entry is returned is a memset of the user buffer. For buffered
writes, we simply treat it like a normal folio allocating write and
replace the exceptional entry with the allocated (and zeroed) folio.

For read page faults, the zero page gets mapped (and maybe
accounted) via the vma rather than the mapping tree entry. For write
faults, a folio gets allocated and the exceptional entry is replaced
before we call into ->page_mkwrite().

Invalidation simply removes the exceptional entries.

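To make the buffered read/write and invalidation handling described
above concrete, here is a toy userspace model. It is only a sketch
under stated assumptions: the mapping tree is reduced to a small
fixed array, and every name in it (toy_mapping, toy_mark_hole(),
toy_read(), toy_write(), toy_invalidate(), TOY_PAGE_SIZE) is
hypothetical and does not exist in the kernel. It ignores locking,
faults and writeback entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 16	/* tiny "page" to keep the example readable */
#define TOY_NR_SLOTS  8		/* hypothetical fixed-size "mapping tree" */

/* A slot is empty, a hole marker (no memory behind it), or a real folio. */
enum toy_kind { TOY_EMPTY = 0, TOY_HOLE, TOY_FOLIO };

struct toy_slot {
	enum toy_kind kind;
	unsigned char *data;	/* only valid when kind == TOY_FOLIO */
};

struct toy_mapping {
	struct toy_slot slots[TOY_NR_SLOTS];
};

/* Record that @index covers a hole. Nothing is allocated. */
void toy_mark_hole(struct toy_mapping *m, size_t index)
{
	assert(index < TOY_NR_SLOTS);
	if (m->slots[index].kind == TOY_EMPTY)
		m->slots[index].kind = TOY_HOLE;
}

/*
 * Buffered read: a hole entry needs nothing more than a memset of the
 * caller's buffer. Returns 0 on success, -1 if nothing is cached (a
 * real read path would have to ask the filesystem at that point).
 */
int toy_read(const struct toy_mapping *m, size_t index, unsigned char *buf)
{
	assert(index < TOY_NR_SLOTS);
	const struct toy_slot *s = &m->slots[index];

	switch (s->kind) {
	case TOY_HOLE:
		memset(buf, 0, TOY_PAGE_SIZE);
		return 0;
	case TOY_FOLIO:
		memcpy(buf, s->data, TOY_PAGE_SIZE);
		return 0;
	default:
		return -1;
	}
}

/*
 * Buffered write: treated like a normal allocating write. A hole (or
 * empty) slot is replaced by a freshly allocated, zeroed folio before
 * the data is copied in. Returns 0 on success, -1 on allocation failure.
 */
int toy_write(struct toy_mapping *m, size_t index, size_t off,
	      const void *src, size_t len)
{
	assert(index < TOY_NR_SLOTS);
	assert(off <= TOY_PAGE_SIZE && len <= TOY_PAGE_SIZE - off);
	struct toy_slot *s = &m->slots[index];

	if (s->kind != TOY_FOLIO) {
		unsigned char *p = calloc(1, TOY_PAGE_SIZE);

		if (!p)
			return -1;
		s->data = p;
		s->kind = TOY_FOLIO;
	}
	memcpy(s->data + off, src, len);
	return 0;
}

/* Invalidation: drop whatever is in the slot; a hole entry has nothing to free. */
void toy_invalidate(struct toy_mapping *m, size_t index)
{
	assert(index < TOY_NR_SLOTS);
	struct toy_slot *s = &m->slots[index];

	if (s->kind == TOY_FOLIO)
		free(s->data);
	s->data = NULL;
	s->kind = TOY_EMPTY;
}
```

The point of the model is that a hole entry costs no memory until
something writes to it: reads are satisfied by zeroing the caller's
buffer, and only toy_write() ever allocates.
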
This largely gets rid of needing to care about the zero page outside
of mmap() context where something needs to be mapped into the
userspace mm context. Let the page fault/mm context substitute the
zero page in the PTE mappings where necessary, but we don't need to
use and/or track the zero page in the page cache itself....

FWIW, this also lends itself to storing unwritten extent information
in exceptional entries. One of the problems we have is unwritten
extents can contain either zeros (been read) or data (been
overwritten in memory, but not flushed to disk). This is the problem
that SEEK_DATA has to navigate - it has to walk the page cache over
unwritten extents to determine if there is data over the unwritten
extent or not.

In this case, an exceptional entry gets added on read, which is then
replaced with an actual folio on write. Now SEEK_DATA can easily and
safely determine where the data actually lies over the unwritten
extent with a mapping tree walk instead of having to load and lock
each folio to check whether it is dirty or not....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx