On Thu, Feb 27, 2025 at 10:12:50PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 25, 2025 at 10:56:21AM +1100, Dave Chinner wrote:
> > > From the previous discussions that Matthew shared [7], it seems like
> > > Dave proposed an alternative to moving the extents to the VFS layer to
> > > invert the IO read path operations [8]. Maybe this is a more
> > > approachable solution since there is precedent for the same in the
> > > write path?
> > >
> > > [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@xxxxxxxxxxxxxxxxxxxx/
> > > [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@xxxxxxxxxxxxxxxxxxx/
> >
> > Yes, if we are going to optimise away redundant zeros being stored
> > in the page cache over holes, we need to know where the holes in the
> > file are before the page cache is populated.
>
> Well, you shot that down when I started trying to flesh it out:
> https://lore.kernel.org/linux-fsdevel/Zs+2u3%2FUsoaUHuid@xxxxxxxxxxxxxxxxxxx/

No, I shot down the idea of having the page cache maintain a generic
cache of file offset to LBA address mappings outside the filesystem.
Having the filesystem insert a special 'this is a hole' entry into
the mapping tree instead of allocating and inserting a page full of
zeroes is not an extent cache - it's just a different way of
representing a data range that is known to always contain zeroes.

> > As for efficient hole tracking in the mapping tree, I suspect that
> > we should be looking at using exceptional entries in the mapping
> > tree for holes, not inserting multiple references to the zero folio.
> > i.e. the important information for data storage optimisation is that
> > the region covers a hole, not that it contains zeros.
>
> The xarray is very much optimised for storing power-of-two sized &
> aligned objects. It makes no sense to try to track extents using the
> mapping tree.

Certainly. I'm not suggesting that we do this at all, and ....
> Now, if we abandon the radix tree for the maple tree, we
> could talk about storing zero extents in the same data structure.
> But that's a big change with potentially significant downsides.
> It's something I want to play with, but I'm a little busy right now.

.... I still do not want the page cache to try to maintain a block
mapping/extent cache in addition to what the filesystem must
already maintain for the reasons I have previously given.

> > For buffered reads, all that is required when such an exceptional
> > entry is returned is a memset of the user buffer. For buffered
> > writes, we simply treat it like a normal folio allocating write and
> > replace the exceptional entry with the allocated (and zeroed) folio.
>
> ... and unmap the zero page from any mappings.

Sure. That's just a call to unmap_mapping_range(), yes?

> > For read page faults, the zero page gets mapped (and maybe
> > accounted) via the vma rather than the mapping tree entry. For write
> > faults, a folio gets allocated and the exception entry replaced
> > before we call into ->page_mkwrite().
> >
> > Invalidation simply removes the exceptional entries.
>
> ... and unmap the zero page from any mappings.

Invalidation already calls unmap_mapping_range(), so this should
already be handled, right?

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
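
The hole-entry semantics discussed above (buffered read zeroes the
user buffer, buffered write replaces the entry with a zeroed folio,
invalidation drops the entry) can be illustrated with a small
user-space toy model. This is only a sketch and not kernel code:
ToyMapping, HOLE, insert_hole() and every other name below are
hypothetical, a Python dict stands in for the mapping tree, and page
faults and the unmap_mapping_range() step are not modelled at all.

```python
# Toy user-space model only. None of these names exist in the Linux
# kernel; they are assumptions made purely for illustration.
PAGE_SIZE = 4096


class _Hole:
    """Marker stored in the mapping instead of a page full of zeroes."""


HOLE = _Hole()


class ToyMapping:
    def __init__(self):
        self.slots = {}            # page index -> HOLE or bytearray(PAGE_SIZE)
        self.pages_allocated = 0   # counts real page allocations

    def insert_hole(self, index):
        # The filesystem reports a hole: record that fact, allocate nothing.
        self.slots[index] = HOLE

    def read(self, index, user_buf):
        # Buffered read. Raises KeyError if the index is not cached.
        assert len(user_buf) <= PAGE_SIZE
        entry = self.slots[index]
        if entry is HOLE:
            # Hole entry: zero the user buffer (the "memset" case).
            user_buf[:] = bytes(len(user_buf))
        else:
            user_buf[:] = entry[:len(user_buf)]

    def write(self, index, offset, data):
        # Buffered write: a hole is handled like an allocating write,
        # replacing the marker with a freshly zeroed page.
        assert offset + len(data) <= PAGE_SIZE
        entry = self.slots.get(index)
        if entry is None or entry is HOLE:
            entry = bytearray(PAGE_SIZE)
            self.pages_allocated += 1
            self.slots[index] = entry
        entry[offset:offset + len(data)] = data

    def invalidate(self, start, end):
        # Invalidation removes every entry in [start, end], holes included.
        for index in [i for i in self.slots if start <= i <= end]:
            del self.slots[index]
```

In this model, reading from a hole never increments pages_allocated;
only the first write to that index does, which is the memory saving
the exceptional-entry approach is after.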