Hi folks, I've recently been looking into what is involved in sharing page cache pages for shared extents in a filesystem. That is, create a file, reflink it so there's two files but only one copy of the data on disk, then read both files. Right now, we get two copies of the data in the page cache - one in each inode mapping tree. If we scale this up to a container host which is using reflink trees it's shared root images, there might be hundreds of copies of the same data held in cache (i.e. one page per container). Given that the filesystem knows that the underlying data extent is shared when we go to read it, it's relatively easy to add mechanisms to the filesystem to return the same page for all attempts to read the from a shared extent from all inodes that share it. However, the problem I'm getting stuck on is that the page cache itself can't handle inserting a single page into multiple page cache mapping trees. i.e. The page has a single pointer to the mapping address space, and the mapping has a single pointer back to the owner inode. As such, a cached page has a 1:1 mapping to it's host inode and this structure seems to be assumed rather widely through the code. The problem is somewhat limited by the fact that only clean, read-only pages would be shared, and the attempt to write/dirty a shared page in a mapping would trigger a COW operation in the filesystem which would invalidate that inode's shared page and replace it with a new, inode-private page that could be written to. This still requires us to be able to find the right inode from the shared page context to run the COW operation. Luckily, the IO path already has an inode pointer, and the page fault path provides us with the inode via file_inode(vmf->vma->vm_file) so we don't actually need page->mapping->host in these paths. Along these lines I've thought about using a "shared mapping" that is associated with the filesystem rather than a specific inode (like a bdev mapping), but that's no good because if page->mapping != inode->i_mapping the page is consider to have been invalidated and should be considered invalid. Further - a page has a single, fixed index into the mapping tree (i.e. page->index), so this prevents arbitrary page sharing across inodes (the "deduplication triggered shared extent" case). And we can't really get rid of the page index, because that's how the page finds itself in a mapping tree. This leads me to think about crazy schemes like allocating a "referring struct page" that is allocated for every reference to a shared cache page and chain them all to the real struct page sorta like we do for compound pages. That would give us a unique struct page for each mapping tree and solve many of the issues, but I'm not sure how viable such a concept would be. I'm sure there's more issues than I've outlined here, but I haven't gone deeper than this because I've got to solve the one to many problem first. I don't know if anyone has looked at this in any detail, so I don't know what ideas, patches, crazy schemes, etc might already exist out there. Right now I'm just looking for information to narrow down what I need to look at - finding what rabbit holes have already been explored and what dragons are already known about would help an awful lot right now. Anyone? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx