On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote:
> On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote:
> > > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> The MM POV (and the justification for both the acks and the naks of
> the patchset) is that it's a generic, untyped compound page
> abstraction, which applies to file, anon, slab, networking
> pages. Certainly, the folio patches as of right now also convert anon
> page handling to the folio. If followed to its conclusion, the folio
> will have plenty of members and API functions for non-pagecache users
> and look pretty much like struct page today, just with a dynamic size.
>
> I know Kent was surprised by this. I know Dave Chinner suggested to
> call it "cache page" or "cage" early on, which also suggests an
> understanding of a *dedicated* cache page descriptor.

Don't take a flippant comment I made in a bikeshed as any sort of representation of what I think about this current situation. I've largely been silent because of your history of yelling incoherently in response to anything I say that you don't agree with.

But now that you've explicitly drawn me into this discussion, I'll point out that I'm one of the very few people in the wider Linux mm/fs community who has any *direct experience* with the cache handle based architecture being advocated for here. I don't agree with your assertion that cache handle based objects are the way forward, so please read and try to understand what I've just put a couple of hours into writing before you start shouting. Please?

---

Ok, so this cache page descriptor/handle/object architecture has been implemented in other operating systems. It's the solution that Irix implemented back in the early _1990s_ via its chunk cache. I've talked about this a few times in the past 15 years, so I guess I'll talk about it again, e.g. at LSFMM 2014, where I said "we don't really want to go down that path" in reference to supporting sector sizes > PAGE_SIZE:

https://lwn.net/Articles/592101/

So, in more gory detail, here is why I don't think we really want to go down that path.....

The Irix chunk cache sat between the low layer global, disk address indexed buffer cache[1] and the high layer per-mm-context page cache used for mmap(). A "chunk" was a variable sized object indexed by file offset on a per-inode AVL tree - basically the same caching architecture that our current per-inode mapping tree uses to index pages. But unlike the Linux page cache, these chunks were an extension of the low level buffer cache. Hence they were also indexed by physical disk address, and their life-cycle was managed by the buffer cache shrinker rather than the mm-based page cache reclaim algorithms.

Chunks were built from page cache pages, and pages pointed back to the chunk that they belonged to. Chunks needed their own locking. IO was done based on chunks, not pages. Filesystems decided the size of chunks, not the page cache. Pages attached to chunks could be of any hardware supported size - the only limitation was that all pages attached to a chunk had to be the same size. A large hardware page in the page cache could be mapped by multiple smaller chunks. A chunk made up of multiple hardware pages could vmap its contents if the user needed contiguous access.[2]
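To put that description in concrete terms, a chunk was conceptually something like the sketch below. This is purely illustrative - the names and fields are mine, expressed with Linux types rather than the actual Irix code:

    /*
     * Illustrative sketch only: roughly what an Irix-style chunk would
     * look like if written in Linux terms. None of this exists; the
     * names and fields are made up to match the description above.
     */
    struct chunk {
            struct rb_node  f_node;     /* per-inode tree, indexed by file
                                         * offset (an AVL tree on Irix) */
            struct inode    *inode;     /* owning inode */
            loff_t          f_offset;   /* file offset this chunk covers */
            size_t          length;     /* variable size, chosen by the fs */
            sector_t        daddr;      /* also indexed by physical disk addr */
            struct mutex    lock;       /* chunks need their own locking */
            unsigned long   state;      /* dirty/IO state lives here, not in
                                         * the pages */
            int             nr_pages;   /* all attached pages are the same size */
            struct page     **pages;    /* pages point back at their owning chunk */
            void            *vaddr;     /* optional vmap for contiguous access */
    };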
Chunks were largely unaware of ongoing mmap operations. If a page fault hit a page that had no associated chunk (e.g. a page originally populated into the page cache by a read fault into a hole, or a cached page whose chunk the buffer cache had torn down), a new chunk had to be built. The code needed to handle partially populated chunks in this sort of situation was really, really nasty, as it required interacting with the filesystem and having the filesystem take locks and call back up into the page cache to build the new chunk in the IO path.

Similarly, dirty page state from page faults needed to be propagated down to the chunks, because dirty tracking for writeback was done at the chunk level, not the page cache level. This was *really* nasty, because if the page didn't have a chunk already built, one couldn't be built in a write fault context. Hence sweeping dirty page state down to the IO subsystem was handled periodically by a pdflush daemon, which could work with the filesystem to build new (dirty) chunks and insert them into the chunk cache for writeback.

Similar problems will have to be considered during design for Linux, because dirty tracking for writeback in Linux is done at the per-inode mapping tree level. Hence things like ->page_mkwrite are going to have to dig through the page to the cached chunk and mark the chunk dirty rather than the page. Whether deadlocks are going to have to be worked around is an open question; I don't have answers to these concerns because nobody has proposed an architecture detailed enough to explore these situations.

This also leads to really interesting questions about how page and chunk state w.r.t. IO is kept coherent. e.g. if we are not tracking IO state on individual page cache pages, how do we ensure all the pages stay stable when IO is being done to a block device that requires stable pages? Along similar lines: what's the interlock mechanism that we'll use to ensure that IO or truncate can lock out per-page accesses if the filesystem IO paths no longer directly interact with page state any more? I also wonder how we will manage cached chunks if the filesystem currently relies on page level locking for atomicity, concurrency and existence guarantees (e.g. ext4 buffered IO)?

IOWs, it is extremely likely that there will still be situations where we have to blast directly through the cache handle abstraction to manipulate the objects behind the abstraction so that we can make specific functionality work correctly, without regressions and/or efficiently.

Hence the biggest issue that a chunk-like cache handle introduces is the complex multi-dimensional state update interactions. These will require more complex locking, and that locking will be required to work in arbitrary orders for operations to be performed safely and atomically. e.g. IO needs inode->chunk->page order, whilst page migration/compaction needs page->chunk->inode order. Page migration and compaction on Irix had some unfixable deadlocks in rare corner cases because of locking inversion problems between filesystems, chunks, pages and mm contexts. I don't see any fundamental difference in the Linux architecture that makes me think it will be any different.[3]

I've got a war chest full of chunk cache related data corruption bugs on Irix that were crazy hard to reproduce and even more difficult to fix. At least half the bugs I had to fix in the chunk cache over 3-4 years as maintainer were data corruption bugs resulting from inconsistencies in multi-object state updates.
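To spell out the sort of inversion I'm talking about, using the illustrative struct chunk from the sketch above (again, hypothetical code - this is not Irix and not any current proposal, just the lock ordering problem written down):

    /* IO path: locks are taken top down - inode, then chunk, then page. */
    static void chunk_writeback(struct chunk *chunk)
    {
            inode_lock(chunk->inode);       /* 1: inode */
            mutex_lock(&chunk->lock);       /* 2: chunk */
            lock_page(chunk->pages[0]);     /* 3: page  */
            /* ...issue IO, then unlock in reverse order... */
    }

    /* Migration/compaction path: locks are taken bottom up. */
    static void chunk_migrate_page(struct page *page)
    {
            struct chunk *chunk = page_chunk(page); /* hypothetical back-pointer lookup */

            lock_page(page);                /* 1: page  */
            mutex_lock(&chunk->lock);       /* 2: chunk */
            inode_lock(chunk->inode);       /* 3: inode */
            /* ...migrate the page, then unlock in reverse order... */
    }

Run those two paths concurrently against the same chunk and you have a classic lock order inversion; that's the shape of the deadlocks that were never fully fixed on Irix.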
I've got a whole 'nother barrel full of problem cases that revolve around memory reclaim, too. The cache handles really need to pin the pages that back them, and so we can't really do access-optimised, per-page based reclaim of file-backed pages any more. The Irix chunk cache had its own LRUs and shrinker[4] to manage the life-cycles of chunks under memory pressure, and the mm code had its own independent page cache shrinker. Hence pages didn't get freed until both the chunk cache and the page cache had released their references to them.

IOWs, we're going to end up needing to reclaim cache handles before we can do page reclaim. This needs careful thought and will likely need a complete redesign of the vmscan.c algorithms to work properly. I really, really don't want to see awful layer violations like bufferhead reclaim getting hacked into the low layer page reclaim algorithms ever again. We're still paying the price for that.

And given the way Linux uses the mapping tree for keeping stuff like per-page working set refault information around after the pages have been removed from the page cache, I really struggle to see how functionality like this can be supported with a chunk based cache index that doesn't actually have direct tracking of individual page access and reclaim behaviour.

We're also going to need a range-based indexing mechanism for the mapping tree if we want to avoid the inefficiencies that mapping large objects into the XArray requires. We'll need an RCU-aware tree of some kind, be it a btree, a maple tree or something else, so that we can maintain lockless lookups of cache objects. That infrastructure doesn't exist yet, either.

And on that note, it is worth keeping in mind that one of the reasons the current Linux page cache architecture scales better for single files than the Irix architecture ever did is that the Irix chunk cache could not be made lockless. The requirements for atomic multi-dimensional index updates and coherent, atomic multi-object state changes could never be solved in a lockless manner. It was not for lack of trying or talent; people way smarter than me couldn't solve that problem. So there's an open question as to whether we can maintain the existing lockless algorithms when a chunk cache is layered over the top of the page cache.

IOWs, I see significant, fundamental problems that chunk cache architectures suffer from. I know there are inherent problems with state coherency, locking, complexity in the IO path, etc. Some of these problems will not be discovered until the implementation is well under way. Some of these problems may well be unsolvable, too. And until there's an actual model proposed of how everything will interact and work, we can't actually do any of this architectural analysis to determine whether it might work or not. The chunk cache proposal is really just a grand thought experiment at this point in time.

OTOH, folios have none of these problems and are here right now. Sure, they have their own issues, but we can see them for what they are given the code is already out there, and pretty much everyone sees them as a big step forwards. Folios don't prevent a chunk cache from being implemented. In fact, to make folios highly efficient, we have to do things a chunk cache would also require to be implemented, e.g. range-based cache indexing. Unlike a chunk cache, folios don't depend on this being done first - they stand alone without those changes, and will only improve from making them.

IOWs, you can't use "folios being mapped 512 times into the mapping tree" as a reason the chunk cache is better - the chunk cache requires this same problem to be solved, but the chunk cache needs efficient range lookups done *before* it can be implemented, rather than having them provided afterwards as an optimisation.
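For clarity, "efficient range lookups" means the mapping tree needs to be able to answer something like the query below in a single lockless descent. This is a hypothetical interface - nothing like it exists today - shown only to make the requirement concrete:

    /*
     * Hypothetical primitive a range-indexed mapping tree would have to
     * provide for a chunk cache. Today's mapping tree is indexed in
     * PAGE_SIZE units (pgoff_t), so lookups like
     * find_get_page(mapping, pos >> PAGE_SHIFT) return one page-sized
     * slot at a time.
     *
     * "Return the single variable-sized cache object that covers the
     *  byte range [pos, pos + len), or NULL if nothing is cached there."
     */
    struct cache_object *mapping_lookup_range(struct address_space *mapping,
                                              loff_t pos, size_t len);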
IOWs, if we want to move towards a chunk cache, the first step is to move to folios to allow large objects in the page cache. Then we can implement a lockless, range-based index mechanism for the mapping tree. Then we can look at replacing folios with a typed cache handle without having to worry about all the whacky multi-object coherency problems, because each handle only needs to point to a single folio. Then we can work out all the memory reclaim issues, the locking issues, sort out the API that filesystems use instead of folios, etc. - everything that needs to be done when cache handles are introduced.

And once we've worked through all that, then we can add support for multiple folios within a single cache object and discover all the really hard problems that this exposes. At that point, the cache objects are no longer dependent on folios to provide objects > PAGE_SIZE to the filesystems, and we can start to remove folios from the mm code and replace them with something else that the cache handle uses to provide the backing store to the filesystems...

Seriously, I have given a lot of thought over the years to a chunk cache for Linux. Right now, a chunk cache is a solution looking for a problem to solve. Unless there's an overall architectural mm plan being worked towards that requires a chunk cache, I just don't see the justification for doing all this work, because the first two steps above get filesystems everything they are currently asking for. Everything else past that is really just an experiment...

> I agree with what I think the filesystems want: instead of an untyped,
> variable-sized block of memory, I think we should have a typed page
> cache desciptor.

I don't think that's what fs devs want at all. It's what you think fs devs want. If you'd been listening to us the same way that Willy has been for the past year, maybe you'd have a different opinion.

Indeed, we don't actually need a new page cache abstraction. fs/iomap already provides filesystems with a complete, efficient page cache abstraction that only requires filesystems to provide block mapping services. Filesystems using iomap do not interact with the page cache at all. And David Howells is working with Willy and all the network fs devs to build an equivalent generic netfs page cache abstraction based on folios that is supported by the major netfs client implementations in the kernel.

IOWs, fs devs don't need a new page cache abstraction - we've got our own abstractions tailored directly to our needs. What we need are API cleanups, consistency in object access mechanisms and dynamic object size support to simplify and fill out the feature set of the abstractions we've already built.
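To make "only requires filesystems to provide block mapping services" concrete, this is roughly the shape of what a filesystem supplies. It's a simplified sketch - the "myfs" names are made up, and the real interface lives in include/linux/iomap.h:

    #include <linux/iomap.h>

    /*
     * Simplified sketch of an iomap block mapping callback. A real
     * implementation would look up the filesystem's extent map here;
     * this one just reports a hole so that the sketch is self-contained.
     */
    static int myfs_iomap_begin(struct inode *inode, loff_t pos,
                    loff_t length, unsigned flags, struct iomap *iomap,
                    struct iomap *srcmap)
    {
            iomap->type = IOMAP_HOLE;
            iomap->addr = IOMAP_NULL_ADDR;
            iomap->offset = pos;
            iomap->length = length;
            return 0;
    }

    static const struct iomap_ops myfs_iomap_ops = {
            .iomap_begin    = myfs_iomap_begin,
    };

    /* The filesystem's buffered write path then reduces to something like: */
    written = iomap_file_buffered_write(iocb, from, &myfs_iomap_ops);

All of the page cache interaction - page allocation, locking, dirtying, writeback state - happens inside fs/iomap/, behind that block mapping interface.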
The reason so many fs developers are pushing *hard* for folios is that they provide what we've been asking for individually over the last few years. Willy has done a great job of working with the fs developers and getting feedback at every step of the process, and you can see that in the amount of in-progress work that is already based on folios. And folios provide those cleanups and that new functionality without changing or invalidating any of the knowledge we collectively hold about how the page cache works. That's _pure gold_ right there.

In summary:

If you don't know anything about the architecture and limitations of the XFS buffer cache (also read the footnotes), you'd do very well to pay heed to what I've said in this email, considering the direct relevance its history has to the alternative cache handle proposal being made here. We also need to consider the evidence that filesystems do not actually need a new page cache abstraction - they just need the existing page cache to be able to index objects larger than PAGE_SIZE.

So with all that in mind, I consider folios (or whatever we end up calling them) to be the best stepping stone towards a PAGE_SIZE-independent future that we currently have. Folios don't prevent us from introducing a cache handle based architecture if we have a compelling reason to do so in the future, nor do they stop anyone working on such infrastructure in parallel if it really is necessary. But the reality is that we don't need such a fundamental architectural change to provide the functionality that folios give us _right now_.

Folios are not perfect, but they are here and they solve many issues we need solved. We're never going to have a perfect solution that everyone agrees with, so the real question is "are folios good enough?". To me the answer is a resounding yes.

Cheers,

Dave.

[1] fs/xfs/xfs_buf.c is an example of a high performance, handle based, variable object size cache that abstracts away the details of the data store being allocated from slab, discontiguous pages, contiguous pages or vmapped memory.

[2] It is basically a two decade old re-implementation of the Irix low layer global disk-addressed buffer cache, modernised and tailored directly to the needs of XFS metadata caching.

[3] Keep in mind that the xfs_buf cache used to be page cache backed. The page cache provided the caching and memory reclaim infrastructure to the xfs_buf handles - and so we do actually have recent direct experience on Linux with the architecture you are proposing here. This architecture proved to be a major limitation from performance, multi-object state coherency and cache residency prioritisation perspectives. It really sucked with systems that had 64kB page sizes and 4kB metadata block sizes, and ....

[4] So we went back to the old Irix way of managing the cache - our own buffer based LRUs and aging mechanisms, with memory reclaim run by shrinkers using buffer-type based priorities. We use bulk page allocation for buffers that are >= PAGE_SIZE, and slab allocation for buffers < PAGE_SIZE. That's exactly what you are suggesting we do with 2MB sized base pages, but without having to care about mmap() at all.

--
Dave Chinner
david@xxxxxxxxxxxxx