On Sun, Sep 17, 2023 at 10:30:55AM -0700, Linus Torvalds wrote:
> On Sat, 16 Sept 2023 at 18:40, NeilBrown <neilb@xxxxxxx> wrote:
> >
> > I'm not sure the technical argument was particularly coherent. I think
> > there is a broad desire to deprecate and remove the buffer cache.

....

> In other words, the buffer cache is
>
>  - simple
>
>  - self-contained
>
>  - supports 20+ legacy filesystems
>
> so the whole "let's deprecate and remove it" is literally crazy
> ranting and whining and completely mis-placed.

But that isn't what this thread is about. This is a strawman that you're spending a lot of time and effort to stand up and then knock down.

Let's start from a well known problem we currently face: the per-inode page cache struggles to scale to the bandwidth capabilities of modern storage. We've known about this for well over a decade in high performance IO circles, but now we are hitting it with cheap consumer level storage. These per-inode bandwidth scalability problems are one of the driving reasons behind the conversion to folios and the introduction of high-order folios into the page cache.

One of the problems being raised in the high-order folio context is that *bufferheads* and high-order folios don't really go together well. The pointer-chasing model that per-block bufferhead iteration requires to update state and retrieve mapping information just does not scale to marshalling millions of objects a second through the page cache.

The best solution is to not use bufferheads at all for file data. That's the direction the page cache IO stack is moving; we are already there with iomap and hence XFS. With the recent introduction of high-order folios into the buffered write path, single file write throughput on a pcie4.0 ssd went from ~2.5GB/s consuming 5 CPUs in mapping lock contention to saturating the device at over 7GB/s whilst also providing a 70% reduction in total CPU usage. This result came about simply by reducing mapping lock traffic by a couple of orders of magnitude across the write syscall, IO submission, IO completion and memory reclaim paths....

This was easy to do with iomap based filesystems because they don't carry per-block filesystem structures for every folio cached in the page cache - we carry a single object per folio that holds the 2 bits of per-filesystem-block state we need for each block the folio maps. Compare that to a bufferhead - it uses 56 bytes of memory per filesystem block that is cached. Hence in modern systems with hundreds of GB to TB of RAM and IO rates measured in multiple GB/s, this is a substantial cost in terms of page cache efficiency and resource usage when using bufferheads in the data path.

The benefits of moving data IO from bufferheads to iomap are significant. However, that's not an easy conversion. There's a lot of work to validate the integrity of the IO path whilst making such a change. It's complex and requires a fair bit of expertise in how the IO path works, filesystem locking models, internal fs block mapping and allocation routines, etc. And some filesystems flush data through the buffer cache or track data writes through their journals via bufferheads, so actually removing bufferheads for them is not an easy task. So we have to consider that maybe it is less work to make high-order folios work with bufferheads.
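To put rough numbers on the bufferhead-vs-per-folio-state comparison above, here's a back-of-the-envelope sketch in userspace C. The 56 bytes per bufferhead is the figure quoted above; the 2MB folio, 4kB block size, 2-bits-per-block layout and the size of the per-folio header are illustrative assumptions, not the real kernel structures:

/*
 * Rough per-folio cache-state overhead: a bufferhead for every
 * filesystem block versus a single per-folio object carrying 2 bits
 * of state per block.  Sizes are illustrative, not the kernel's.
 */
#include <stdio.h>

#define FOLIO_SIZE      (2u << 20)  /* 2MB high-order folio (assumed) */
#define FS_BLOCK_SIZE   4096u       /* 4kB filesystem blocks (assumed) */
#define BH_SIZE         56u         /* bytes per bufferhead (from above) */
#define PER_FOLIO_HDR   16u         /* assumed fixed per-folio header */
#define BITS_PER_BLOCK  2u          /* per-block state bits (from above) */

int main(void)
{
	unsigned int blocks = FOLIO_SIZE / FS_BLOCK_SIZE;
	unsigned int bh_bytes = blocks * BH_SIZE;
	unsigned int folio_bytes = PER_FOLIO_HDR +
				   (blocks * BITS_PER_BLOCK + 7) / 8;

	printf("blocks per folio:       %u\n", blocks);
	printf("bufferhead model:       %u bytes\n", bh_bytes);
	printf("per-folio bitmap model: %u bytes\n", folio_bytes);
	return 0;
}

That's 512 blocks per folio, so roughly 28kB of bufferheads for each 2MB folio versus a couple of hundred bytes of per-folio state. Multiply that by millions of cached folios and the efficiency gap is obvious.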
And that option is where we start to get into the maintenance problems with old filesystems using bufferheads - how do we ensure that the changes needed for high-order folio support in bufferheads do not break one of these old filesystems that use them?

That comes down to a simple question: if we can't actually test all these old filesystems, how do we even know that they work correctly right now? Given that we are supposed to be providing some level of quality assurance to the users of these filesystems, are they going to be happy running untested code that nobody really knows whether it works properly or not?

The buffer cache and the fact that legacy filesystems use it is the least of our worries - the problems are with the complex APIs, architecture and interactions at the intersection point of shared page cache and filesystem state.

The discussion is a reflection on how difficult it is to change a large, complex code base where significant portions of it are untestable. Regardless of which way we end up deciding to move forwards, there is *lots* of work that needs to be done, and significant burdens remain on the people who need to make the API changes to get where we need to be. We want to try to minimise that burden so we can make progress as fast as possible. Getting rid of unmaintained, untestable code is low hanging fruit.

Nobody is talking about getting rid of the buffer cache; we can ensure that the buffer cache continues to work fairly easily. It's all the other complex code in the filesystems that is the problem. What we are actually talking about is how to manage code which is unmaintained, possibly broken, and which nobody can and/or will fix. Nobody benefits from the kernel carrying code we can't easily maintain, test or fix, so working out how to deal with this problem efficiently is a key part of the decisions that need to be made.

Hence reducing this whole complex situation to "the buffer cache is simple, and people suggesting we deprecate and remove it are just ranting and whining" is a pretty significant misrepresentation of the situation we find ourselves in.

> Was this enough technical information for people?
>
> And can we now all just admit that anybody who says "remove the buffer
> cache" is so uninformed about what they are speaking of that we can
> just ignore said whining?

Wow. Just wow.

After being called out for abusive behaviour, you immediately call everyone who disagrees with you "uninformed" and suggest we should "just ignore said whining"?

Which bit of "this is unacceptable behaviour" didn't you understand, Linus?

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx