On Sun, Sep 17, 2023 at 10:30:55AM -0700, Linus Torvalds wrote:
> On Sat, 16 Sept 2023 at 18:40, NeilBrown <neilb@xxxxxxx> wrote:
> >
> > I'm not sure the technical argument was particularly coherent. I think
> > there is a broad desire to deprecate and remove the buffer cache.

....

> In other words, the buffer cache is
>
>  - simple
>
>  - self-contained
>
>  - supports 20+ legacy filesystems
>
> so the whole "let's deprecate and remove it" is literally crazy
> ranting and whining and completely mis-placed.

But that isn't what this thread is about. This is a strawman that you're spending a lot of time and effort to stand up and then knock down.

Let's start from a well known problem we currently face: the per-inode page cache struggles to scale to the bandwidth capabilities of modern storage. We've known about this for well over a decade in high performance IO circles, but now we are hitting it with cheap consumer level storage. These per-inode bandwidth scalability problems are one of the driving reasons behind the conversion to folios and the introduction of high-order folios into the page cache.

One of the problems being raised in the high-order folio context is that *bufferheads* and high-order folios don't really go together well. The pointer-chasing model that per-block bufferhead iteration requires to update state and retrieve mapping information just does not scale to marshalling millions of objects a second through the page cache.

The best solution is to not use bufferheads at all for file data. That's the direction the page cache IO stack is moving; we are already there with iomap and hence XFS. With the recent introduction of high-order folios into the buffered write path, single file write throughput on a pcie4.0 ssd went from ~2.5GB/s consuming 5 CPUs in mapping lock contention to saturating the device at over 7GB/s whilst also providing a 70% reduction in total CPU usage. This result came about simply by reducing mapping lock traffic by a couple of orders of magnitude across the write syscall, IO submission, IO completion and memory reclaim paths....

This was easy to do with iomap based filesystems because they don't carry per-block filesystem structures for every folio cached in the page cache - we carry a single object per folio that holds the 2 bits of per-filesystem-block state we need for each block the folio maps. Compare that to a bufferhead - it uses 56 bytes of memory per filesystem block that is cached. Hence in modern systems with hundreds of GB to TB of RAM and IO rates measured in multiple GB/s, this is a substantial cost in terms of page cache efficiency and resource usage when using bufferheads in the data path.

The benefits of moving data IO from bufferheads to iomap are significant. However, that's not an easy conversion. There's a lot of work to validate the integrity of the IO path whilst making such a change. It's complex and requires a fair bit of expertise in how the IO path works, filesystem locking models, internal fs block mapping and allocation routines, etc. And some filesystems flush data through the buffer cache or track data writes through their journals via bufferheads, so actually removing bufferheads for them is not an easy task. So we have to consider that maybe it is less work to make high-order folios work with bufferheads.
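To put rough numbers on the bufferhead-vs-per-folio-state comparison above, here's a back-of-the-envelope sketch in userspace C. The 56 bytes per bufferhead is the figure quoted above; the 2MB folio, 4kB block size, 2-bits-per-block layout and the size of the per-folio header are illustrative assumptions, not the real kernel structures:

/*
 * Rough per-folio cache-state overhead: a bufferhead for every
 * filesystem block versus a single per-folio object carrying 2 bits
 * of state per block.  Sizes are illustrative, not the kernel's.
 */
#include <stdio.h>

#define FOLIO_SIZE      (2u << 20)  /* 2MB high-order folio (assumed) */
#define FS_BLOCK_SIZE   4096u       /* 4kB filesystem blocks (assumed) */
#define BH_SIZE         56u         /* bytes per bufferhead (from above) */
#define PER_FOLIO_HDR   16u         /* assumed fixed per-folio header */
#define BITS_PER_BLOCK  2u          /* per-block state bits (from above) */

int main(void)
{
	unsigned int blocks = FOLIO_SIZE / FS_BLOCK_SIZE;
	unsigned int bh_bytes = blocks * BH_SIZE;
	unsigned int folio_bytes = PER_FOLIO_HDR +
				   (blocks * BITS_PER_BLOCK + 7) / 8;

	printf("blocks per folio:       %u\n", blocks);
	printf("bufferhead model:       %u bytes\n", bh_bytes);
	printf("per-folio bitmap model: %u bytes\n", folio_bytes);
	return 0;
}

That's 512 blocks per folio, so roughly 28kB of bufferheads for each 2MB folio versus a couple of hundred bytes of per-folio state. Multiply that by millions of cached folios and the efficiency gap is obvious.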
And that option is where we start to get into the maintenance problems with old filesystems using bufferheads - how do we ensure that the changes needed for high-order folio support in bufferheads do not break one of these old filesystems that use them?

That comes down to a simple question: if we can't actually test all these old filesystems, how do we even know that they work correctly right now? Given that we are supposed to be providing some level of quality assurance to the users of these filesystems, are they going to be happy running untested code that nobody really knows whether it works properly or not?

The buffer cache and the fact that legacy filesystems use it is the least of our worries - the problems are with the complex APIs, architecture and interactions at the intersection point of shared page cache and filesystem state.

The discussion is a reflection on how difficult it is to change a large, complex code base where significant portions of it are untestable. Regardless of which way we end up deciding to move forwards, there is *lots* of work that needs to be done, and significant burdens remain on the people who need to make the API changes to get where we need to be. We want to try to minimise that burden so we can make progress as fast as possible. Getting rid of unmaintained, untestable code is low hanging fruit.

Nobody is talking about getting rid of the buffer cache; we can ensure that the buffer cache continues to work fairly easily. It's all the other complex code in the filesystems that is the problem. What we are actually talking about is how to manage code which is unmaintained, possibly broken, and which nobody can and/or will fix. Nobody benefits from the kernel carrying code we can't easily maintain, test or fix, so working out how to deal with this problem efficiently is a key part of the decisions that need to be made.

Hence reducing this whole complex situation to "the buffer cache is simple, and people suggesting we deprecate and remove it are just ranting and whining" is a pretty significant misrepresentation of the situation we find ourselves in.

> Was this enough technical information for people?
>
> And can we now all just admit that anybody who says "remove the buffer
> cache" is so uninformed about what they are speaking of that we can
> just ignore said whining?

Wow. Just wow.

After being called out for abusive behaviour, you immediately call everyone who disagrees with you "uninformed" and suggest we should "just ignore said whining"?

Which bit of "this is unacceptable behaviour" didn't you understand, Linus?

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx