Re: [MAINTAINERS/KERNEL SUMMIT] Trust and maintenance of file systems

"Theodore Ts'o" <tytso@xxxxxxx> · Sun, 17 Sep 2023 14:57:42 -0400

On Sun, Sep 17, 2023 at 10:30:55AM -0700, Linus Torvalds wrote:
> And yes, *within* the context of a filesystem or two, the whole "try
> to avoid the buffer cache" can be a real thing.

Ext4 uses buffer_heads, and wasn't on your list because we don't use
sb_bread().  And we are thinking about getting rid of buffer heads,
mostly because (a) we want to have more control over which metadata
blocks get cached and which don't, and (b) the buffer cache doesn't
have a callback function to inform the file system if the writeback
fails, so that the file system can try to work around the issue, or at
the very least, mark the file system as corrupted and signal it via
fsnotify.
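
To make point (b) concrete, here is a minimal userspace sketch of the
kind of hook being asked for.  It is only a model under stated
assumptions: struct meta_buf, struct fs_state, on_write_error,
fs_write_error() and cache_end_write() are all hypothetical names made
up for illustration, not existing buffer cache or ext4 interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct fs_state {
	bool corrupted;             /* set once a metadata write has failed */
	unsigned long failed_block; /* block number of the last failed write */
};

struct meta_buf {
	unsigned long blocknr;
	struct fs_state *fs;
	/* Hook the cache would call if writing this block back failed. */
	void (*on_write_error)(struct meta_buf *mb, int err);
};

/* File system side: record the failure and mark the fs as corrupted. */
static void fs_write_error(struct meta_buf *mb, int err)
{
	(void)err; /* a real handler might retry or remap before giving up */
	mb->fs->corrupted = true;
	mb->fs->failed_block = mb->blocknr;
}

/* Cache side: on write completion, notify the owner only on failure. */
static void cache_end_write(struct meta_buf *mb, int err)
{
	if (err && mb->on_write_error)
		mb->on_write_error(mb, err);
}
```

In this model the file system fills in on_write_error when it hands a
metadata block to the cache, and the handler is where a retry, a
"mark corrupted" step, or a notification to userspace would go.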

Attempts to fix (b) via enhancements to the buffer cache were shot
down by the linux-fsdevel bike-shedding cabal, because "the buffer
cache is deprecated", and at the time, I decided it wasn't worth
pushing it, since (a) was also a consideration, and I expect we can
also (c) reduce the memory overhead since there are large parts of
struct buffer_head that ext4 doesn't need.

There was *one* technical argument raised by people who want to
get rid of buffer heads, which is that the change from set_bh_page()
to folio_set_bh() introduced a bug which broke bh_offset() in a way
that only showed up if you were using bh_offset() and the file system
block size was less than the page size.
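
A small userspace model shows why a bug of this kind can hide when the
block size equals the page size.  This is a sketch under stated
assumptions, not the kernel code: MODEL_PAGE_SIZE, model_bh_offset()
and model_block_addr() are hypothetical names, the 4096-byte page size
is assumed, and the offset is derived by masking the data address with
the page size, which is modeled loosely on how bh_offset() has
traditionally been computed.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size for this model */

/* Offset of a block's data within its page, from the address alone. */
static size_t model_bh_offset(uintptr_t data_addr)
{
	return (size_t)(data_addr & (MODEL_PAGE_SIZE - 1));
}

/* Address of the idx-th block inside a page starting at page_addr. */
static uintptr_t model_block_addr(uintptr_t page_addr, size_t blocksize,
				  size_t idx)
{
	return page_addr + idx * blocksize;
}
```

With a 4096-byte block there is exactly one block per page, so the
offset is always 0 and a wrong offset calculation has nothing to get
wrong.  With 1024-byte blocks the same page holds four blocks at
offsets 0, 1024, 2048 and 3072, and any code that records or derives
the wrong offset for the later blocks misbehaves.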

Eh, it was a bug, and we caught it quickly enough once someone
actually tried to run xfstests on the commit, and it bisected pretty
quickly.  (Unfortunately, the change went in via the mm tree, and so
it wasn't noticed by the ext4 file system developers; but
fortunately, Zorro flagged it, and once that showed up, I
investigated it.)  As far as I'm concerned, that's working as
intended, and these sorts of things happen.  So personally, I don't
consider this an argument for nuking the buffer cache.

I *do* view it as only one of many concerns when we do make these
tree-wide changes, such as the folio migration.  Yes, these
tree-wide changes can introduce regressions, such as breaking bh_offset() for
a week or so before the regression tests catch it, and another week
before the fix makes its way to Linus's tree.  That's the system
working as designed.

But that's not the only concern; the other problem with these
tree-wide changes is that they tend to break automatic backports of bug
fixes to the LTS kernels, which now require manual handling by the
file system developers (or we could leave the LTS kernels with the
bugs unfixed, but that tends to make customers cranky :-).

Anyway, it's perhaps natural that the people who make these sorts of
tree-wide changes may get cranky when they need to modify, or at least
regression test, 20+ legacy file systems, and it does kind of suck
that many of these legacy file systems can't easily be tested by
xfstests because we don't even have a mkfs program for them.  (OTOH,
we recently merged ntfs3 w/o a working, or at least open-source, mkfs
program, so this isn't *just* the fault of legacy file systems.)

So sure, they may wish that we could make the job of landing these
sorts of tree-wide changes easier.  But we don't do tree-wide changes
all that often, and so it's a mistake to try to optimize for this
uncommon case.

Cheers,

					- Ted