On Sun, Sep 17, 2023 at 10:30:55AM -0700, Linus Torvalds wrote:
> And yes, *within* the context of a filesystem or two, the whole "try
> to avoid the buffer cache" can be a real thing.

Ext4 uses buffer_heads, and wasn't on your list because we don't use
sb_bread().  And we are thinking about getting rid of buffer heads,
mostly because (a) we want to have more control over which metadata
blocks get cached and which don't, and (b) the buffer cache doesn't
have a callback function to inform the file system if the writeback
fails, so that the file system can try to work around the issue, or at
the very least, mark the file system as corrupted and signal it via
fsnotify.

Attempts to fix (b) via enhancements to the buffer cache were shot
down by the linux-fsdevel bike-shedding cabal, because "the buffer
cache is deprecated", and at the time, I decided it wasn't worth
pushing it, since (a) was also a consideration, and I expect we can
also (c) reduce the memory overhead since there are large parts of
struct buffer_head that ext4 doesn't need.

There was *one* technical argument raised by people who want to get
rid of buffer heads, which is that the change from set_bh_page() to
folio_set_bh() introduced a bug which broke bh_offset() in a way that
only showed up if you were using bh_offset() and the file system block
size was less than the page size.

Eh, it was a bug, and we caught it quickly enough once someone
actually tried to run xfstests on the commit, and it bisected pretty
quickly.  (Unfortunately, the change went in via the mm tree, and so
it wasn't noticed by the ext4 file system developers; but fortunately,
Zorro flagged it, and once that showed up, I investigated it.)  As far
as I'm concerned, that's working as intended, and these sorts of
things happen.

So personally, I don't consider this an argument for nuking the buffer
cache.  I *do* view it as only one of many concerns when we do make
these tree-wide changes, such as the folio migration.
Yes, these tree-wide changes can introduce regressions, such as
breaking bh_offset() for a week or so before the regression tests
catch it, and another week before the fix makes its way to Linus's
tree.  That's the system working as designed.  But that's not the only
concern; the other problem with these tree-wide changes is that they
tend to break automatic backports of bug fixes to the LTS kernels,
which now require manual handling by the file system developers (or we
could leave the LTS kernels with the bugs unfixed, but that tends to
make customers cranky :-).

Anyway, it's perhaps natural that the people who make these sorts of
tree-wide changes may get cranky when they need to modify, or at least
regression test, 20+ legacy file systems, and it does kind of suck
that many of these legacy file systems can't easily be tested by
xfstests because we don't even have a mkfs program for them.  (OTOH,
we recently merged ntfs3 w/o a working, or at least open-source, mkfs
program, so this isn't *just* the fault of legacy file systems.)

So sure, they may wish that we could make the job of landing these
sorts of tree-wide changes easier.  But we don't do tree-wide changes
all that often, and so it's a mistake to try to optimize for this
non-common case.

Cheers,

					- Ted