Re: [MAINTAINERS/KERNEL SUMMIT] Trust and maintenance of file systems

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Tue, 19 Sep 2023 17:45:38 +0100

On Tue, Sep 19, 2023 at 12:34:17PM -0400, Theodore Ts'o wrote:
> On Tue, Sep 19, 2023 at 06:17:21AM +0100, Matthew Wilcox wrote:
> > Frustratingly, it looks like buffer_heads were intended to be used as
> > extents; each one has a b_size of its own.  But there's a ridiculous
> > amount of code that assumes that all BHs attached to a folio have the
> > same b_size as each other.
> 
> The primary reason why we need a per-bh b_size is for the benefit of
> non-iomap O_DIRECT code paths.  If that goes away, then we can
> simplify this significantly, since we flush the buffer cache whenever
> we change the blocksize used in the buffer cache; the O_DIRECT bh's
> aren't part of the buffer cache, which is when you might have bh's with
> a b_size of 8200k (when doing a 8200k O_DIRECT read or write).

I must not have explained myself very well.

What I was trying to say was that if the buffer cache actually supported
it, large folios and buffer_heads wouldn't perform horribly together,
unless you had a badly fragmented file.

e.g. you could allocate a 256kB folio, then ask the filesystem to
create buffer_heads for it, and maybe it would come back with a list
of four buffer_heads, the first representing the extent from 0-32kB,
the second 32kB-164kB, the third 164kB-252kB and the fourth 252kB-256kB.
Wherever there were physical discontiguities in the file.
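
To make the example concrete, here is a minimal userspace sketch of that
layout.  It is only a model: struct extent_bh, example_bhs and
extents_cover_folio() are hypothetical names, not the kernel's
struct buffer_head or any existing API.  It just records the four extents
and checks that they tile the 256kB folio with no gaps or overlaps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FOLIO_SIZE (256u * 1024u)

/* Hypothetical model of an extent-style buffer_head (not the kernel struct). */
struct extent_bh {
	size_t offset;	/* byte offset of the extent within the folio */
	size_t size;	/* length in bytes, playing the role of b_size */
};

/* The four extents from the example: 0-32kB, 32-164kB, 164-252kB, 252-256kB. */
static const struct extent_bh example_bhs[] = {
	{   0u * 1024u,  32u * 1024u },
	{  32u * 1024u, 132u * 1024u },
	{ 164u * 1024u,  88u * 1024u },
	{ 252u * 1024u,   4u * 1024u },
};
#define EXAMPLE_NR_BHS (sizeof(example_bhs) / sizeof(example_bhs[0]))

/* Return true if the extents cover [0, folio_size) with no gaps or overlaps. */
static bool extents_cover_folio(const struct extent_bh *bhs, size_t nr,
				size_t folio_size)
{
	size_t pos = 0;

	for (size_t i = 0; i < nr; i++) {
		if (bhs[i].offset != pos || bhs[i].size == 0)
			return false;
		pos += bhs[i].size;
	}
	return pos == folio_size;
}
```

Each boundary between entries corresponds to one physical discontiguity,
so a fully contiguous file would need a single entry for the whole folio.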

Then there would be only four buffer_heads to scan in order to determine
whether the entire folio was uptodate/dirty/written-back/etc.  It's still
O(n^2) but don't underestimate the power of reducing n to a small number.
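
As a rough illustration of the arithmetic, the sketch below counts how many
buffer_heads get visited if every I/O completion rescans the whole list
attached to the folio.  It is a userspace model under stated assumptions:
struct model_bh, folio_uptodate() and complete_all() are made-up names, and
the scan deliberately has no early exit so the count comes out as exactly
n * n.  The 64-entry case assumes one buffer_head per 4kB block in a 256kB
folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head; only the state we scan for. */
struct model_bh {
	bool uptodate;
};

/*
 * Walk every bh attached to the folio and report whether all of them are
 * uptodate.  *visited is incremented once per bh examined.
 */
static bool folio_uptodate(const struct model_bh *bhs, size_t nr,
			   size_t *visited)
{
	bool all = true;

	for (size_t i = 0; i < nr; i++) {
		(*visited)++;
		if (!bhs[i].uptodate)
			all = false;
	}
	return all;
}

/*
 * Simulate I/O completing on each bh in turn, rescanning the whole list
 * after every completion.  Returns the total number of bhs visited.
 */
static size_t complete_all(struct model_bh *bhs, size_t nr)
{
	size_t visited = 0;

	for (size_t i = 0; i < nr; i++) {
		bhs[i].uptodate = true;
		(void)folio_uptodate(bhs, nr, &visited);
	}
	return visited;
}
```

In this model, four extent-sized entries cost 4 * 4 = 16 visits, while 64
block-sized entries cost 64 * 64 = 4096.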

Possibly we'd want to change buffer_heads a little to support tracking
dirtiness on a finer granularity than per-extent (just as Ritesh
recently did to iomap).  But there is a path to happiness here that
doesn't involve switching everything to iomap.  If I try to do it, I
know I'll break everything while doing it ...