On Tue, Sep 19, 2023 at 11:15:54AM +1000, Dave Chinner wrote:
> This was easy to do with iomap based filesystems because they don't
> carry per-block filesystem structures for every folio cached in page
> cache - we carry a single object per folio that holds the 2 bits of
> per-filesystem block state we need for each block the folio maps.
> Compare that to a bufferhead - it uses 56 bytes of memory per
> filesystem block that is cached.

56?!  What kind of config do you have?  It's 104 bytes on Debian:

buffer_head          936   1092    104   39    1 : tunables    0    0    0 : slabdata     28     28      0

Maybe you were looking at a 32-bit system; most of the elements are
word-sized (pointers, size_t or long).

> So we have to consider that maybe it is less work to make high-order
> folios work with bufferheads. And that's where we start to get into
> the maintenance problems with old filesystems using bufferheads -
> how do we ensure that the changes for high-order folio support in
> bufferheads does not break the way one of these old filesystems
> that use bufferheads?  I don't think we can do it.

Regardless of the question you're posing here, the model where we
complete a BIO, then walk every buffer_head attached to the folio to
determine if we can now mark the folio as (uptodate /
not-under-writeback) just doesn't scale when you attach more than tens
of BHs to the folio.  It's one bit per BH rather than a summary bitmap
like iomap has.

I have been thinking about splitting the BH into two pieces, something
like this:

struct buffer_head_head {
        spinlock_t b_lock;
        struct buffer_head *buffers;
        unsigned long state[];
};

and removing BH_Uptodate and BH_Dirty in favour of setting bits in
state[] like iomap does (rough sketch of the completion side below).
But, as you say, there are a lot of filesystems that would need to be
audited and probably modified.

Frustratingly, it looks like buffer_heads were intended to be used as
extents; each one has a b_size of its own.  But there's a ridiculous
amount of code that assumes that all BHs attached to a folio have the
same b_size as each other.
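
To make the state[] bitmap idea concrete, here's a rough, untested
sketch of what the read completion path could look like under that
scheme - set a range of bits in state[] and mark the folio uptodate
once every block is covered, instead of walking each BH.  The helper
name, the locking and the block-count calculation are illustrative
assumptions, not a worked-out design:

#include <linux/bitmap.h>
#include <linux/buffer_head.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/spinlock.h>

/*
 * Untested sketch: assumes one buffer_head_head per folio and one
 * uptodate bit per filesystem block in state[].
 */
static void bhh_set_range_uptodate(struct folio *folio,
                struct buffer_head_head *bhh, unsigned int first_block,
                unsigned int nr_blocks)
{
        unsigned int blocks_per_folio =
                folio_size(folio) >> folio->mapping->host->i_blkbits;
        unsigned long flags;

        spin_lock_irqsave(&bhh->b_lock, flags);
        bitmap_set(bhh->state, first_block, nr_blocks);
        /* One bitmap test instead of a walk over every BH on the folio. */
        if (bitmap_full(bhh->state, blocks_per_folio))
                folio_mark_uptodate(folio);
        spin_unlock_irqrestore(&bhh->b_lock, flags);
}

Dirty tracking could use a second range of bits in the same array; the
completion side keeps the same shape either way.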