On Tue, Sep 19, 2023 at 11:15:54AM +1000, Dave Chinner wrote:
> This was easy to do with iomap based filesystems because they don't
> carry per-block filesystem structures for every folio cached in page
> cache - we carry a single object per folio that holds the 2 bits of
> per-filesystem block state we need for each block the folio maps.
> Compare that to a bufferhead - it uses 56 bytes of memory per
> filesystem block that is cached.

56?!  What kind of config do you have?  It's 104 bytes on Debian:

buffer_head          936   1092    104   39    1 : tunables    0    0    0 : slabdata     28     28      0

Maybe you were looking at a 32-bit system; most of the elements are
word-sized (pointers, size_t or long).

> So we have to consider that maybe it is less work to make high-order
> folios work with bufferheads. And that's where we start to get into
> the maintenance problems with old filesystems using bufferheads -
> how do we ensure that the changes for high-order folio support in
> bufferheads does not break the way one of these old filesystems
> that use bufferheads?  I don't think we can do it.

Regardless of the question you're posing here, the model where we
complete a BIO, then walk every buffer_head attached to the folio to
determine if we can now mark the folio as (uptodate /
not-under-writeback) just doesn't scale when you attach more than tens
of BHs to the folio.  It's one bit per BH rather than a summary bitmap
like iomap has.

I have been thinking about splitting the BH into two pieces, something
like this:

struct buffer_head_head {
        spinlock_t b_lock;
        struct buffer_head *buffers;
        unsigned long state[];
};

and removing BH_Uptodate and BH_Dirty in favour of setting bits in
state[] like iomap does (rough sketch of the completion side below).
But, as you say, there are a lot of filesystems that would need to be
audited and probably modified.

Frustratingly, it looks like buffer_heads were intended to be used as
extents; each one has a b_size of its own.  But there's a ridiculous
amount of code that assumes that all BHs attached to a folio have the
same b_size as each other.
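
To make the state[] bitmap idea concrete, here's a rough, untested
sketch of what the read completion path could look like under that
scheme - set a range of bits in state[] and mark the folio uptodate
once every block is covered, instead of walking each BH.  The helper
name, the locking and the block-count calculation are illustrative
assumptions, not a worked-out design:

#include <linux/bitmap.h>
#include <linux/buffer_head.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/spinlock.h>

/*
 * Untested sketch: assumes one buffer_head_head per folio and one
 * uptodate bit per filesystem block in state[].
 */
static void bhh_set_range_uptodate(struct folio *folio,
                struct buffer_head_head *bhh, unsigned int first_block,
                unsigned int nr_blocks)
{
        unsigned int blocks_per_folio =
                folio_size(folio) >> folio->mapping->host->i_blkbits;
        unsigned long flags;

        spin_lock_irqsave(&bhh->b_lock, flags);
        bitmap_set(bhh->state, first_block, nr_blocks);
        /* One bitmap test instead of a walk over every BH on the folio. */
        if (bitmap_full(bhh->state, blocks_per_folio))
                folio_mark_uptodate(folio);
        spin_unlock_irqrestore(&bhh->b_lock, flags);
}

Dirty tracking could use a second range of bits in the same array; the
completion side keeps the same shape either way.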