On Tue, Sep 19, 2023 at 12:34:17PM -0400, Theodore Ts'o wrote:
> On Tue, Sep 19, 2023 at 06:17:21AM +0100, Matthew Wilcox wrote:
> > Frustratingly, it looks like buffer_heads were intended to be used as
> > extents; each one has a b_size of its own. But there's a ridiculous
> > amount of code that assumes that all BHs attached to a folio have the
> > same b_size as each other.
>
> The primary reason why we need a per-bh b_size is for the benefit of
> non-iomap O_DIRECT code paths. If that goes away, then we can
> simplify this significantly, since we flush the buffer cache whenever
> we change the blocksize used in the buffer cache; the O_DIRECT bh's
> aren't part of the buffer cache, which is why you might have bh's with
> a b_size of 8200k (when doing a 8200k O_DIRECT read or write).

I must have not explained myself very well.

What I was trying to say was that if the buffer cache actually supported
it, large folios and buffer_heads wouldn't perform horribly together,
unless you had a badly fragmented file. eg you could allocate a 256kB
folio, then ask the filesystem to create buffer_heads for it, and maybe
it would come back with a list of four buffer_heads, the first
representing the extent from 0-32kB, the second 32kB-164kB, the third
164kB-252kB and the fourth 252kB-256kB, with a boundary wherever there
was a physical discontiguity in the file.

Then there would be only four buffer_heads to scan in order to determine
whether the entire folio was uptodate/dirty/written-back/etc. It's still
O(n^2) but don't underestimate the power of reducing N to a small number.

Possibly we'd want to change buffer_heads a little to support tracking
dirtiness on a finer granularity than per-extent (just as Ritesh recently
did to iomap). But there is a path to happiness here that doesn't involve
switching everything to iomap. If I try to do it, I know I'll break
everything while doing it ...
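
To make the four-extent example above concrete, here is a rough userspace
sketch in C. It is only an illustration under stated assumptions: struct
ext_bh, folio_all_uptodate(), folio_any_dirty() and build_example() are
hypothetical names invented for this sketch, not the kernel's struct
buffer_head or any existing kernel helper. It models a 256kB folio covered
by extents of differing lengths and shows that a state check only has to
visit one entry per extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KB(x) ((size_t)(x) * 1024)

/* Hypothetical userspace model; NOT the kernel's struct buffer_head. */
struct ext_bh {
	size_t off;          /* byte offset of the extent within the folio */
	size_t len;          /* extent length in bytes (a per-bh "b_size") */
	bool uptodate;
	bool dirty;
	struct ext_bh *next; /* next extent, NULL at the end of the list */
};

/*
 * True if the extents tile [0, folio_size) without gaps or overlaps
 * and every one of them is uptodate.
 */
static bool folio_all_uptodate(const struct ext_bh *head, size_t folio_size)
{
	size_t pos = 0;

	for (const struct ext_bh *bh = head; bh; bh = bh->next) {
		if (bh->off != pos || !bh->uptodate)
			return false;
		pos += bh->len;
	}
	return pos == folio_size;
}

/* True if at least one extent in the list is dirty. */
static bool folio_any_dirty(const struct ext_bh *head)
{
	for (const struct ext_bh *bh = head; bh; bh = bh->next)
		if (bh->dirty)
			return true;
	return false;
}

/*
 * Fill bh[0..3] with the extents from the example:
 * 0-32kB, 32kB-164kB, 164kB-252kB, 252kB-256kB, all uptodate and clean.
 */
static void build_example(struct ext_bh bh[4])
{
	static const size_t bounds[5] = {
		KB(0), KB(32), KB(164), KB(252), KB(256)
	};

	assert(bh != NULL);
	for (int i = 0; i < 4; i++) {
		bh[i].off = bounds[i];
		bh[i].len = bounds[i + 1] - bounds[i];
		bh[i].uptodate = true;
		bh[i].dirty = false;
		bh[i].next = (i < 3) ? &bh[i + 1] : NULL;
	}
}
```

Calling build_example() on a four-element array and then
folio_all_uptodate(bh, KB(256)) walks four entries. With one buffer_head
per 4kB block the same 256kB folio would need 64 of them. Tracking
dirtiness at a finer granularity than per-extent is deliberately left out
of this sketch.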