On Tue, Mar 07, 2023 at 10:11:43PM -0800, Luis Chamberlain wrote:
> On Sun, Mar 05, 2023 at 05:02:43AM +0000, Matthew Wilcox wrote:
> > On Sat, Mar 04, 2023 at 08:15:50PM -0800, Luis Chamberlain wrote:
> > > On Sat, Mar 04, 2023 at 04:39:02PM +0000, Matthew Wilcox wrote:
> > > > XFS already works with arbitrary-order folios.
> > >
> > > But block sizes > PAGE_SIZE is work which is still not merged. It
> > > *can* be with time. That would allow one to muck with larger block
> > > sizes than 4k on x86-64 for instance. Without this, you can't play
> > > ball.
> >
> > Do you mean that XFS is checking that fs block size <= PAGE_SIZE and
> > that check needs to be dropped? If so, I don't see where that happens.
>
> None of that. Back in 2018 Chinner had prototyped XFS support with
> larger block size > PAGE_SIZE:
>
> https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@xxxxxxxxxxxxx/

Having a working BS > PS implementation on XFS based on variable page
order support in the page cache goes back over a decade before that.
Christoph Lameter did the page cache work, and I added support for XFS
back in 2007. The total change to XFS required can be seen in this
simple patch:

https://lore.kernel.org/linux-mm/20070423093152.GI32602149@xxxxxxxxxxxxxxxxx/

That was when the howls of anguish about high order allocations Willy
mentioned started....

> I just did a quick attempt to rebase it and most of the leftover work
> is actually in iomap for writeback and zeroing / writes requiring new
> zero-around functionality.
> All bugs on the rebase are my own, only compile tested so far, and I'm
> not happy with some of the changes I had to make, so it likely could
> use tons more love:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230307-larger-bs-then-ps-xfs

On a current kernel, that patchset is fundamentally broken as we have
multi-page folio support in XFS and iomap - the patchset is inherently
PAGE_SIZE based and it will do the wrong thing with PAGE_SIZE based
zero-around.

IOWs, IOMAP_F_ZERO_AROUND does not need to exist any more, nor should
any of the custom hooks it triggered in different operations for
zero-around. That's because we should now be using the same approach to
BS > PS as we first used back in 2007.

We already support multi-page folios in the page cache, so all the
zero-around and partial folio uptodate tracking we need is already in
place. Hence, like Willy said, all we need to do is have
filemap_get_folio(FGP_CREAT) always allocate folios that are at least
filesystem block sized and aligned, and insert them into the mapping
tree.

Multi-page folios will always need to be sized as an integer multiple
of the filesystem block size, but once we ensure the size and alignment
of folios in the page cache, we get everything else for free.

/me cues the howls of anguish over memory fragmentation....

> But it should give you an idea of what type of things filesystems
> need to do.

Not really. It gives you an idea of what filesystems needed to do 5
years ago to support BS > PS. We're living in the age of folios now,
not pages.

Willy starting work on folios was why I dropped that patch set: firstly
because it was going to make the iomap conversion to folios harder, and
secondly because we realised that none of it was necessary if the page
cache natively supported multi-page folios.
IOWs, multi-page folios in the page cache should make BS > PS mostly
trivial to support for any filesystem or block device that doesn't have
some other dependency on PAGE_SIZE objects in the page cache (e.g.
bufferheads).

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx