Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios

On Tue, Mar 19, 2024 at 10:29:09AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 19, 2024 at 09:45:54AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > Now that we have the buffer cache using the folio API, we can extend
> > the use of folios to allocate high order folios for multi-page
> > buffers rather than an array of single pages that are then vmapped
> > into a contiguous range.
> > 
> > This creates two types of buffers: single folio buffers that can
> > have arbitrary order, and multi-folio buffers made up of many single
> > page folios that get vmapped. The latter is essentially the existing
> > code, so there are no logic changes to handle this case.
> > 
> > There are a few places where we iterate the folios on a buffer.
> > These need to be converted to handle the high order folio case.
> > Luckily, this only occurs when bp->b_folio_count == 1, and the code
> > for handling this case is just a simple application of the folio API
> > to the operations that need to be performed.
> > 
> > The code that allocates buffers will optimistically attempt a high
> > order folio allocation as a fast path. If this high order allocation
> > fails, then we fall back to the existing multi-folio allocation
> > code. This now forms the slow allocation path, and hopefully will be
> > largely unused in normal conditions.
> > 
> > This should improve performance of large buffer operations (e.g.
> > large directory block sizes) as we should now mostly avoid the
> > expense of vmapping large buffers (and the vmap lock contention that
> > can occur) as well as avoid the runtime pressure that frequently
> > accessing kernel vmapped pages puts on the TLBs.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_buf.c | 141 ++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 110 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 832226385154..7d9303497763 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -80,6 +80,10 @@ xfs_buf_is_vmapped(
> >  	return bp->b_addr && bp->b_folio_count > 1;
> >  }
> >  
> > +/*
> > + * See comment above xfs_buf_alloc_folios() about the constraints placed on
> > + * allocating vmapped buffers.
> > + */
> >  static inline int
> >  xfs_buf_vmap_len(
> >  	struct xfs_buf	*bp)
> > @@ -352,14 +356,63 @@ xfs_buf_alloc_kmem(
> >  		bp->b_addr = NULL;
> >  		return -ENOMEM;
> >  	}
> > -	bp->b_offset = offset_in_page(bp->b_addr);
> >  	bp->b_folios = bp->b_folio_array;
> >  	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> > +	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
> >  	bp->b_folio_count = 1;
> >  	bp->b_flags |= _XBF_KMEM;
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Allocating a high order folio makes the assumption that buffers are a
> > + * power-of-2 size so that ilog2() returns the exact order needed to fit
> > + * the contents of the buffer. Buffer lengths are mostly a power of two,
> > + * so this is not an unreasonable approach to take by default.
> > + *
> > + * The exception here are user xattr data buffers, which can be arbitrarily
> > + * sized up to 64kB plus structure metadata. In that case, round up the order.
> 
> So.... does that mean a 128K folio for a 68k xattr remote value buffer?

Yes. I figured that >64kB metadata is a rare corner case, and so
the extra memory usage isn't that big a deal. If it does ever become
a problem, we can skip straight to vmalloc for such buffers...
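
To put it concretely, the calculation is essentially just a round-up
to the next power-of-two order, i.e. something like this (illustrative
sketch only, not the exact hunk from the patch; get_order() is the
generic helper that rounds a byte count up to a page order):

	/*
	 * Sketch: map the buffer length to a folio allocation order.
	 * For power-of-two lengths this is exact; for something like a
	 * 68kB remote xattr buffer get_order() rounds up, so we'd try a
	 * 128kB (order 5 on 4kB pages) folio before falling back.
	 */
	unsigned int	order = get_order(BBTOB(bp->b_length));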

> I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
> tree, which isn't awesome.

Yup, that's because we have headers in remote xattr blocks for
holding identifying information and CRCs.
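
To put numbers on it: on a v5 filesystem each remote xattr block
loses sizeof(struct xfs_attr3_rmt_hdr) - 56 bytes, IIRC - to that
header, so a 4kB block only carries ~4040 bytes of payload and a
4096 byte merkle blob necessarily spills into a second block.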

> I haven't figured out a good way to deal
> with that, since the verity code uses a lot of blkno/byte shifting and
> 2k merkle tree blocks suck.

I really wish that the verity code just asked us to store individual
{key,value} tuples rather than large opaque blocks of fixed-size
data. Then we could store everything efficiently as individual
btree records in the attr fork dabtree, and verity could still
implement its page-sized opaque block storage for all the other
filesystems behind that....

> > @@ -1551,28 +1617,28 @@ xfs_buf_ioapply_map(
> >  
> >  next_chunk:
> >  	atomic_inc(&bp->b_io_remaining);
> > -	nr_pages = bio_max_segs(total_nr_pages);
> > +	nr_folios = bio_max_segs(total_nr_folios);
> >  
> > -	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
> > +	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
> >  	bio->bi_iter.bi_sector = sector;
> >  	bio->bi_end_io = xfs_buf_bio_end_io;
> >  	bio->bi_private = bp;
> >  
> > -	for (; size && nr_pages; nr_pages--, page_index++) {
> > -		int	rbytes, nbytes = PAGE_SIZE - offset;
> > +	for (; size && nr_folios; nr_folios--, folio_index++) {
> > +		struct folio	*folio = bp->b_folios[folio_index];
> > +		int		nbytes = folio_size(folio) - offset;
> >  
> >  		if (nbytes > size)
> >  			nbytes = size;
> >  
> > -		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
> > -				      offset);
> 
> Um, bio_add_folio returns a bool, maybe that's why hch was complaining
> about hangs with only the first few patches applied?

Good spot - entirely possible this is the problem...
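
For reference, bio_add_folio() returns a bool (the add either fully
succeeds or fails), so the loop presumably needs to end up looking
something like this (untested sketch, not the final fix):

		if (!bio_add_folio(bio, folio, nbytes, offset))
			break;	/* bio full - submit it and build another */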

> Annoying if the kbuild robots don't catch that.  Either way, with hch's
> replies to this and patch 2 dealt with,
> 
> Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

Thanks!

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



