On Wed, Feb 26, 2025 at 07:51:35AM -0800, Christoph Hellwig wrote:
> Now that we have the buffer cache using the folio API, we can extend
> the use of folios to allocate high order folios for multi-page
> buffers rather than an array of single pages that are then vmapped
> into a contiguous range.
> 
> This creates a new type of single folio buffers that can have arbitrary
> order in addition to the existing multi-folio buffers made up of many
> single page folios that get vmapped. The single folio is for now
> stashed into the existing b_pages array, but that will go away entirely
> later in the series and remove the temporary page vs folio typing issues
> that only work because the two structures currently can be used largely
> interchangeable.
> 
> The code that allocates buffers will optimistically attempt a high
> order folio allocation as a fast path if the buffer size is a power
> of two and thus fits into a folio. If this high order allocation
> fails, then we fall back to the existing multi-folio allocation
> code. This now forms the slow allocation path, and hopefully will be
> largely unused in normal conditions except for buffers with size
> that are not a power of two like larger remote xattrs.
> 
> This should improve performance of large buffer operations (e.g.
> large directory block sizes) as we should now mostly avoid the
> expense of vmapping large buffers (and the vmap lock contention that
> can occur) as well as avoid the runtime pressure that frequently
> accessing kernel vmapped pages put on the TLBs.
> 
> Based on a patch from Dave Chinner <dchinner@xxxxxxxxxx>, but mutilated
> beyond recognition.
> 
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> ---
>  fs/xfs/xfs_buf.c | 58 +++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 52 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f327bf5b04c0..3c582eaa656d 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -261,9 +261,10 @@ xfs_buf_free_pages(
> 
>  	for (i = 0; i < bp->b_page_count; i++) {
>  		if (bp->b_pages[i])
> -			__free_page(bp->b_pages[i]);
> +			folio_put(page_folio(bp->b_pages[i]));
>  	}
> -	mm_account_reclaimed_pages(bp->b_page_count);
> +	mm_account_reclaimed_pages(
> +			DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE));

Why do we round the number of bytes in the buffer up to base page size?
Don't we want howmany(BBTOB(bp->b_length), PAGE_SIZE) here?

Oh wait, howmany *is* DIV_ROUND_UP. Never mind...
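(For the archive: both macros expand to the same round-up division.
Quoting the definitions from memory, so double-check the exact header
locations, but roughly:

	/* linux/math.h, via __KERNEL_DIV_ROUND_UP in uapi/linux/const.h */
	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

	/* xfs compat headers */
	#define howmany(x, y)		(((x) + ((y) - 1)) / (y))

so howmany(BBTOB(bp->b_length), PAGE_SIZE) and
DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE) yield the same page
count.)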

>  	if (bp->b_pages != bp->b_page_array)
>  		kfree(bp->b_pages);
> 
> @@ -336,12 +337,17 @@ xfs_buf_alloc_kmem(
>  * For tmpfs-backed buffers used by in-memory btrees this directly maps the
>  * tmpfs page cache folios.
>  *
> - * For real file system buffers there are two different kinds backing memory:
> + * For real file system buffers there are three different kinds backing memory:
>  *
>  * The first type backs the buffer by a kmalloc allocation. This is done for
>  * less than PAGE_SIZE allocations to avoid wasting memory.
>  *
> - * The second type of buffer is the multi-page buffer. These are always made
> + * The second type is a single folio buffer - this may be a high order folio or
> + * just a single page sized folio, but either way they get treated the same way
> + * by the rest of the code - the buffer memory spans a single contiguous memory
> + * region that we don't have to map and unmap to access the data directly.
> + *
> + * The third type of buffer is the multi-page buffer. These are always made
>  * up of single pages so that they can be fed to vmap_ram() to return a
>  * contiguous memory region we can access the data through, or mark it as
>  * XBF_UNMAPPED and access the data directly through individual page_address()
> @@ -354,6 +360,7 @@ xfs_buf_alloc_backing_mem(
>  {
>  	size_t size = BBTOB(bp->b_length);
>  	gfp_t gfp_mask = GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOWARN;
> +	struct folio *folio;
>  	long filled = 0;
> 
>  	if (xfs_buftarg_is_mem(bp->b_target))
> @@ -375,7 +382,46 @@
>  	if (size < PAGE_SIZE && is_power_of_2(size))
>  		return xfs_buf_alloc_kmem(bp, size, gfp_mask);
> 
> -	/* Make sure that we have a page list */
> +	/* Assure zeroed buffer for non-read cases. */
> +	if (!(flags & XBF_READ))
> +		gfp_mask |= __GFP_ZERO;

Didn't this get added ten lines up in "xfs: remove the kmalloc to page
allocator fallback"?

> +
> +	/*
> +	 * Don't bother with the retry loop for single PAGE allocations, there
> +	 * is litte changes this can be better than the VM version.

Er... I don't understand the second half of this sentence; is this what
you're trying to communicate?:

"Don't bother with the retry loop for single-page allocations; vmalloc
won't do any better."

> +	 */
> +	if (size <= PAGE_SIZE)
> +		gfp_mask |= __GFP_NOFAIL;
> +
> +	/*
> +	 * Optimistically attempt a single high order folio allocation for
> +	 * larger than PAGE_SIZE buffers.
> +	 *
> +	 * Allocating a high order folio makes the assumption that buffers are a
> +	 * power-of-2 size, matching the power-of-2 folios sizes available.
> +	 *
> +	 * The exception here are user xattr data buffers, which can be arbitrarily
> +	 * sized up to 64kB plus structure metadata, skip straight to the vmalloc
> +	 * path for them instead of wasting memory.
> +	 * here.

Nit: "...path for them instead of wasting memory here."

--D

> +	 */
> +	if (size > PAGE_SIZE && !is_power_of_2(size))
> +		goto fallback;
> +	folio = folio_alloc(gfp_mask, get_order(size));
> +	if (!folio) {
> +		if (size <= PAGE_SIZE)
> +			return -ENOMEM;
> +		goto fallback;
> +	}
> +	bp->b_addr = folio_address(folio);
> +	bp->b_page_array[0] = &folio->page;
> +	bp->b_pages = bp->b_page_array;
> +	bp->b_page_count = 1;
> +	bp->b_flags |= _XBF_PAGES;
> +	return 0;
> +
> +fallback:
> +	/* Fall back to allocating an array of single page folios. */
>  	bp->b_page_count = DIV_ROUND_UP(size, PAGE_SIZE);
>  	if (bp->b_page_count <= XB_PAGES) {
>  		bp->b_pages = bp->b_page_array;
> @@ -1529,7 +1575,7 @@ xfs_buf_submit_bio(
>  	bio->bi_private = bp;
>  	bio->bi_end_io = xfs_buf_bio_end_io;
> 
> -	if (bp->b_flags & _XBF_KMEM) {
> +	if (bp->b_page_count == 1) {
>  		__bio_add_page(bio, virt_to_page(bp->b_addr), size,
>  				offset_in_page(bp->b_addr));
>  	} else {
> -- 
> 2.45.2
> 
> 
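(For anyone skimming the thread, here's the resulting allocation
strategy boiled down to illustrative pseudologic -- a sketch distilled
from the hunks above, not the actual code:

	size = BBTOB(bp->b_length);

	if (size < PAGE_SIZE && is_power_of_2(size))
		return xfs_buf_alloc_kmem(bp, size, gfp_mask);	/* kmalloc */
	if (size > PAGE_SIZE && !is_power_of_2(size))
		goto fallback;		/* e.g. large remote xattrs */
	folio = folio_alloc(gfp_mask, get_order(size));		/* fast path */
	if (!folio) {
		if (size <= PAGE_SIZE)
			return -ENOMEM;	/* __GFP_NOFAIL, shouldn't happen */
		goto fallback;		/* vmapped single page folios */
	}

With 4k pages a 64k directory block buffer is get_order(65536) == 4,
i.e. a single order-4 folio when memory isn't too fragmented, and it
only degrades to the vmap slow path under fragmentation.)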