On Wed, Mar 05, 2025 at 02:54:07PM -0800, Darrick J. Wong wrote:
> On Thu, Mar 06, 2025 at 08:20:08AM +1100, Dave Chinner wrote:
> > On Wed, Mar 05, 2025 at 07:05:27AM -0700, Christoph Hellwig wrote:
> > > The fallback buffer allocation path currently open codes a suboptimal
> > > version of vmalloc to allocate pages that are then mapped into
> > > vmalloc space. Switch to using vmalloc instead, which uses all the
> > > optimizations in the common vmalloc code, and removes the need to
> > > track the backing pages in the xfs_buf structure.
> > >
> > > Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> >
> > .....
> >
> > > @@ -1500,29 +1373,43 @@ static void
> > >  xfs_buf_submit_bio(
> > >  	struct xfs_buf		*bp)
> > >  {
> > > -	unsigned int		size = BBTOB(bp->b_length);
> > > -	unsigned int		map = 0, p;
> > > +	unsigned int		map = 0;
> > >  	struct blk_plug		plug;
> > >  	struct bio		*bio;
> > >
> > > -	bio = bio_alloc(bp->b_target->bt_bdev, bp->b_page_count,
> > > -			xfs_buf_bio_op(bp), GFP_NOIO);
> > > -	bio->bi_private = bp;
> > > -	bio->bi_end_io = xfs_buf_bio_end_io;
> > > +	if (is_vmalloc_addr(bp->b_addr)) {
> > > +		unsigned int	size = BBTOB(bp->b_length);
> > > +		unsigned int	alloc_size = roundup(size, PAGE_SIZE);
> > > +		void		*data = bp->b_addr;
> > >
> > > -	if (bp->b_page_count == 1) {
> > > -		__bio_add_page(bio, virt_to_page(bp->b_addr), size,
> > > -				offset_in_page(bp->b_addr));
> > > -	} else {
> > > -		for (p = 0; p < bp->b_page_count; p++)
> > > -			__bio_add_page(bio, bp->b_pages[p], PAGE_SIZE, 0);
> > > -		bio->bi_iter.bi_size = size; /* limit to the actual size used */
> > > +		bio = bio_alloc(bp->b_target->bt_bdev, alloc_size >> PAGE_SHIFT,
> > > +				xfs_buf_bio_op(bp), GFP_NOIO);
> > > +
> > > +		do {
> > > +			unsigned int	len = min(size, PAGE_SIZE);
> > >
> > > -		if (is_vmalloc_addr(bp->b_addr))
> > > -			flush_kernel_vmap_range(bp->b_addr,
> > > -					xfs_buf_vmap_len(bp));
> > > +			ASSERT(offset_in_page(data) == 0);
> > > +			__bio_add_page(bio, vmalloc_to_page(data), len, 0);
> > > +			data += len;
> > > +			size -= len;
> > > +		} while (size);
> > > +
> > > +		flush_kernel_vmap_range(bp->b_addr, alloc_size);
> > > +	} else {
> > > +		/*
> > > +		 * Single folio or slab allocation. Must be contiguous and thus
> > > +		 * only a single bvec is needed.
> > > +		 */
> > > +		bio = bio_alloc(bp->b_target->bt_bdev, 1, xfs_buf_bio_op(bp),
> > > +				GFP_NOIO);
> > > +		__bio_add_page(bio, virt_to_page(bp->b_addr),
> > > +				BBTOB(bp->b_length),
> > > +				offset_in_page(bp->b_addr));
> > >  	}
> >
> > How does offset_in_page() work with a high order folio? It can only
> > return a value between 0 and (PAGE_SIZE - 1). i.e. shouldn't this
> > be:
> >
> > 	folio = kmem_to_folio(bp->b_addr);
> >
> > 	bio_add_folio_nofail(bio, folio, BBTOB(bp->b_length),
> > 			offset_in_folio(folio, bp->b_addr));
>
> I think offset_in_folio() returns 0 in the !kmem && !vmalloc case
> because we allocate the folio and set b_addr to folio_address(folio);
> and we never call the kmem alloc code for sizes greater than PAGE_SIZE.

Yes, but that misses my point: this is a folio conversion, whilst this
code still treats the folio as a page.

We're trying to get rid of exactly this sort of page/folio type
confusion (i.e. questions like "does offset_in_page() work correctly
on large folios?"). New code shouldn't be adding new issues like
these, especially when there are existing folio-based APIs that are
guaranteed to work correctly and won't need fixing in future when
pages and folios are fully separated.
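i.e. a completely untested sketch of what I mean for the contiguous
branch, using virt_to_folio() for the address-to-folio lookup:

	} else {
		/*
		 * Single folio or slab allocation. Must be contiguous
		 * and thus only a single bvec is needed.
		 */
		struct folio	*folio = virt_to_folio(bp->b_addr);

		bio = bio_alloc(bp->b_target->bt_bdev, 1, xfs_buf_bio_op(bp),
				GFP_NOIO);

		/*
		 * bio_add_folio_nofail() takes the offset within the
		 * folio, so a high order folio just works here - no
		 * assumptions about PAGE_SIZE offsets anywhere.
		 */
		bio_add_folio_nofail(bio, folio, BBTOB(bp->b_length),
				offset_in_folio(folio, bp->b_addr));
	}

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx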