Brian Foster <bfoster@xxxxxxxxxx> writes:

> On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote:
>> Brian Foster <bfoster@xxxxxxxxxx> writes:
>>
>> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
>> >> When the filesystem blocksize is less than the folio size (either with
>> >> mapping_large_folio_support() or with blocksize < pagesize) and the
>> >> folio is uptodate in the pagecache, then even a single-byte write can
>> >> cause the entire folio to be written to disk during writeback. This
>> >> happens because we currently don't have a mechanism to track per-block
>> >> dirty state within struct iomap_page. We currently only track uptodate
>> >> state.
>> >>
>> >> This patch implements support for tracking per-block dirty state in
>> >> the iomap_page->state bitmap. This should help improve filesystem
>> >> write performance and reduce write amplification.
>> >>
>> >> Performance testing of the below fio workload reveals a ~16x
>> >> performance improvement using nvme with XFS (4k blocksize) on Power
>> >> (64K pagesize). FIO-reported write bw scores improved from around
>> >> ~28 MBps to ~452 MBps.
>> >>
>> >> 1. <test_randwrite.fio>
>> >> [global]
>> >> ioengine=psync
>> >> rw=randwrite
>> >> overwrite=1
>> >> pre_read=1
>> >> direct=0
>> >> bs=4k
>> >> size=1G
>> >> dir=./
>> >> numjobs=8
>> >> fdatasync=1
>> >> runtime=60
>> >> iodepth=64
>> >> group_reporting=1
>> >>
>> >> [fio-run]
>> >>
>> >> 2. Also, our internal performance team reported that this patch
>> >> improves their database workload performance by around ~83% (with XFS
>> >> on Power).
>> >>
>> >> Reported-by: Aravinda Herle <araherle@xxxxxxxxxx>
>> >> Reported-by: Brian Foster <bfoster@xxxxxxxxxx>
>> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
>> >> ---
>> >>  fs/gfs2/aops.c         |   2 +-
>> >>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
>> >>  fs/xfs/xfs_aops.c      |   2 +-
>> >>  fs/zonefs/file.c       |   2 +-
>> >>  include/linux/iomap.h  |   1 +
>> >>  5 files changed, 112 insertions(+), 10 deletions(-)
>> >>
>> > ...
>> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> >> index 25f20f269214..c7f41b26280a 100644
>> >> --- a/fs/iomap/buffered-io.c
>> >> +++ b/fs/iomap/buffered-io.c
>> > ...
>> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>> >>  	else
>> >>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>> >>
>> >> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>> >> +	/*
>> >> +	 * iop->state tracks two sets of state flags when the
>> >> +	 * filesystem block size is smaller than the folio size.
>> >> +	 * The first state tracks per-block uptodate and the
>> >> +	 * second tracks per-block dirty state.
>> >> +	 */
>> >> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>> >>  		      gfp);
>> >>  	if (iop) {
>> >>  		spin_lock_init(&iop->state_lock);
>> >>  		if (folio_test_uptodate(folio))
>> >>  			iop_set_range(iop, 0, nr_blocks);
>> >> +		if (is_dirty)
>> >> +			iop_set_range(iop, nr_blocks, nr_blocks);
>> >
>> > I find the is_dirty logic here a bit confusing. AFAICT, this is
>> > primarily to handle the case where a folio is completely overwritten, so
>> > no iop is allocated at write time, and so then writeback needs to
>> > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
>> > callers other than iomap_writepage_map() either pass the result of
>> > folio_test_dirty() or explicitly dirty the entire range of the folio
>> > anyways.
>> > iomap_dirty_folio() essentially does the latter because it needs to
>> > dirty the entire folio regardless of whether the iop already exists or
>> > not, right?
>>
>> Yes.
>>
>> >
>> > If so and if I'm following all of that correctly, could this complexity
>> > be isolated to iomap_writepage_map() by simply checking for the !iop
>> > case first, then calling iop_alloc() immediately followed by
>> > set_range_dirty() of the entire folio? Then presumably iop_alloc()
>> > could always just dirty based on folio state, with the writeback path
>> > exception case handled explicitly. Hm?
>> >
>>
>> Hi Brian,
>>
>> It was discussed here [1] to pass the is_dirty flag at the time of iop
>> allocation. We can do what you are essentially suggesting, but it's
>> just extra work, i.e. we would again do the blocks_per_folio, start and
>> end calculations and, more importantly, take and release the
>> iop->state_lock spinlock. Whereas with the above approach we handle
>> this at iop allocation time itself.
>>
>
> Hi Ritesh,
>
> Isn't that extra work already occurring in iomap_dirty_folio()? I was
> just thinking that maybe moving it to where it's apparently needed (i.e.
> writeback) might eliminate the need for the param.
>
> I suppose iomap_dirty_folio() would need to call filemap_dirty_folio()
> first to make sure iop_alloc() sees the dirty state, but maybe that
> would also allow skipping the iop alloc if the folio was already dirty
> (i.e. if the folio was previously dirtied by a full buffered overwrite,
> for example)?
>
> I've appended a quick diff below (compile tested only) just to explain
> what I mean. When doing that it also occurred to me that if we really
> care about the separate call, we could keep the is_dirty param but do
> the __iop_alloc() wrapper thing where iop_alloc() always passes
> folio_test_dirty().

Sure, Brian. I guess when we got that comment the work was still at an
initial stage, and I was passing a from_writeback flag anyway. Thanks
for the diff, it does make sense to me. I will try to add that change in
the next revision.

> BTW, I think you left off your [1] discussion reference..
>

Sorry, my bad.

[1]: https://lore.kernel.org/linux-xfs/Y9f7cZxnXbL7x0p+@xxxxxxxxxxxxx/

>> Besides, isn't it easier this way? As you also stated, we dirty all the
>> blocks based on the is_dirty flag, which is folio_test_dirty() except
>> at writeback time.
>>
>
> My thinking was just that I kind of had to read through all of the
> iop_alloc() callsites to grok the purpose of the parameter, which made
> it seem unnecessarily confusing. But ultimately it made sense, so I
> don't insist on changing it or anything if this approach is intentional
> and/or preferred by others. That's just my .02 and I'll defer to your
> preference. :)
>

Sure, thanks! For the record, my reading of the __iop_alloc() wrapper
idea you mentioned above is sketched below.
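This is only a rough, untested sketch: the __iop_alloc() body is
reconstructed from the hunk quoted earlier plus the usual iop_alloc()
preamble, so treat the details as illustrative rather than final.

static struct iomap_page *__iop_alloc(struct inode *inode,
		struct folio *folio, unsigned int flags, bool is_dirty)
{
	struct iomap_page *iop = to_iomap_page(folio);
	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
	gfp_t gfp;

	/* Nothing to do for a single-block folio or if an iop exists. */
	if (iop || nr_blocks <= 1)
		return iop;

	if (flags & IOMAP_NOWAIT)
		gfp = GFP_NOWAIT;
	else
		gfp = GFP_NOFS | __GFP_NOFAIL;

	/*
	 * The first nr_blocks bits of iop->state track per-block uptodate
	 * state, the second nr_blocks bits track per-block dirty state.
	 */
	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
		      gfp);
	if (iop) {
		spin_lock_init(&iop->state_lock);
		if (folio_test_uptodate(folio))
			iop_set_range(iop, 0, nr_blocks);
		if (is_dirty)
			iop_set_range(iop, nr_blocks, nr_blocks);
		folio_attach_private(folio, iop);
	}
	return iop;
}

/*
 * Regular callers always dirty the iop based on the current folio state;
 * only the writeback path would pass is_dirty explicitly via __iop_alloc().
 */
static struct iomap_page *iop_alloc(struct inode *inode,
		struct folio *folio, unsigned int flags)
{
	return __iop_alloc(inode, folio, flags, folio_test_dirty(folio));
}

That keeps the explicit dirty control available for writeback while every
other caller simply follows the current folio state.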
>> >>  		folio_attach_private(folio, iop);
>> >>  	}
>> >>  	return iop;
>> > ...
>> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
> ...
>> >
>> > WRT to the !iop case.. I _think_ this is handled correctly here because
>> > that means we'd handle the folio as completely dirty at writeback time.
>> > Is that the case? If so, it might be nice to document that assumption
>> > somewhere (here or perhaps in the writeback path).
>> >
>>
>> The !iop case is simply when we don't have a large folio and blocksize ==
>> pagesize. In that case we don't allocate any iop and simply return from
>> iop_alloc(). So then we just skip the loop, which is only meant for when
>> we have blocks within a folio.
>>
>
> Isn't it also the case that iop might be NULL at this point if the fs
> has sub-folio blocks, but the folio was dirtied by a full overwrite of
> the folio? Or did I misunderstand patch 4?
>

Yes, both cases: we can either have bs == ps, or we can have a completely
overwritten folio, in which case we don't allocate any iop in
iomap_writepage_map(). Sure, I will document that when we factor out that
change into a separate function. Something along the lines of the sketch
below is what I have in mind.
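(Rough and untested; the helper name is just illustrative, not final. The
body mirrors the !iop handling from your diff below.)

/*
 * The folio may reach writeback without an iop attached in two cases:
 *
 * 1. blocksize == pagesize and the folio is not a large folio, so no
 *    per-block state is needed and iop stays NULL throughout;
 * 2. the folio was dirtied by a full buffered overwrite, so no iop was
 *    allocated at write time and every block must be treated as dirty.
 */
static struct iomap_page *iop_writepage_alloc(struct inode *inode,
		struct folio *folio)
{
	struct iomap_page *iop = to_iomap_page(folio);

	if (!iop) {
		iop = iop_alloc(inode, folio, 0);
		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
	}
	return iop;
}

For the bs == ps case iop_alloc() still returns NULL and
iop_set_range_dirty() has no iop to update (as in iomap_dirty_folio(), it
has to cope with a NULL iop), which is fine because the per-block dirty
checks in iomap_writepage_map() only apply when an iop exists.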
> Brian
>
> --- 8< ---
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 92e1e1061225..89b3053e3f2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
>  }
>
>  static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> -		unsigned int flags, bool is_dirty)
> +		unsigned int flags)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
>  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> @@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>  		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
>  			iop_set_range(iop, 0, nr_blocks);
> -		if (is_dirty)
> +		if (folio_test_dirty(folio))
>  			iop_set_range(iop, nr_blocks, nr_blocks);
>  		folio_attach_private(folio, iop);
>  	}
> @@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>  	if (WARN_ON_ONCE(size > iomap->length))
>  		return -EIO;
>  	if (offset > 0)
> -		iop = iop_alloc(iter->inode, folio, iter->flags,
> -				folio_test_dirty(folio));
> +		iop = iop_alloc(iter->inode, folio, iter->flags);
>  	else
>  		iop = to_iomap_page(folio);
>
> @@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  		return iomap_read_inline_data(iter, folio);
>
>  	/* zero post-eof blocks as the page may be mapped */
> -	iop = iop_alloc(iter->inode, folio, iter->flags,
> -			folio_test_dirty(folio));
> +	iop = iop_alloc(iter->inode, folio, iter->flags);
>  	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
>  	if (plen == 0)
>  		goto done;
> @@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>
>  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
>  {
> -	struct iomap_page *iop;
> -	struct inode *inode = mapping->host;
> -	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
> -
> -	iop = iop_alloc(inode, folio, 0, false);
> -	iop_set_range_dirty(inode, folio, 0, len);
> -	return filemap_dirty_folio(mapping, folio);
> +	bool dirtied = filemap_dirty_folio(mapping, folio);
> +	if (dirtied)
> +		iop_alloc(mapping->host, folio, 0);
> +	return dirtied;
>  }
>  EXPORT_SYMBOL_GPL(iomap_dirty_folio);
>
> @@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	    pos + len >= folio_pos(folio) + folio_size(folio))
>  		return 0;
>
> -	iop = iop_alloc(iter->inode, folio, iter->flags,
> -			folio_test_dirty(folio));
> +	iop = iop_alloc(iter->inode, folio, iter->flags);
>
>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>  		return -EAGAIN;
> @@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
> +	struct iomap_page *iop = to_iomap_page(folio);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	int error = 0, count = 0, i;
>  	LIST_HEAD(submit_list);
>
> +	if (!iop) {
> +		iop = iop_alloc(inode, folio, 0);
> +		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
> +	}
> +
>  	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>
>  	/*

Thanks for the review!
-ritesh