Re: [RFC 2/2] iomap: Support subpage size dirty tracking to improve write performance

On 22/10/28 10:01AM, Darrick J. Wong wrote:
> On Fri, Oct 28, 2022 at 10:00:33AM +0530, Ritesh Harjani (IBM) wrote:
> > On 64k pagesize platforms (especially Power and/or aarch64) with a 4k
> > filesystem blocksize, this patch should improve write performance by
> > writing out only the dirty subpage data.
> > 
> > This should also reduce write amplification, since we can now track
> > subpage dirty status in the state bitmap. Earlier we had to write
> > out the entire 64k page even if only a part of it (e.g. 4k) was
> > updated.
> > 
> > Performance testing of the below fio workload reveals a ~16x performance
> > improvement on NVMe with XFS (4k blocksize) on Power (64K pagesize).
> > FIO-reported write bandwidth improved from ~28 MBps to ~452 MBps.
> > 
> > <test_randwrite.fio>
> > [global]
> > 	ioengine=psync
> > 	rw=randwrite
> > 	overwrite=1
> > 	pre_read=1
> > 	direct=0
> > 	bs=4k
> > 	size=1G
> > 	dir=./
> > 	numjobs=8
> > 	fdatasync=1
> > 	runtime=60
> > 	iodepth=64
> > 	group_reporting=1
> > 
> > [fio-run]
> > 
> > Reported-by: Aravinda Herle <araherle@xxxxxxxxxx>
> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> > ---
> >  fs/iomap/buffered-io.c | 53 ++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 51 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 255f9f92668c..31ee80a996b2 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -58,7 +58,7 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> >  	else
> >  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >  
> > -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> > +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
> >  		      gfp);
> >  	if (iop) {
> >  		spin_lock_init(&iop->state_lock);
> > @@ -168,6 +168,48 @@ static void iomap_set_range_uptodate(struct folio *folio,
> >  		folio_mark_uptodate(folio);
> >  }
> >  
> > +static void iomap_iop_set_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	struct inode *inode = folio->mapping->host;
> > +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > +	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
> > +	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&iop->state_lock, flags);
> > +	bitmap_set(iop->state, first, last - first + 1);
> > +	spin_unlock_irqrestore(&iop->state_lock, flags);
> > +}
> > +
> > +static void iomap_set_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	if (iop)
> > +		iomap_iop_set_range_dirty(folio, iop, off, len);
> > +}
> > +
> > +static void iomap_iop_clear_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	struct inode *inode = folio->mapping->host;
> > +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > +	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
> > +	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&iop->state_lock, flags);
> > +	bitmap_clear(iop->state, first, last - first + 1);
> > +	spin_unlock_irqrestore(&iop->state_lock, flags);
> > +}
> > +
> > +static void iomap_clear_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	if (iop)
> > +		iomap_iop_clear_range_dirty(folio, iop, off, len);
> > +}
> > +
> >  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
> >  		size_t len, int error)
> >  {
> > @@ -665,6 +707,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> >  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
> >  		return 0;
> >  	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> > +	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, pos), len);
> >  	filemap_dirty_folio(inode->i_mapping, folio);
> >  	return copied;
> >  }
> > @@ -979,6 +1022,8 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
> >  		block_commit_write(&folio->page, 0, length);
> >  	} else {
> >  		WARN_ON_ONCE(!folio_test_uptodate(folio));
> > +		iomap_set_range_dirty(folio, to_iomap_page(folio),
> > +				offset_in_folio(folio, iter->pos), length);
> >  		folio_mark_dirty(folio);
> >  	}
> >  
> > @@ -1354,7 +1399,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  	 * invalid, grab a new one.
> >  	 */
> >  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> > -		if (iop && !test_bit(i, iop->state))
> > +		if (iop && (!test_bit(i, iop->state) ||
> > +			    !test_bit(i + nblocks, iop->state)))
> 
> Hmm.  So I /think/ these two test_bit()s mean that we skip any folio
> sub-block if it's either not uptodate or not dirty?
> 
> I /think/ we only need to check the dirty status, right?  Like willy
> said? :)

Yes. Agreed.
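
For reference, here is a minimal userspace sketch of the bitmap layout this
patch uses (the first nr_blocks bits of iop->state track uptodate, the next
nr_blocks bits track dirty), with the writeback skip check reduced to the
dirty bit alone. The helper names here are illustrative, not the kernel API:

#include <stdbool.h>
#include <stdio.h>

#define NR_BLOCKS 16	/* e.g. a 64k folio with 4k fs blocks */

/* state[] models iop->state: uptodate bits first, then dirty bits. */
static unsigned long state[(2 * NR_BLOCKS + 63) / 64];

static void set_bit_(unsigned int n)
{
	state[n / 64] |= 1UL << (n % 64);
}

static bool test_bit_(unsigned int n)
{
	return state[n / 64] & (1UL << (n % 64));
}

/* Mark blocks [first, last] dirty, as iomap_iop_set_range_dirty() does. */
static void set_range_dirty(unsigned int first, unsigned int last)
{
	for (unsigned int i = first; i <= last; i++)
		set_bit_(i + NR_BLOCKS);
}

int main(void)
{
	set_range_dirty(3, 5);	/* dirty a 12k range of 4k blocks */

	/* Writeback skip check: only the dirty bit matters. */
	for (unsigned int i = 0; i < NR_BLOCKS; i++) {
		if (!test_bit_(i + NR_BLOCKS))
			continue;	/* clean block, skip it */
		printf("block %u needs writeback\n", i);
	}
	return 0;
}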

> 
> That said... somewhere we probably ought to check the consistency of the
> two bits to ensure that they're not (dirty && !uptodate), given our
> horrible history of getting things wrong with page and bufferhead state
> bits.
> 
> Admittedly I'm not thrilled at the reintroduction of page and iop dirty
> state that are updated in separate places, but OTOH the write
> amplification here is demonstrably horrifying as you point out so it's
> clearly necessary.

On a 64K pagesize platform, the performance of workloads like the one I
mention above is also quite bad.


> 
> Maybe we need a debugging function that will check the page and iop
> state, and call it every time we go in and out of critical iomap
> functions (write, writeback, dropping pages, etc)

I will try to review each of the paths once again to ensure consistency.
What I see is that we only mark the iop->state dirty bits before dirtying the
page in the iomap buffered-io paths. This happens in two places:
1. __iomap_write_end(), where we call filemap_dirty_folio(). We mark the iop
   state dirty bits before calling filemap_dirty_folio().
2. iomap_folio_mkwrite_iter(). Here again, before calling folio_mark_dirty(),
   we set the dirty state bits. This is the iomap_page_mkwrite path.

But I would still like to review each of these and other paths as well.
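
As a starting point for the debug check suggested above, something along
these lines might work. This is only a rough sketch against the doubled
iop->state layout (bit i is the uptodate bit for block i, bit i + nr_blocks
its dirty bit); iomap_iop_check_state() is a hypothetical helper, not an
existing one, and callers would need to serialize against state updates
(e.g. by holding iop->state_lock):

static void iomap_iop_check_state(struct inode *inode, struct folio *folio,
		struct iomap_page *iop)
{
	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
	unsigned int i;

	/* A dirty block that is not uptodate indicates a tracking bug. */
	for (i = 0; i < nr_blocks; i++)
		WARN_ON_ONCE(test_bit(i + nr_blocks, iop->state) &&
			     !test_bit(i, iop->state));
}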

-ritesh


> 
> --D
> 
> >  			continue;
> >  
> >  		error = wpc->ops->map_blocks(wpc, inode, pos);
> > @@ -1397,6 +1443,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  		}
> >  	}
> >  
> > +	iomap_clear_range_dirty(folio, iop,
> > +				offset_in_folio(folio, folio_pos(folio)),
> > +				end_pos - folio_pos(folio));
> >  	folio_start_writeback(folio);
> >  	folio_unlock(folio);
> >  
> > -- 
> > 2.37.3
> > 


