On 22/10/28 10:01AM, Darrick J. Wong wrote:
> On Fri, Oct 28, 2022 at 10:00:33AM +0530, Ritesh Harjani (IBM) wrote:
> > On a 64k pagesize platforms (specially Power and/or aarch64) with 4k
> > filesystem blocksize, this patch should improve the performance by doing
> > only the subpage dirty data write.
> >
> > This should also reduce the write amplification since we can now track
> > subpage dirty status within state bitmaps. Earlier we had to
> > write the entire 64k page even if only a part of it (e.g. 4k) was
> > updated.
> >
> > Performance testing of below fio workload reveals ~16x performance
> > improvement on nvme with XFS (4k blocksize) on Power (64K pagesize)
> > FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> >
> > <test_randwrite.fio>
> > [global]
> > ioengine=psync
> > rw=randwrite
> > overwrite=1
> > pre_read=1
> > direct=0
> > bs=4k
> > size=1G
> > dir=./
> > numjobs=8
> > fdatasync=1
> > runtime=60
> > iodepth=64
> > group_reporting=1
> >
> > [fio-run]
> >
> > Reported-by: Aravinda Herle <araherle@xxxxxxxxxx>
> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> > ---
> >  fs/iomap/buffered-io.c | 53 ++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 51 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 255f9f92668c..31ee80a996b2 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -58,7 +58,7 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> >  	else
> >  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >
> > -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> > +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
> >  		      gfp);
> >  	if (iop) {
> >  		spin_lock_init(&iop->state_lock);
> > @@ -168,6 +168,48 @@ static void iomap_set_range_uptodate(struct folio *folio,
> >  		folio_mark_uptodate(folio);
> >  }
> >
> > +static void iomap_iop_set_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	struct inode *inode = folio->mapping->host;
> > +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > +	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
> > +	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&iop->state_lock, flags);
> > +	bitmap_set(iop->state, first, last - first + 1);
> > +	spin_unlock_irqrestore(&iop->state_lock, flags);
> > +}
> > +
> > +static void iomap_set_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	if (iop)
> > +		iomap_iop_set_range_dirty(folio, iop, off, len);
> > +}
> > +
> > +static void iomap_iop_clear_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	struct inode *inode = folio->mapping->host;
> > +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > +	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
> > +	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&iop->state_lock, flags);
> > +	bitmap_clear(iop->state, first, last - first + 1);
> > +	spin_unlock_irqrestore(&iop->state_lock, flags);
> > +}
> > +
> > +static void iomap_clear_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	if (iop)
> > +		iomap_iop_clear_range_dirty(folio, iop, off, len);
> > +}
> > +
> >  static void
> >  iomap_finish_folio_read(struct folio *folio, size_t offset,
> >  		size_t len, int error)
> >  {
> > @@ -665,6 +707,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> >  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
> >  		return 0;
> >  	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> > +	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, pos), len);
> >  	filemap_dirty_folio(inode->i_mapping, folio);
> >  	return copied;
> >  }
> > @@ -979,6 +1022,8 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
> >  		block_commit_write(&folio->page, 0, length);
> >  	} else {
> >  		WARN_ON_ONCE(!folio_test_uptodate(folio));
> > +		iomap_set_range_dirty(folio, to_iomap_page(folio),
> > +				offset_in_folio(folio, iter->pos), length);
> >  		folio_mark_dirty(folio);
> >  	}
> >
> > @@ -1354,7 +1399,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  	 * invalid, grab a new one.
> >  	 */
> >  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> > -		if (iop && !test_bit(i, iop->state))
> > +		if (iop && (!test_bit(i, iop->state) ||
> > +			    !test_bit(i + nblocks, iop->state)))
>
> Hmm. So I /think/ these two test_bit()s mean that we skip any folio
> sub-block if it's either notuptodate or not dirty?
>
> I /think/ we only need to check the dirty status, right? Like willy
> said? :)

Yes. Agreed.

>
> That said... somewhere we probably ought to check the consistency of the
> two bits to ensure that they're not (dirty && !uptodate), given our
> horrible history of getting things wrong with page and bufferhead state
> bits.
>
> Admittedly I'm not thrilled at the reintroduction of page and iop dirty
> state that are updated in separate places, but OTOH the write
> amplification here is demonstrably horrifying as you point out so it's
> clearly necessary.

On a 64K pagesize platform, the performance of workloads like the one I
mention above is also quite bad.

>
> Maybe we need a debugging function that will check the page and iop
> state, and call it every time we go in and out of critical iomap
> functions (write, writeback, dropping pages, etc)

I will try and review each of the paths once again to ensure
consistency.

What I see is that we only mark the iop->state dirty bits before
dirtying the page in the iomap buffered-io paths. This happens in two
places:

1. __iomap_write_end(), where we call filemap_dirty_folio(). We mark the
   iop state dirty bits before calling filemap_dirty_folio().

2. iomap_folio_mkwrite_iter(). Here again, we set the dirty state bits
   before calling folio_mark_dirty(). This is the iomap_page_mkwrite()
   path.

But I would still like to review each of these and the other paths as
well.

-ritesh

>
> --D
>
> >  			continue;
> >
> >  		error = wpc->ops->map_blocks(wpc, inode, pos);
> > @@ -1397,6 +1443,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  		}
> >  	}
> >
> > +	iomap_clear_range_dirty(folio, iop,
> > +			offset_in_folio(folio, folio_pos(folio)),
> > +			end_pos - folio_pos(folio));
> >  	folio_start_writeback(folio);
> >  	folio_unlock(folio);
> >
> > --
> > 2.37.3
> >
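
For the debugging check Darrick suggested above, I am thinking of
something along the lines of the below. This is a completely untested
sketch; the function name iomap_iop_check_state() and where exactly to
call it from are just placeholders. It only asserts that a dirty
sub-block is always uptodate, i.e. that we never see (dirty &&
!uptodate), relying on the layout this patch uses where the first
nr_blocks bits of iop->state track uptodate and the next nr_blocks bits
track dirty:

/*
 * Untested debug sketch: verify that every sub-block marked dirty in
 * iop->state is also marked uptodate. With this patch the dirty bit for
 * block i lives at position i + nr_blocks.
 */
static void iomap_iop_check_state(struct folio *folio, struct iomap_page *iop)
{
	struct inode *inode = folio->mapping->host;
	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
	unsigned int i;
	unsigned long flags;

	if (!iop)
		return;

	spin_lock_irqsave(&iop->state_lock, flags);
	for (i = 0; i < nr_blocks; i++)
		/* dirty implies uptodate; (dirty && !uptodate) is a bug */
		WARN_ON_ONCE(test_bit(i + nr_blocks, iop->state) &&
			     !test_bit(i, iop->state));
	spin_unlock_irqrestore(&iop->state_lock, flags);
}

Something like this could then be called on entry/exit of the write,
writeback and folio release paths, possibly guarded by a debug config
option so it compiles away in normal builds.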