Andreas Gruenbacher <agruenba@xxxxxxxxxx> writes: > On Mon, Jun 19, 2023 at 4:29 AM Ritesh Harjani (IBM) > <ritesh.list@xxxxxxxxx> wrote: >> When filesystem blocksize is less than folio size (either with >> mapping_large_folio_support() or with blocksize < pagesize) and when the >> folio is uptodate in pagecache, then even a byte write can cause >> an entire folio to be written to disk during writeback. This happens >> because we currently don't have a mechanism to track per-block dirty >> state within struct iomap_folio_state. We currently only track uptodate >> state. >> >> This patch implements support for tracking per-block dirty state in >> iomap_folio_state->state bitmap. This should help improve the filesystem >> write performance and help reduce write amplification. >> >> Performance testing of below fio workload reveals ~16x performance >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. >> >> 1. <test_randwrite.fio> >> [global] >> ioengine=psync >> rw=randwrite >> overwrite=1 >> pre_read=1 >> direct=0 >> bs=4k >> size=1G >> dir=./ >> numjobs=8 >> fdatasync=1 >> runtime=60 >> iodepth=64 >> group_reporting=1 >> >> [fio-run] >> >> 2. Also our internal performance team reported that this patch improves >> their database workload performance by around ~83% (with XFS on Power) >> >> Reported-by: Aravinda Herle <araherle@xxxxxxxxxx> >> Reported-by: Brian Foster <bfoster@xxxxxxxxxx> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> >> --- >> fs/gfs2/aops.c | 2 +- >> fs/iomap/buffered-io.c | 189 ++++++++++++++++++++++++++++++++++++----- >> fs/xfs/xfs_aops.c | 2 +- >> fs/zonefs/file.c | 2 +- >> include/linux/iomap.h | 1 + >> 5 files changed, 171 insertions(+), 25 deletions(-) >> >> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c >> index a5f4be6b9213..75efec3c3b71 100644 >> --- a/fs/gfs2/aops.c >> +++ b/fs/gfs2/aops.c >> @@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = { >> .writepages = gfs2_writepages, >> .read_folio = gfs2_read_folio, >> .readahead = gfs2_readahead, >> - .dirty_folio = filemap_dirty_folio, >> + .dirty_folio = iomap_dirty_folio, >> .release_folio = iomap_release_folio, >> .invalidate_folio = iomap_invalidate_folio, >> .bmap = gfs2_bmap, >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >> index 391d918ddd22..50f5840bb5f9 100644 >> --- a/fs/iomap/buffered-io.c >> +++ b/fs/iomap/buffered-io.c >> @@ -25,7 +25,7 @@ >> >> typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length); >> /* >> - * Structure allocated for each folio to track per-block uptodate state >> + * Structure allocated for each folio to track per-block uptodate, dirty state >> * and I/O completions. >> */ >> struct iomap_folio_state { >> @@ -35,31 +35,55 @@ struct iomap_folio_state { >> unsigned long state[]; >> }; >> >> +enum iomap_block_state { >> + IOMAP_ST_UPTODATE, >> + IOMAP_ST_DIRTY, >> + >> + IOMAP_ST_MAX, >> +}; >> + >> +static void ifs_calc_range(struct folio *folio, size_t off, size_t len, >> + enum iomap_block_state state, unsigned int *first_blkp, >> + unsigned int *nr_blksp) >> +{ >> + struct inode *inode = folio->mapping->host; >> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio); >> + unsigned int first = off >> inode->i_blkbits; >> + unsigned int last = (off + len - 1) >> inode->i_blkbits; >> + >> + *first_blkp = first + (state * blks_per_folio); >> + *nr_blksp = last - first + 1; >> +} >> + >> static struct bio_set iomap_ioend_bioset; >> >> static inline bool ifs_is_fully_uptodate(struct folio *folio, >> struct iomap_folio_state *ifs) >> { >> struct inode *inode = folio->mapping->host; >> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio); >> + unsigned int nr_blks = (IOMAP_ST_UPTODATE + 1) * blks_per_folio; > > This nr_blks calculation doesn't make sense. > About this, I have replied with more details here [1] [1]: https://lore.kernel.org/linux-xfs/87o7lbmnam.fsf@xxxxxxx/ >> - return bitmap_full(ifs->state, i_blocks_per_folio(inode, folio)); >> + return bitmap_full(ifs->state, nr_blks); > > Could you please change this to: > > BUILD_BUG_ON(IOMAP_ST_UPTODATE != 0); ditto > return bitmap_full(ifs->state, blks_per_folio); > > Also, I'm seeing that the value of i_blocks_per_folio() is assigned to > local variables with various names in several places (blks_per_folio, > nr_blocks, nblocks). Maybe this could be made consistent. > I remember giving it a try, but then it was really not worth it because all of above naming also does make sense in the way they are getting used currently in the code. So, I think as long as it is clear behind what those variables means and how are they getting used, it should be ok, IMO. -ritesh