Re: [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Brian Foster <bfoster@xxxxxxxxxx> writes:

> On Mon, Feb 27, 2023 at 01:13:32AM +0530, Ritesh Harjani (IBM) wrote:
>> On a 64k pagesize platforms (specially Power and/or aarch64) with 4k
>> filesystem blocksize, this patch should improve the performance by doing
>> only the subpage dirty data write.
>>
>> This should also reduce the write amplification since we can now track
>> subpage dirty status within state bitmaps. Earlier we had to
>> write the entire 64k page even if only a part of it (e.g. 4k) was
>> updated.
>>
>> Performance testing of below fio workload reveals ~16x performance
>> improvement on nvme with XFS (4k blocksize) on Power (64K pagesize)
>> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>>
>> 1. <test_randwrite.fio>
>> [global]
>> 	ioengine=psync
>> 	rw=randwrite
>> 	overwrite=1
>> 	pre_read=1
>> 	direct=0
>> 	bs=4k
>> 	size=1G
>> 	dir=./
>> 	numjobs=8
>> 	fdatasync=1
>> 	runtime=60
>> 	iodepth=64
>> 	group_reporting=1
>>
>> [fio-run]
>>
>> 2. Also our internal performance team reported that this patch improves there
>>    database workload performance by around ~83% (with XFS on Power)
>>
>> Reported-by: Aravinda Herle <araherle@xxxxxxxxxx>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
>> ---
>>  fs/gfs2/aops.c         |   2 +-
>>  fs/iomap/buffered-io.c | 104 +++++++++++++++++++++++++++++++++++++----
>>  fs/xfs/xfs_aops.c      |   2 +-
>>  fs/zonefs/super.c      |   2 +-
>>  include/linux/iomap.h  |   1 +
>>  5 files changed, 99 insertions(+), 12 deletions(-)
>>
> ...
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index e0b0be16278e..fb55183c547f 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
> ...
>> @@ -1630,7 +1715,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		struct writeback_control *wbc, struct inode *inode,
>>  		struct folio *folio, u64 end_pos)
>>  {
>> -	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
>> +	struct iomap_page *iop = iomap_page_create(inode, folio, 0, true);
>>  	struct iomap_ioend *ioend, *next;
>>  	unsigned len = i_blocksize(inode);
>>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
>> @@ -1646,7 +1731,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !iop_test_uptodate(iop, i, nblocks))
>> +		if (iop && !iop_test_dirty(iop, i, nblocks))
>>  			continue;
>>
>>  		error = wpc->ops->map_blocks(wpc, inode, pos);
>
> Hi Ritesh,
>
> I'm not sure if you followed any of the discussion on the imap
> revalidation series that landed in the last cycle or so, but the
> associated delalloc punch error handling code has a subtle dependency on
> current writeback behavior and thus left a bit of a landmine for this
> work. For reference, more detailed discussion starts around here [1].
> The context is basically that the use of mapping seek hole/data relies
> on uptodate status, which means in certain error cases the filesystem
> might allocate a delalloc block for a write, but not punch it out of the
> associated write happens to fail and the underlying portion of the folio
> was uptodate.
>
> This doesn't cause a problem in current mainline because writeback maps
> every uptodate block in a dirty folio, and so the delalloc block will
> convert at writeback time even though it wasn't written. This no longer
> occurs with the change above, which means there's a vector for a stale
> delalloc block to be left around in the inode. This is a free space
> accounting corruption issue on XFS. Here's a quick example [2] on a 1k
> FSB XFS filesystem to show exactly what I mean:
>
> # xfs_io -fc "truncate 4k" -c "mmap 0 4k" -c "mread 0 4k" -c "pwrite 0 1" -c "pwrite -f 2k 1" -c fsync /mnt/file
> # xfs_io -c "fiemap -v" /mnt/file
> /mnt/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> ...
>    2: [4..5]:          0..1                 2   0x7
> ...
> (the above shows delalloc after an fsync)
> # umount /mnt
>   kernel:XFS: Assertion failed: xfs_is_shutdown(mp) || percpu_counter_sum(&mp->m_delalloc_blks) == 0, file: fs/xfs/xfs_super.c, line: 1068
> # xfs_repair -n /dev/vdb2
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> ...
> sb_fdblocks 20960187, counted 20960195
> ...
> #
>
> I suspect this means either the error handling code needs to be updated
> to consider dirty state (i.e. punch delalloc if the block is !dirty), or
> otherwise this needs to depend on a broader change in XFS to reclaim
> delalloc blocks before inode eviction (along the lines of Dave's recent
> proposal to do something like that for corrupted inodes). Of course the
> caveat with the latter is that doesn't help for any other filesystems
> (?) that might have similar expectations for delayed allocation and want
> to use iomap.
>
> Brian
>
> [1] https://lore.kernel.org/linux-fsdevel/Y3TsPzd0XzXXIzQv@bfoster/
>
> [2] This test case depends on a local xfs_io hack to co-opt the -f flag
> into inducing a write failure. A POC patch for that is available here,
> if you wanted to replicate:
>
> https://lore.kernel.org/linux-xfs/20221123181322.3710820-1-bfoster@xxxxxxxxxx/

Thanks a lot Brian for such detailed info along with a test case.
Really appreciate your insights on this! :)
And apologies for not getting to it earlier. I picked up iomap DIO conversion for
ext2 in between. Now that it is settled, I will be working on addressing
this problem.

Yes, I did follow that patch series which you pointed, but I guess
I mostly thought it was the stale iomap problem which we were solving.
But you are right the approach we take in that patch series to punch out
the delalloc blocks in case of short writes does open up a potential bug
when we add subpage size dirty tracking in iomap, which was also
discussed during the time and I had marked it as well. But I guess I had
completely forgotten about it. So thanks for bringing it up again.

I looked into the Dave's series and what we are trying to do there is
identifying uptodate folios within the [start, end) range and when if
there is a dirty folio within that range, then we punch out all of the
data blocks from *punch_start_byte uptil before this dirty folio byte.
(because these dirty folios anyways contain uptodate dirty data which
can be written back at the time of writeback. Any uptodate non-dirty
folios can be punched out to ensure we don't leak any delalloc blocks
due to short writes).
The only problem here is...when we have subpage size blocks within a
dirty folio, it can still have subblocks within the folio which are
only marked uptodate but not dirty.
Given that we never punch out these uptodate non-dirty subblocks
from within this folio, this can leave a potential delalloc blocks leak
because at the time of writeback we only considers writing subblocks
of a folio which are marked dirty (in this patch series which adds
subpage size dirty tracking). This can then cause bug on problems as you
pointed out during unmount.

I have a short fix for this problem i.e. similar to how we do writeback i.e.
to consider the dirty state of every subblock within a folio, during
punch out operation in iomap_write_delalloc_release() ->
iomap_write_delalloc_scan().
So when we identify a dirty folio in above delalloc release path, we
should iterate over each subblock of that folio as well to punch out non-dirty
blocks. This should make sure that we don't have any uptodate non-dirty
subblocks in the entire given range from [start, end).

I have also addressed many of the other smaller review comments from
others in v3. I will include this change as well will send a new
revision for review (after some cleanups and some initial testing).

Thanks!
-ritesh



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux