Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Nov 27, 2024 at 02:35:02PM +0800, Long Li wrote:
> During concurrent append writes to XFS filesystem, zero padding data
> may appear in the file after power failure. This happens due to imprecise
> disk size updates when handling write completion.
> 
> Consider this scenario with concurrent append writes same file:
> 
>   Thread 1:                  Thread 2:
>   ------------               -----------
>   write [A, A+B]
>   update inode size to A+B
>   submit I/O [A, A+BS]
>                              write [A+B, A+B+C]
>                              update inode size to A+B+C
>   <I/O completes, updates disk size to min(A+B+C, A+BS)>
>   <power failure>
> 
> After reboot:
>   1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]
> 
>   |<         Block Size (BS)      >|
>   |DDDDDDDDDDDDDDDD0000000000000000|
>   ^               ^        ^
>   A              A+B     A+B+C
>                          (EOF)
> 
>   2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]
> 
>   |<         Block Size (BS)      >|<           Block Size (BS)    >|
>   |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
>   ^               ^                ^               ^
>   A              A+B              A+BS           A+B+C
>                                   (EOF)
> 
>   D = Valid Data
>   0 = Zero Padding
> 
> The issue stems from disk size being set to min(io_offset + io_size,
> inode->i_size) at I/O completion. Since io_offset+io_size is block
> size granularity, it may exceed the actual valid file data size. In
> the case of concurrent append writes, inode->i_size may be larger
> than the actual range of valid file data written to disk, leading to
> inaccurate disk size updates.
> 
> This patch modifies the meaning of io_size to represent the size of
> valid data within EOF in an ioend. If the ioend spans beyond i_size,
> io_size will be trimmed to provide the file with more accurate size
> information. This is particularly useful for on-disk size updates
> at completion time.
> 
> After this change, ioends that span i_size will not grow or merge with
> other ioends in concurrent scenarios. However, these cases that need
> growth/merging rarely occur and it seems no noticeable performance impact.
> Although rounding up io_size could enable ioend growth/merging in these
> scenarios, we decided to keep the code simple after discussion [1].
> 
> Another benefit is that it makes the xfs_ioend_is_append() check more
> accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio()
> in certain scenarios, such as repeated writes at the file tail without
> extending the file size.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

Er... filesystem iomap wasn't in 2.6.12, how did you come up with this?
At most this is a fix against a roughly 6.1 era kernel, right?

> Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@xxxxxxxxxx
> Signed-off-by: Long Li <leo.lilong@xxxxxxxxxx>
> Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
> v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for
> 	ioend growth/merging to keep the code simple. 
>  fs/iomap/buffered-io.c | 10 ++++++++++
>  include/linux/iomap.h  |  2 +-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d42f01e0fc1c..dc360c8e5641 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1774,6 +1774,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  {
>  	struct iomap_folio_state *ifs = folio->private;
>  	size_t poff = offset_in_folio(folio, pos);
> +	loff_t isize = i_size_read(inode);
>  	int error;
>  
>  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
> @@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  
>  	if (ifs)
>  		atomic_add(len, &ifs->write_bytes_pending);
> +
> +	/*
> +	 * If the ioend spans i_size, trim io_size to the former to provide
> +	 * the fs with more accurate size information. This is useful for
> +	 * completion time on-disk size updates.

I think it's useful to preserve the diagram showing exactly what problem
you're solving:

	/*
	 * Clamp io_offset and io_size to the incore EOF so that ondisk
	 * file size updates in the ioend completion are byte-accurate.
	 * This avoids recovering files with zeroed tail regions when
	 * writeback races with appending writes:
	 *
	 *    Thread 1:                  Thread 2:
	 *    ------------               -----------
	 *    write [A, A+B]
	 *    update inode size to A+B
	 *    submit I/O [A, A+BS]
	 *                               write [A+B, A+B+C]
	 *                               update inode size to A+B+C
	 *    <I/O completes, updates disk size to min(A+B+C, A+BS)>
	 *    <power failure>
	 *
	 *  After reboot:
	 *    1) with A+B+C < A+BS, the file has zero padding in range
	 *       [A+B, A+B+C]
	 *
	 *    |<     Block Size (BS)    >|
	 *    |DDDDDDDDDDDD00000000000000|
	 *    ^           ^        ^
	 *    A          A+B     A+B+C
	 *                       (EOF)
	 *
	 *    2) with A+B+C > A+BS, the file has zero padding in range
	 *       [A+B, A+BS]
	 *
	 *    |<     Block Size (BS)    >|<      Block Size (BS)    >|
	 *    |DDDDDDDDDDDD00000000000000|000000000000000000000000000|
	 *    ^           ^              ^           ^
	 *    A          A+B            A+BS       A+B+C
	 *                              (EOF)
	 *
	 *    D = Valid Data
	 *    0 = Zero Padding
	 *
	 * Note that this defeats the ability to chain the ioends of
	 * appending writes.
	 */

(I reduced the blocksize a bit for wrapping purposes)

The logic looks ok, but I'm curious about how you landed at 2.6.12-rc
for the fixes tag.

--D

> +	 */
>  	wpc->ioend->io_size += len;
> +	if (wpc->ioend->io_offset + wpc->ioend->io_size > isize)
> +		wpc->ioend->io_size = isize - wpc->ioend->io_offset;
> +
>  	wbc_account_cgroup_owner(wbc, folio, len);
>  	return 0;
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5675af6b740c..75bf54e76f3b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -335,7 +335,7 @@ struct iomap_ioend {
>  	u16			io_type;
>  	u16			io_flags;	/* IOMAP_F_* */
>  	struct inode		*io_inode;	/* file being written to */
> -	size_t			io_size;	/* size of the extent */
> +	size_t			io_size;	/* size of data within eof */
>  	loff_t			io_offset;	/* offset in the file */
>  	sector_t		io_sector;	/* start sector of ioend */
>  	struct bio		io_bio;		/* MUST BE LAST! */
> -- 
> 2.39.2
> 




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux