On Fri, Nov 22, 2024 at 09:19:53AM -0500, Brian Foster wrote: > On Thu, Nov 21, 2024 at 02:34:29PM +0800, Long Li wrote: > > During concurrent append writes to XFS filesystem, zero padding data > > may appear in the file after power failure. This happens due to imprecise > > disk size updates when handling write completion. > > > > Consider this scenario with concurrent append writes same file: > > > > Thread 1: Thread 2: > > ------------ ----------- > > write [A, A+B] > > update inode size to A+B > > submit I/O [A, A+BS] > > write [A+B, A+B+C] > > update inode size to A+B+C > > <I/O completes, updates disk size to min(A+B+C, A+BS)> > > <power failure> > > > > After reboot: > > 1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C] > > > > |< Block Size (BS) >| > > |DDDDDDDDDDDDDDDD0000000000000000| > > ^ ^ ^ > > A A+B A+B+C > > (EOF) > > > > 2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS] > > > > |< Block Size (BS) >|< Block Size (BS) >| > > |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000| > > ^ ^ ^ ^ > > A A+B A+BS A+B+C > > (EOF) > > > > D = Valid Data > > 0 = Zero Padding > > > > The issue stems from disk size being set to min(io_offset + io_size, > > inode->i_size) at I/O completion. Since io_offset+io_size is block > > size granularity, it may exceed the actual valid file data size. In > > the case of concurrent append writes, inode->i_size may be larger > > than the actual range of valid file data written to disk, leading to > > inaccurate disk size updates. > > > > This patch changes the meaning of io_size to represent the size of > > valid data in ioend, while the extent size of ioend can be obtained > > by rounding up based on block size. It ensures more precise disk > > size updates and avoids the zero padding issue. Another benefit is > > that it makes the xfs_ioend_is_append() check more accurate, which > > can reduce unnecessary end bio callbacks of xfs_end_bio() in certain > > scenarios, such as repeated writes at the file tail without extending > > the file size. > > > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") > > Signed-off-by: Long Li <leo.lilong@xxxxxxxxxx> > > --- > > v2->v3: > > 1. Modify commit message and add the description of A+B+C > A+BS > > 2. Rename iomap_ioend_extent_size() to iomap_ioend_size_aligned() > > 3. Move iomap_ioend_size_aligned to buffered-io.c and avoid exposed > > to new users. > > 4. Add comment for rounding up io_size to explain when/why use it > > > > Thanks for the tweaks.. > > > fs/iomap/buffered-io.c | 43 ++++++++++++++++++++++++++++++++++++------ > > include/linux/iomap.h | 2 +- > > 2 files changed, 38 insertions(+), 7 deletions(-) > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index d42f01e0fc1c..3f59dfb4d58d 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > > @@ -1593,12 +1593,35 @@ iomap_finish_ioends(struct iomap_ioend *ioend, int error) > > } > > EXPORT_SYMBOL_GPL(iomap_finish_ioends); > > > > +/* > > + * Calculates the extent size of an ioend by rounding up to block size. When > > + * the last block in the ioend's extent contains the file EOF and the EOF is > > + * not block-aligned, the io_size will not be block-aligned. > > + * > > + * This function is specifically used for ioend grow/merge management: > > + * 1. In concurrent writes, when one write's io_size is truncated due to > > + * non-block-aligned file size while another write extends the file size, > > + * if these two writes are physically and logically contiguous at block > > + * boundaries, rounding up io_size to block boundaries helps grow the > > + * first write's ioend and share this ioend between both writes. > > + * 2. During IO completion, we try to merge physically and logically > > + * contiguous ioends before completion to minimize the number of > > + * transactions. Rounding up io_size to block boundaries helps merge > > + * ioends whose io_size is not block-aligned. > > + */ > > I might suggest to simplify this and maybe split off a comment to where > the ioend size is trimmed as well. For example: > > "Calculate the physical size of an ioend by rounding up to block > granularity. io_size might be unaligned if the last block crossed an > unaligned i_size boundary at creation time." Ok, I want to move the explain of iomap_ioend_size_aligned() to commit message. > > > +static inline size_t iomap_ioend_size_aligned(struct iomap_ioend *ioend) > > +{ > > + return round_up(ioend->io_size, i_blocksize(ioend->io_inode)); > > +} > > + > > /* > > * We can merge two adjacent ioends if they have the same set of work to do. > > */ > > static bool > > iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next) > > { > > + size_t size = iomap_ioend_size_aligned(ioend); > > + > > if (ioend->io_bio.bi_status != next->io_bio.bi_status) > > return false; > > if (next->io_flags & IOMAP_F_BOUNDARY) > ... > > @@ -1784,12 +1810,17 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, > > wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos); > > } > > > > - if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff)) > > + ioend = wpc->ioend; > > + if (!bio_add_folio(&ioend->io_bio, folio, len, poff)) > > goto new_ioend; > > > > if (ifs) > > atomic_add(len, &ifs->write_bytes_pending); > > - wpc->ioend->io_size += len; > > + > > And here.. > > "If the ioend spans i_size, trim io_size to the former to provide the fs > with more accurate size information. This is useful for completion time > on-disk size updates." > it's fine to me. > > + ioend->io_size = iomap_ioend_size_aligned(ioend) + len; > > + if (ioend->io_offset + ioend->io_size > isize) > > + ioend->io_size = isize - ioend->io_offset; > > + > > wbc_account_cgroup_owner(wbc, folio, len); > > return 0; > > } > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > > index 5675af6b740c..956a0f7c2a8d 100644 > > --- a/include/linux/iomap.h > > +++ b/include/linux/iomap.h > > @@ -335,7 +335,7 @@ struct iomap_ioend { > > u16 io_type; > > u16 io_flags; /* IOMAP_F_* */ > > struct inode *io_inode; /* file being written to */ > > - size_t io_size; /* size of the extent */ > > + size_t io_size; /* size of valid data */ > > "valid data" is kind of unclear to me. Maybe just "size of data within > eof..?" It's fine to me. > > But those are just suggestions. Feel free to take, leave or reword any > of it, and this LGTM regardless: > > Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> Thanks for your review, I'll update it as suggested in next version. Long Li