On Wed, Nov 27, 2024 at 02:35:02PM +0800, Long Li wrote: > During concurrent append writes to XFS filesystem, zero padding data > may appear in the file after power failure. This happens due to imprecise > disk size updates when handling write completion. > > Consider this scenario with concurrent append writes same file: > > Thread 1: Thread 2: > ------------ ----------- > write [A, A+B] > update inode size to A+B > submit I/O [A, A+BS] > write [A+B, A+B+C] > update inode size to A+B+C > <I/O completes, updates disk size to min(A+B+C, A+BS)> > <power failure> > > After reboot: > 1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C] > > |< Block Size (BS) >| > |DDDDDDDDDDDDDDDD0000000000000000| > ^ ^ ^ > A A+B A+B+C > (EOF) > > 2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS] > > |< Block Size (BS) >|< Block Size (BS) >| > |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000| > ^ ^ ^ ^ > A A+B A+BS A+B+C > (EOF) > > D = Valid Data > 0 = Zero Padding > > The issue stems from disk size being set to min(io_offset + io_size, > inode->i_size) at I/O completion. Since io_offset+io_size is block > size granularity, it may exceed the actual valid file data size. In > the case of concurrent append writes, inode->i_size may be larger > than the actual range of valid file data written to disk, leading to > inaccurate disk size updates. > > This patch modifies the meaning of io_size to represent the size of > valid data within EOF in an ioend. If the ioend spans beyond i_size, > io_size will be trimmed to provide the file with more accurate size > information. This is particularly useful for on-disk size updates > at completion time. > > After this change, ioends that span i_size will not grow or merge with > other ioends in concurrent scenarios. However, these cases that need > growth/merging rarely occur and it seems no noticeable performance impact. > Although rounding up io_size could enable ioend growth/merging in these > scenarios, we decided to keep the code simple after discussion [1]. > > Another benefit is that it makes the xfs_ioend_is_append() check more > accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio() > in certain scenarios, such as repeated writes at the file tail without > extending the file size. > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Er... filesystem iomap wasn't in 2.6.12, how did you come up with this? At most this is a fix against a roughly 6.1 era kernel, right? > Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@xxxxxxxxxx > Signed-off-by: Long Li <leo.lilong@xxxxxxxxxx> > Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> > --- > v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for > ioend growth/merging to keep the code simple. > fs/iomap/buffered-io.c | 10 ++++++++++ > include/linux/iomap.h | 2 +- > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index d42f01e0fc1c..dc360c8e5641 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -1774,6 +1774,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, > { > struct iomap_folio_state *ifs = folio->private; > size_t poff = offset_in_folio(folio, pos); > + loff_t isize = i_size_read(inode); > int error; > > if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) { > @@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, > > if (ifs) > atomic_add(len, &ifs->write_bytes_pending); > + > + /* > + * If the ioend spans i_size, trim io_size to the former to provide > + * the fs with more accurate size information. This is useful for > + * completion time on-disk size updates. I think it's useful to preserve the diagram showing exactly what problem you're solving: /* * Clamp io_offset and io_size to the incore EOF so that ondisk * file size updates in the ioend completion are byte-accurate. * This avoids recovering files with zeroed tail regions when * writeback races with appending writes: * * Thread 1: Thread 2: * ------------ ----------- * write [A, A+B] * update inode size to A+B * submit I/O [A, A+BS] * write [A+B, A+B+C] * update inode size to A+B+C * <I/O completes, updates disk size to min(A+B+C, A+BS)> * <power failure> * * After reboot: * 1) with A+B+C < A+BS, the file has zero padding in range * [A+B, A+B+C] * * |< Block Size (BS) >| * |DDDDDDDDDDDD00000000000000| * ^ ^ ^ * A A+B A+B+C * (EOF) * * 2) with A+B+C > A+BS, the file has zero padding in range * [A+B, A+BS] * * |< Block Size (BS) >|< Block Size (BS) >| * |DDDDDDDDDDDD00000000000000|000000000000000000000000000| * ^ ^ ^ ^ * A A+B A+BS A+B+C * (EOF) * * D = Valid Data * 0 = Zero Padding * * Note that this defeats the ability to chain the ioends of * appending writes. */ (I reduced the blocksize a bit for wrapping purposes) The logic looks ok, but I'm curious about how you landed at 2.6.12-rc for the fixes tag. --D > + */ > wpc->ioend->io_size += len; > + if (wpc->ioend->io_offset + wpc->ioend->io_size > isize) > + wpc->ioend->io_size = isize - wpc->ioend->io_offset; > + > wbc_account_cgroup_owner(wbc, folio, len); > return 0; > } > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 5675af6b740c..75bf54e76f3b 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -335,7 +335,7 @@ struct iomap_ioend { > u16 io_type; > u16 io_flags; /* IOMAP_F_* */ > struct inode *io_inode; /* file being written to */ > - size_t io_size; /* size of the extent */ > + size_t io_size; /* size of data within eof */ > loff_t io_offset; /* offset in the file */ > sector_t io_sector; /* start sector of ioend */ > struct bio io_bio; /* MUST BE LAST! */ > -- > 2.39.2 >