On Fri, Mar 01, 2024 at 10:19:30AM +1100, Dave Chinner wrote: > On Thu, Feb 29, 2024 at 04:59:34PM +0800, Zhang Yi wrote: > > Hello, Dave! > > > > On 2024/2/29 6:25, Dave Chinner wrote: > > > On Wed, Feb 28, 2024 at 04:53:32PM +0800, Zhang Yi wrote: > > >> On 2024/2/13 13:46, Christoph Hellwig wrote: > > >>> Wouldn't it make more sense to just move the size manipulation to the > > >>> write-only code? An untested version of that is below. With this > > >>> the naming of the status variable becomes even more confusing than > > >>> it already is, maybe we need to do a cleanup of the *_write_end > > >>> calling conventions as it always returns the passed in copied value > > >>> or 0. > > >>> > > >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > >>> index 3dab060aed6d7b..8401a9ca702fc0 100644 > > >>> --- a/fs/iomap/buffered-io.c > > >>> +++ b/fs/iomap/buffered-io.c > > >>> @@ -876,34 +876,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len, > > >>> size_t copied, struct folio *folio) > > >>> { > > >>> const struct iomap *srcmap = iomap_iter_srcmap(iter); > > >>> - loff_t old_size = iter->inode->i_size; > > >>> - size_t ret; > > >>> - > > >>> - if (srcmap->type == IOMAP_INLINE) { > > >>> - ret = iomap_write_end_inline(iter, folio, pos, copied); > > >>> - } else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) { > > >>> - ret = block_write_end(NULL, iter->inode->i_mapping, pos, len, > > >>> - copied, &folio->page, NULL); > > >>> - } else { > > >>> - ret = __iomap_write_end(iter->inode, pos, len, copied, folio); > > >>> - } > > >>> - > > >>> - /* > > >>> - * Update the in-memory inode size after copying the data into the page > > >>> - * cache. It's up to the file system to write the updated size to disk, > > >>> - * preferably after I/O completion so that no stale data is exposed. > > >>> - */ > > >>> - if (pos + ret > old_size) { > > >>> - i_size_write(iter->inode, pos + ret); > > >>> - iter->iomap.flags |= IOMAP_F_SIZE_CHANGED; > > >>> - } > > >> > > >> I've recently discovered that if we don't increase i_size in > > >> iomap_zero_iter(), it would break fstests generic/476 on xfs. xfs > > >> depends on iomap_zero_iter() to increase i_size in some cases. > > >> > > >> generic/476 75s ... _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r) > > >> (see /home/zhangyi/xfstests-dev/results//xfs/generic/476.full for details) > > >> > > >> _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r) > > >> *** xfs_repair -n output *** > > >> Phase 1 - find and verify superblock... > > >> Phase 2 - using internal log > > >> - zero log... > > >> - scan filesystem freespace and inode maps... > > >> sb_fdblocks 10916, counted 10923 > > >> - found root inode chunk > > >> ... > > >> > > >> After debugging and analysis, I found the root cause of the problem is > > >> related to the pre-allocations of xfs. xfs pre-allocates some blocks to > > >> reduce fragmentation during buffer append writing, then if we write new > > >> data or do file copy(reflink) after the end of the pre-allocating range, > > >> xfs would zero-out and write back the pre-allocate space(e.g. > > >> xfs_file_write_checks() -> xfs_zero_range()), so we have to update > > >> i_size before writing back in iomap_zero_iter(), otherwise, it will > > >> result in stale delayed extent. > > > > > > Ok, so this is long because the example is lacking in clear details > > > so to try to understand it I've laid it out in detail to make sure > > > I've understood it correctly. > > > > > > > Thanks for the graph, the added detail makes things clear and easy to > > understand. To be honest, it's not exactly the same as the results I > > captured and described (the position A\B\C\D\E\F I described is > > increased one by one), but the root cause of the problem is the same, > > so it doesn't affect our understanding of the problem. > > OK, good :) > > ..... > > > > However, if this did actually write zeroes to disk, this would end > > > up with: > > > > > > A C B E F D > > > +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+ > > > EOF EOF > > > (in memory) (on disk) > > > > > > Which is wrong - the file extension and zeros should not get exposed > > > to the user until the entire reflink completes. This would expose > > > zeros at the EOF and a file size that the user never asked for after > > > a crash. Experience tells me that they would report this as > > > "filesystem corrupting data on crash". > > > > > > If we move where i_size gets updated by iomap_zero_iter(), we get: > > > > > > A C B E F D > > > +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+ > > > EOF > > > (in memory) > > > (on disk) > > > > > > Which is also wrong, because now the user can see the size change > > > and read zeros in the middle of the clone operation, which is also > > > wrong. > > > > > > IOWs, we do not want to move the in-memory or on-disk EOF as a > > > result of zeroing delalloc extents beyond EOF as it opens up > > > transient, non-atomic on-disk states in the event of a crash. > > > > > > So, catch-22: we need to move the in-memory EOF to write back zeroes > > > beyond EOF, but that would move the on-disk EOF to E before the > > > clone operation starts. i.e. it makes clone non-atomic. > > > > Make sense. IIUC, I also notice that xfs_file_write_checks() zero > > out EOF blocks if the later write offset is beyond the size of the > > file. Think about if we replace the reflink operation to a buffer > > write E to F, although it doesn't call xfs_flush_unmap_range() > > directly, but if it could be raced by another background write > > back, and trigger the same problem (I've not try to reproduce it, > > so please correct me if I understand wrong). > > Correct, but the write is about to extend the file size when it > writes into the cache beyond the zeroed region. There is no cache > invalidate possible in this path, so the write of data moves the > in-memory EOF past the zeroes in cache and everything works just > fine. > > If it races with concurrent background writeback, the writeback will > skip the zeroed range beyond EOF until they are exposed by the first > data write beyond the zeroed post-eof region which moves the > in-memory EOF. > > truncate(to a larger size) also does this same zeroing - the page > cache is zeroed before we move the EOF in memory, and so the > writeback will only occur once the in-memory EOF is moved. i.e. it > effectively does: > > xfs_zero_range(oldsize to newsize) > truncate_setsize(newsize) > filemap_write_and_wait_range(old size to new size) > > > > What should acutally result from the iomap_zero_range() call from > > > xfs_reflink_remap_prep() is a state like this: > > > > > > A C B E F D > > > +wwwwwwwwww+DDDDDDDDDDD+uuuuu+rrrrrrr+dddddd+ > > > EOF EOF > > > (on disk) (in memory) > > > > > > where 'u' are unwritten extent blocks. > > > > > > > Yeah, this is a good solution. > > > > In xfs_file_write_checks(), I don't fully understand why we need > > the xfs_zero_range(). > > The EOF block may only be partially written. Hence on extension, we > have to guarantee the part of that block beyond the current EOF is > zero if the write leaves a hole between the current EOF and the > start of the new extending write. > > > Theoretically, iomap have already handled > > partial block zeroing for both buffered IO and DIO, so I guess > > the only reason we still need it is to handle pre-allocated blocks > > (no?). > > Historically speaking, Linux is able to leak data beyond EOF on > writeback of partial EOF blocks (e.g. mmap() can write to the EOF > page beyond EOF without failing. We try to mitigate these cases > where we can, but we have to consider that at any time the data in > the cache beyond EOF can be non-zero thanks to mmap() and so any > file extension *must* zero any region beyond EOF cached in the page > cache. > > > If so,would it be better to call xfs_free_eofblocks() to > > release all the preallocated extents in range? If not, maybe we > > could only zero out mapped partial blocks and also release > > preallocated extents? > > No, that will cause all sorts of other performance problems, > especially for reflinked files that triggering COW > operations... > > > > > In xfs_reflink_remap_prep(), I read the commit 410fdc72b05a ("xfs: > > zero posteof blocks when cloning above eof"), xfs used to release > > preallocations, the change log said it didn't work because of the > > PREALLOC flag, but the 'force' parameter is 'true' when calling > > xfs_can_free_eofblocks(), so I don't get the problem met. Could we > > fall back to use xfs_free_eofblocks() and make a state like this? > > > > A C B E F D > > +wwwwwwwwww+DDDDDDDDDDD+hhhhh+rrrrrrr+dddddd+ > > EOF EOF > > (on disk) (in memory) > > It could, but that then requires every place that may call > xfs_zero_range() to be aware of this need to trim EOF blocks to do > the right thing in all cases. We don't want to remove speculative > delalloc in the write() path nor in the truncate(up) case, and so it > doesn't fix the general problem of zeroing specualtive delalloc > beyond EOF requiring writeback to push page caceh pages to disk > before the inode size has been updated. > > The general solution is to have zeroing of speculative prealloc > extents beyond EOF simply convert the range to unwritten and then > invalidate any cached pages over that range. At this point, we are > guaranteed to have zeroes across that range, all without needing to > do any IO at all... That (separate iomap ops that do the delalloc -> unwritten allocation) seems a lot more straightforward to me than whacking preallocations. --D > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >