On Thu, Dec 01, 2022 at 11:52:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.
> 
> XXX: how do we detect an iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so the trigger for the
> page cache only lookup in iomap_zero_iter() needs to handle this
> case as well.
> 

I think iomap generally would report the COW mapping in srcmap for
that case. The problem I suspect is that if XFS sees a hole in the
data fork for an IOMAP_ZERO operation, it doesn't bother checking the
COW fork at all and just returns the data fork hole. That probably
needs to be reworked to perform IOMAP_ZERO lookups similar to those
for a normal buffered write, otherwise iomap_zero_range() would fail
to zero in the particular case where a COW prealloc might overlap a
hole.
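To illustrate what I mean by that (rough, untested sketch only, not
something from this patch): the hole shortcut in
xfs_buffered_write_iomap_begin() could defer reporting the hole until
the COW fork has been checked, reusing the variables already in scope
there, e.g.:

	/* We never need to allocate blocks for zeroing a hole... */
	if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
		xfs_fileoff_t	hole_end = imap.br_startoff;

		if (xfs_is_cow_inode(ip) && ip->i_cowfp &&
		    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
				&ccur, &cmap)) {
			/* a COW prealloc overlaps the hole: report it */
			if (cmap.br_startoff <= offset_fsb)
				goto found_cow;
			/* COW extent further in: trim the hole to it */
			if (cmap.br_startoff < hole_end)
				hole_end = cmap.br_startoff;
		}
		xfs_hole_to_iomap(ip, iomap, offset_fsb, hole_end);
		goto out_unlock;
	}

The found_cow path would still have to learn to cope with imap being
a hole rather than a real extent, so this is just a sketch of the
lookup ordering, not a complete fix.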
Either way, I agree with the follow-on point that the iomap map
naming/reporting wrt COW is far too confusing to keep straight.

> Before:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m14.103s
> user    0m0.015s
> sys     0m0.020s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  85.90    0.847616          16     50000           ftruncate
>  14.01    0.138229           2     50000           pwrite64
> ....
> 
> After:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m0.144s
> user    0m0.021s
> sys     0m0.012s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.86    0.505964          10     50000           ftruncate
>  46.12    0.433251           8     50000           pwrite64
> ....
> 
> Yup, we get back all the performance.
> 
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
> 
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@xxxxxxxxxx/
> 
> With this command:
> 
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
> 
> Before:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec)
> 
> After:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)
> 
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
> 
> This has passed through most of fstests on a couple of test VMs
> without issues at the moment, so I think this approach to fixing the
> issue is going to be solid once we've worked out how to detect the
> COW-hole mapping case.
> 

Does fstests currently have any coverage for zero range operations
over multiple folios? I'm not sure that it does. I suspect the reason
this problem went undetected for so long is that the primary uses of
zero range (i.e. falloc) always flush and invalidate the range. The
only exceptions that I'm aware of are the eof boundary handling
scenarios in XFS (i.e. write past eof, truncate), and those typically
only expect to handle a single folio at the eof block.

Given there are also no test cases for the historical and subtle
stale data exposure problems with this code, I suspect we'd want to
come up with at least one test case that can effectively stress new
zero range logic. Perhaps fallocate zero range could elide the cache
flush/inval? Even if that were just an occasional/random debug mode
thing, it would add some fsx coverage from the various associated
tests, which would be better than what we have now.

> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/iomap/buffered-io.c | 50 +++++++++++++++++++++++++++++++++++-------
>  fs/xfs/xfs_iops.c      | 12 +---------
>  2 files changed, 43 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 356193e44cf0..0969e4525507 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c

...

> @@ -791,7 +796,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  			break;
>  		}
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> +		status = iomap_write_begin(iter, pos, bytes, &folio, FGP_CREAT);
>  		if (unlikely(status))
>  			break;
>  		if (iter->iomap.flags & IOMAP_F_STALE)

Not necessarily related to this patch, but this looks like it creates
a bit of a negative feedback loop for buffered writes. I.e., a
streaming buffered write workload performs delayed allocation, XFS
converts delalloc to unwritten extents, writeback comes in behind and
converts unwritten extents in ioend batch size chunks, and each
extent conversion updates the data fork sequence number and leads to
spurious invalidations on subsequent writes.

I don't know if that will ever lead to enough of a hit that it might
be observable on a real workload, or otherwise if it can just be
chalked up to being like any other background work that comes as a
side effect of user operations (i.e. balance dirty pages, writeback,
etc.). Just out of curiosity and to explore potential worst case
characteristics, I ran a quick test to spin up a bunch of background
flushers of a file while doing a larger streaming write. This seems
able to trigger a 40-50% drop in buffered write throughput compared
to the same workload without stale iomap detection, fwiw. Obviously
that is not testing a sane workload, but it is probably something to
keep in mind if somebody happens to observe/report that sort of
behavior in the near future.
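For reference, the test amounts to something like the following
(minimal sketch of that sort of test, not the exact reproducer;
thread count, the sync_file_range() based flush and the sizes are
all arbitrary; build with -pthread):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_FLUSHERS	4
#define FILE_SIZE	(1ULL << 30)	/* 1GB streaming write */
#define BUF_SIZE	(1UL << 20)	/* 1MB per pwrite() */

static int fd;
static volatile int done;

/*
 * Repeatedly kick writeback over the whole file. Each unwritten
 * extent conversion behind the write stream bumps the data fork
 * sequence number and so invalidates cached iomaps.
 */
static void *flusher(void *arg)
{
	(void)arg;
	while (!done)
		sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t threads[NR_FLUSHERS];
	char *buf;
	off_t off;
	int i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	buf = malloc(BUF_SIZE);
	memset(buf, 0x58, BUF_SIZE);

	for (i = 0; i < NR_FLUSHERS; i++)
		pthread_create(&threads[i], NULL, flusher, NULL);

	/* The streaming buffered write whose throughput we measure. */
	for (off = 0; off < FILE_SIZE; off += BUF_SIZE) {
		if (pwrite(fd, buf, BUF_SIZE, off) != BUF_SIZE) {
			perror("pwrite");
			break;
		}
	}

	done = 1;
	for (i = 0; i < NR_FLUSHERS; i++)
		pthread_join(threads[i], NULL);
	free(buf);
	close(fd);
	return 0;
}

Timing the write loop with and without stale iomap detection is
where the 40-50% delta above comes from.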
...

> @@ -1158,9 +1167,20 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		size_t offset;
>  		size_t bytes = min_t(u64, SIZE_MAX, length);
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> -		if (status)
> +		status = iomap_write_begin(iter, pos, bytes, &folio, fgp);
> +		if (status) {
> +			if (status == -ENODATA) {
> +				/*
> +				 * No folio was found, so skip to the start of
> +				 * the next potential entry in the page cache
> +				 * and continue from there.
> +				 */
> +				if (bytes > PAGE_SIZE - offset_in_page(pos))
> +					bytes = PAGE_SIZE - offset_in_page(pos);
> +				goto loop_continue;
> +			}

I don't think this logic is safe, because nothing prevents folio
eviction between the mapping lookup and the folio lookup. So if
there's a dirty folio over an unwritten extent, iomap looks up the
unwritten extent, and the folio is written back and evicted before
this lookup, we skip zeroing data on disk when we shouldn't. That
reintroduces a variant of the stale data exposure vector the flush in
xfs_setattr_size() is intended to prevent.

If we wanted to use stale iomap detection for this purpose, one
possible approach might be to try to support a revalidation check
outside of locked folio context (because there's no guarantee a
locked folio will exist for a mapping that might have become stale).
With something like that available, uncached ranges would just have
to be revalidated before they can be tracked as processed by the
iter. That might not be too much trouble to do here, since I think
the existing stale check (if a folio is found) would indirectly
validate any preceding uncached range. The iter function would just
have to process the 'written' count appropriately and handle the
added scenario where an uncached range must be directly revalidated.
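As a rough sketch of what I mean (the helper name is made up, and
this assumes the ->iomap_valid() hook added earlier in this series
can safely be called without a locked folio):

/*
 * Revalidate the current iomap over an uncached range, i.e. after
 * iomap_write_begin() returned -ENODATA, before that range is
 * accounted as 'written' by the iter.
 */
static bool iomap_revalidate_range(struct iomap_iter *iter)
{
	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;

	if (!page_ops || !page_ops->iomap_valid)
		return true;
	/*
	 * No folio lock is held here, so a racing writeback + eviction
	 * may have invalidated the mapping. If so, flag the iomap
	 * stale so the iter loops back out and remaps the remainder.
	 */
	if (!page_ops->iomap_valid(iter->inode, &iter->iomap)) {
		iter->iomap.flags |= IOMAP_F_STALE;
		return false;
	}
	return true;
}

The -ENODATA case above would then only 'goto loop_continue' when
this returns true, and break out to the IOMAP_F_STALE handling
otherwise.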
Brian

>  			return status;
> +		}
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> @@ -1168,6 +1188,19 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (bytes > folio_size(folio) - offset)
>  			bytes = folio_size(folio) - offset;
>  
> +		/*
> +		 * If the folio over an unwritten extent is clean, then we
> +		 * aren't going to touch the data in it at all. We don't want to
> +		 * mark it dirty or change the uptodate state of data in the
> +		 * page, so we just unlock it and skip to the next range over
> +		 * the unwritten extent we need to check.
> +		 */
> +		if (srcmap->type == IOMAP_UNWRITTEN &&
> +		    !folio_test_dirty(folio)) {
> +			folio_unlock(folio);
> +			goto loop_continue;
> +		}
> +
>  		folio_zero_range(folio, offset, bytes);
>  		folio_mark_accessed(folio);
>  
> @@ -1175,6 +1208,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (WARN_ON_ONCE(bytes == 0))
>  			return -EIO;
>  
> +loop_continue:
>  		pos += bytes;
>  		length -= bytes;
>  		written += bytes;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 2e10e1c66ad6..d7c2f8c49f94 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -839,17 +839,7 @@ xfs_setattr_size(
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
> -	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -				newsize);
> -		if (error)
> -			return error;
> +	} else if (newsize != oldsize) {
> 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
> -- 
> 2.38.1
> 