On Wed, Oct 07, 2020 at 08:33:59AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 07, 2020 at 10:35:09AM -0400, Brian Foster wrote:
> > It is possible to expose non-zeroed post-EOF data in XFS if the new
> > EOF page is dirty, backed by an unwritten block and the truncate
> > happens to race with writeback. iomap_truncate_page() will not zero
> > the post-EOF portion of the page if the underlying block is
> > unwritten. The subsequent call to truncate_setsize() will, but
> > doesn't dirty the page. Therefore, if writeback happens to complete
> > after iomap_truncate_page() (so it still sees the unwritten block)
> > but before truncate_setsize(), the cached page becomes inconsistent
> > with the on-disk block. A mapped read after the associated page is
> > reclaimed or invalidated exposes non-zero post-EOF data.
> >
> > For example, consider the following sequence when run on a kernel
> > modified to explicitly flush the new EOF page within the race
> > window:
> >
> > $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
> > $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
> > ...
> > $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> > 00000400:  00 00 00 00 00 00 00 00  ........
> > $ umount /mnt/; mount <dev> /mnt/
> > $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> > 00000400:  cd cd cd cd cd cd cd cd  ........
> >
> > Update xfs_setattr_size() to explicitly flush the new EOF page prior
> > to the page truncate to ensure iomap has the latest state of the
> > underlying block.
> >
> > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > ---
> >
> > This patch is intentionally simplistic because I wanted to get some
> > thoughts on a proper fix and at the same time consider something easily
> > backportable. The iomap behavior seems rather odd to me in general,
> > particularly if we consider that the same kind of behavior can occur on
> > file-extending writes. It's just not a user-observable problem in that
> > case because a sub-page write of a current EOF page (backed by an
> > unwritten block) will zero-fill the rest of the page at write time
> > (before the zero range essentially skips it due to the unwritten block).
> > It's not totally clear to me whether that's an intentional design
> > characteristic of iomap or something we should address.
> >
> > It _seems_ like the more appropriate fix is that iomap truncate page
> > should at least accommodate a dirty page over an unwritten block and
> > modify the page (or perhaps just unconditionally do a buffered write on
> > a non-aligned truncate, similar to what block_truncate_page() does). For
> > example, we could push the UNWRITTEN check from iomap_zero_range_actor()
> > down into iomap_zero(), actually check for an existing page there, and
> > then either zero it or skip out if none exists. Thoughts?
>
> I haven't looked at this in much depth yet, but I agree with the
> principle that iomap ought to handle the case of unwritten extents
> fronted by dirty pagecache.
>

Ok. What I was originally thinking above turned out to be too
inefficient. However, it occurred to me that we already have an
efficient cache scanning mechanism in seek data/hole, so I think
something like the appended might be doable. Thoughts?
Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..676d8d2ae7c7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -944,6 +944,26 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
+static void
+iomap_zero_range_skip_uncached(struct inode *inode, loff_t *pos,
+		loff_t *count, loff_t *written)
+{
+	loff_t dirty_offset, bytes = 0;
+
+	dirty_offset = page_cache_seek_hole_data(inode, *pos, *count,
+			SEEK_DATA);
+	if (dirty_offset == -ENOENT)
+		bytes = *count;
+	else if (dirty_offset > *pos)
+		bytes = dirty_offset - *pos;
+
+	if (bytes) {
+		*pos += bytes;
+		*count -= bytes;
+		*written += bytes;
+	}
+}
+
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
@@ -953,12 +973,19 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 	int status;
 
 	/* already zeroed? we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return count;
 
 	do {
 		unsigned offset, bytes;
 
+		if (srcmap->type == IOMAP_UNWRITTEN) {
+			iomap_zero_range_skip_uncached(inode, &pos, &count,
+					&written);
+			if (!count)
+				break;
+		}
+
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 107ee80c3568..4f5e4eca9906 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -70,7 +70,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
  *
  * Returns the resulting offset on successs, and -ENOENT otherwise.
  */
-static loff_t
+loff_t
 page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 		int whence)
 {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..437ae0d708d6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -184,6 +184,9 @@ loff_t iomap_seek_data(struct inode *inode, loff_t offset,
 		const struct iomap_ops *ops);
 sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
 		const struct iomap_ops *ops);
+loff_t page_cache_seek_hole_data(struct inode *inode, loff_t offset,
+		loff_t length, int whence);
+
 /*
  * Structure for writeback I/O completions.