Re: [PATCH] xfs: flush new eof page on truncate to avoid post-eof corruption

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 07, 2020 at 08:33:59AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 07, 2020 at 10:35:09AM -0400, Brian Foster wrote:
> > It is possible to expose non-zeroed post-EOF data in XFS if the new
> > EOF page is dirty, backed by an unwritten block and the truncate
> > happens to race with writeback. iomap_truncate_page() will not zero
> > the post-EOF portion of the page if the underlying block is
> > unwritten. The subsequent call to truncate_setsize() will, but
> > doesn't dirty the page. Therefore, if writeback happens to complete
> > after iomap_truncate_page() (so it still sees the unwritten block)
> > but before truncate_setsize(), the cached page becomes inconsistent
> > with the on-disk block. A mapped read after the associated page is
> > reclaimed or invalidated exposes non-zero post-EOF data.
> > 
> > For example, consider the following sequence when run on a kernel
> > modified to explicitly flush the new EOF page within the race
> > window:
> > 
> > $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
> > $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
> >   ...
> > $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> > 00000400:  00 00 00 00 00 00 00 00  ........
> > $ umount /mnt/; mount <dev> /mnt/
> > $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> > 00000400:  cd cd cd cd cd cd cd cd  ........
> > 
> > Update xfs_setattr_size() to explicitly flush the new EOF page prior
> > to the page truncate to ensure iomap has the latest state of the
> > underlying block.
> > 
> > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > ---
> > 
> > This patch is intentionally simplistic because I wanted to get some
> > thoughts on a proper fix and at the same time consider something easily
> > backportable. The iomap behavior seems rather odd to me in general,
> > particularly if we consider the same kind of behavior can occur on
> > file-extending writes. It's just not a user observable problem in that
> > case because a sub-page write of a current EOF page (backed by an
> > unwritten block) will zero fill the rest of the page at write time
> > (before the zero range essentially skips it due to the unwritten block).
> > It's not totally clear to me if that's an intentional design
> > characteristic of iomap or something we should address.
> > 
> > It _seems_ like the more appropriate fix is that iomap truncate page
> > should at least accommodate a dirty page over an unwritten block and
> > modify the page (or perhaps just unconditionally do a buffered write on
> > a non-aligned truncate, similar to what block_truncate_page() does). For
> > example, we could push the UNWRITTEN check from iomap_zero_range_actor()
> > down into iomap_zero(), actually check for an existing page there, and
> > then either zero it or skip out if none exists. Thoughts?
> 
> I haven't looked at this in much depth yet, but I agree with the
> principle that iomap ought to handle the case of unwritten extents
> fronted by dirty pagecache.
> 

Ok. What I was originally thinking above turned out to be too
inefficient. However, it occurred to me that we already have an
efficient cache scanning mechanism in seek data/hole, so I think
something like the appended might be doable. Thoughts?

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..676d8d2ae7c7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -944,6 +944,26 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
+static void
+iomap_zero_range_skip_uncached(struct inode *inode, loff_t *pos,
+		loff_t *count, loff_t *written)
+{
+	unsigned dirty_offset, bytes = 0;
+
+	dirty_offset = page_cache_seek_hole_data(inode, *pos, *count,
+				SEEK_DATA);
+	if (dirty_offset == -ENOENT)
+		bytes = *count;
+	else if (dirty_offset > *pos)
+		bytes = dirty_offset - *pos;
+
+	if (bytes) {
+		*pos += bytes;
+		*count -= bytes;
+		*written += bytes;
+	}
+}
+
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
@@ -953,12 +973,19 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 	int status;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return count;
 
 	do {
 		unsigned offset, bytes;
 
+		if (srcmap->type == IOMAP_UNWRITTEN) {
+			iomap_zero_range_skip_uncached(inode, &pos, &count,
+				&written);
+			if (!count)
+				break;
+		}
+
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 107ee80c3568..4f5e4eca9906 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -70,7 +70,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
  *
  * Returns the resulting offset on successs, and -ENOENT otherwise.
  */
-static loff_t
+loff_t
 page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 		int whence)
 {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..437ae0d708d6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -184,6 +184,9 @@ loff_t iomap_seek_data(struct inode *inode, loff_t offset,
 		const struct iomap_ops *ops);
 sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
 		const struct iomap_ops *ops);
+loff_t page_cache_seek_hole_data(struct inode *inode, loff_t offset,
+		loff_t length, int whence);
+
 
 /*
  * Structure for writeback I/O completions.




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux