On Wed, Oct 21, 2020 at 09:25:47AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 21, 2020 at 09:33:29AM -0400, Brian Foster wrote:
> > iomap_truncate_page() relies on zero range and zero range
> > unconditionally skips unwritten mappings. This is normally not a
> > problem as most users synchronize in-core state to the underlying
> > block mapping by flushing pagecache prior to calling into iomap.
> > This is not the case for iomap_truncate_page(), however. XFS calls
> > iomap_truncate_page() on truncate down before flushing the new EOF
> > page of the file. This means that if the new EOF block is unwritten
> > but covered by a dirty or writeback page (i.e. awaiting unwritten
> > conversion after writeback), iomap fails to zero that page. The
> > subsequent truncate_setsize() call does perform page zeroing, but
> > doesn't dirty the page. Therefore if the new EOF page is written
> > back after calling into iomap but before the pagecache truncate,
> > the post-EOF zeroing is lost on page reclaim. This exposes stale
> > post-EOF data on mapped reads.
> >
> > Rework iomap_truncate_page() to check pagecache state before
> > calling into iomap_apply() and use that info to determine whether
> > we can safely skip zeroing unwritten extents. The filesystem has
> > locked out concurrent I/O and mapped operations at this point but
> > is not serialized against writeback, unwritten extent conversion
> > (I/O completion) or page reclaim. Therefore if a page does not
> > exist before we acquire the mapping, we can be certain that an
> > unwritten extent cannot be converted before we return and thus it
> > is safe to skip. If a page does exist over an unwritten block, it
> > could be in the dirty or writeback states, convert the underlying
> > mapping at any time, and thus should be explicitly written to
> > avoid racing with writeback. Finally, since iomap_truncate_page()
> > only targets the new EOF block and must now pass additional state
> > to the actor, open code the zeroing actor instead of plumbing
> > through zero range.
> >
> > This does have the tradeoff that an existing clean page is dirtied
> > and causes unwritten conversion, but this is analogous to
> > historical behavior implemented by block_truncate_page(). This
> > patch restores historical behavior to address the data exposure
> > problem and leaves filtering out the clean page case for a
> > separate patch.
> >
> > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > ---
> >
> > v2:
> > - Rework to check for cached page explicitly and avoid use of seek data.
> > v1: https://lore.kernel.org/linux-fsdevel/20201012140350.950064-1-bfoster@xxxxxxxxxx/
>
> Has the reproducer listed in that email been turned into an fstest case
> yet? :)
>

Heh.. that reproducer actually required customization to manufacture the
problem. I'll have to think more about a generic reproducer. FWIW, I've
also just come across a similar post-EOF data exposure failure with this
patch, so I'll need to dig into that first and foremost and figure out
whether this is still incorrect/insufficient for some reason...
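To spell out the failure mode the commit log describes, the sequence
looks roughly like this (a rough sketch of the race, not the customized
reproducer mentioned above):

  truncate down                        writeback / reclaim
  -------------                        -------------------
  iomap_truncate_page()
    new EOF block is unwritten ->
    zeroing skipped
                                       EOF page written back, extent
                                       converted to written (stale
                                       post-EOF data now on disk)
  truncate_setsize()
    zeroes the EOF page in pagecache
    but does not dirty it
                                       clean page reclaimed
  mapped read past EOF faults the
  page back in from disk -> stale data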
> > 
> >  fs/iomap/buffered-io.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 40 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index bcfc288dba3f..2cdfcff02307 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1000,17 +1000,56 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_zero_range);
> >  
> > +struct iomap_trunc_priv {
> > +	bool *did_zero;
> > +	bool has_page;
> > +};
> > +
> > +static loff_t
> > +iomap_truncate_page_actor(struct inode *inode, loff_t pos, loff_t count,
> > +		void *data, struct iomap *iomap, struct iomap *srcmap)
> > +{
> > +	struct iomap_trunc_priv *priv = data;
> > +	unsigned offset;
> > +	int status;
> > +
> > +	if (srcmap->type == IOMAP_HOLE)
> > +		return count;
> > +	if (srcmap->type == IOMAP_UNWRITTEN && !priv->has_page)
> > +		return count;
> > +
> > +	offset = offset_in_page(pos);
> > +	if (IS_DAX(inode))
> > +		status = dax_iomap_zero(pos, offset, count, iomap);
> > +	else
> > +		status = iomap_zero(inode, pos, offset, count, iomap, srcmap);
> > +	if (status < 0)
> > +		return status;
> > +
> > +	if (priv->did_zero)
> > +		*priv->did_zero = true;
> > +	return count;
> > +}
> > +
> >  int
> >  iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> >  		const struct iomap_ops *ops)
> >  {
> > +	struct iomap_trunc_priv priv = { .did_zero = did_zero };
> >  	unsigned int blocksize = i_blocksize(inode);
> >  	unsigned int off = pos & (blocksize - 1);
> > +	loff_t ret;
> >  
> >  	/* Block boundary? Nothing to do */
> >  	if (!off)
> >  		return 0;
> > -	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
> > +
> > +	priv.has_page = filemap_range_has_page(inode->i_mapping, pos, pos);
> 
> Er... shouldn't that second 'pos' be 'pos + blocksize - off - 1', like
> the apply call below? I guess it doesn't matter since we're only
> interested in the page at pos, but the double usage of pos caught my
> eye.
> 

Yeah. I'll fix that up.

> I also wonder, can you move this into the actor so that you can pass
> *did_zero straight through without the two-item struct?
> 

I don't think so because the idea was to explicitly check for page
presence prior to getting the mapping.

Brian

> --D
> 
> > +	ret = iomap_apply(inode, pos, blocksize - off, IOMAP_ZERO, ops, &priv,
> > +			iomap_truncate_page_actor);
> > +	if (ret <= 0)
> > +		return ret;
> > +	return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_truncate_page);
> > 
> > -- 
> > 2.25.4
> > 
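For reference, a minimal sketch of the caller-side ordering the commit
log depends on, loosely modeled on the XFS truncate down path. This is
not the real xfs_setattr_size() (which takes locks and handles many more
cases); the function name is made up and xfs_buffered_write_iomap_ops is
assumed as the ops:

#include <linux/fs.h>
#include <linux/iomap.h>
#include <linux/mm.h>

/* Rough sketch only -- not the actual XFS truncate down code. */
static int truncate_down_sketch(struct inode *inode, loff_t newsize)
{
	bool did_zero = false;	/* reports whether any zeroing happened */
	int error;

	/*
	 * Zero the partial new EOF block *before* the pagecache truncate.
	 * With this patch, a cached page over an unwritten EOF block is
	 * explicitly written rather than skipped, so a racing writeback
	 * can no longer leave stale post-EOF data on disk once the clean
	 * page is reclaimed.
	 */
	error = iomap_truncate_page(inode, newsize, &did_zero,
			&xfs_buffered_write_iomap_ops);
	if (error)
		return error;

	/* Zeroes/drops pagecache beyond newsize but does not dirty pages. */
	truncate_setsize(inode, newsize);
	return 0;
}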