Re: [PATCH 4/7] xfs: buffered write failure should not truncate the page cache

On Thu, Nov 03, 2022 at 08:04:33AM +1100, Dave Chinner wrote:
> On Wed, Nov 02, 2022 at 09:41:30AM -0700, Darrick J. Wong wrote:
> > On Tue, Nov 01, 2022 at 11:34:09AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > 
> > > xfs_buffered_write_iomap_end() currently invalidates the page cache
> > > over the unused range of the delalloc extent it allocated. While the
> > > write allocated the delalloc extent, it does not own it exclusively
> > > as the write does not hold any locks that prevent either writeback
> > > or mmap page faults from changing the state of either the page cache
> > > or the extent backing this range.
> > > 
> > > Whilst xfs_bmap_punch_delalloc_range() already handles races in
> > > extent conversion - it will only punch out delalloc extents and it
> > > ignores any other type of extent - the page cache truncate does not
> > > discriminate between data written by this write or some other task.
> > > As a result, truncating the page cache can result in data corruption
> > > if the write races with mmap modifications to the file over the same
> > > range.
> > > 
> > > generic/346 exercises this workload, and if we randomly fail writes
> > > (as will happen when iomap gets stale iomap detection later in the
> > > patchset), it will randomly corrupt the file data because it removes
> > > data written by mmap() in the same page as the write() that failed.
> > > 
> > > Hence we do not want to punch out the page cache over the range of
> > > the extent we failed to write to - what we actually need to do is
> > > detect the ranges that have dirty data in cache over them and *not
> > > punch them out*.
> > 
> > Same dumb question as hch -- why do we need to punch out the nondirty
> > pagecache after a failed write?  If the folios are uptodate then we're
> > evicting cache unnecessarily, and if they're !uptodate can't we let
> > reclaim do the dirty work for us?
> 
> Sorry, we don't punch out the page cache - this was badly worded. I
> meant:
> 
> "[...] - what we actually need to do is
> detect the ranges that have dirty data in cache over the delalloc
> extent and retain those regions of the delalloc extent."
> 
> i.e. we never punch the page cache anymore, and we only selectively
> punch the parts of the delalloc extent that back the clean regions
> of the write range...

Ah ok.  I was thinking there was a discrepancy between the description
in the commit message and the code, but then I zoomed out and asked
myself why dump the pagecache at all...
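
For context, the cleanup path being discussed looks roughly like this
(condensed from the pre-series xfs_buffered_write_iomap_end(); exact
details vary by kernel version):

	if (start_fsb < end_fsb && (iomap->flags & IOMAP_F_NEW)) {
		/*
		 * This truncate is the problem: it cannot tell data
		 * written by this write() from data dirtied by a
		 * racing mmap() store into the same pages.
		 */
		truncate_pagecache_range(VFS_I(ip),
				XFS_FSB_TO_B(mp, start_fsb),
				XFS_FSB_TO_B(mp, end_fsb) - 1);

		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
				end_fsb - start_fsb);
	}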

> > I don't know if there are hysterical raisins for this or if the goal is
> > to undo memory consumption after a write failure?  If we're stale-ing
> > the write because the iomapping changed, why not leave the folio where
> > it is, refresh the iomapping, and come back to (possibly?) the same
> > folio?
> 
> I can't say for certain - I haven't gone and looked at the history.
> I suspect it goes back to the days where write() could write zeroes
> into the page cache for eof zeroing or zeroing for file extension
> before the write() started writing data. Maybe that was part of what
> it was trying to undo?

Yeah, I dunno, but it certainly looks like it might be an attempt to
undo the effects of posteof zeroing if the write fails.  Back in 2.6.36
days, commit 155130a4f784 ("get rid of block_write_begin_newtrunc")
changed xfs_vm_write_begin to truncate any posteof areas after a failed
write.  This was part of some sort of "new truncate sequence" that
itself got fixed and patched various times after that.

In 4.8, commit 68a9f5e7007c ("xfs: implement iomap based buffered write
path") changed this truncation to cover the unprocessed part of a short
write, even if it wasn't posteof.  The commit says it's part of a broad
rewrite, but doesn't mention anything about that particular change.

/me shrugs
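
To illustrate the direction Dave describes above, here's a minimal
sketch of the "retain the dirty ranges" idea: walk the page cache over
the delalloc extent and punch out only the clean subranges. The helper
name is hypothetical, the xfs_bmap_punch_delalloc_range() calling
convention is the pre-series (FSB-based) one, and real code would need
folio locking to avoid racing with writeback and reclaim:

static void
xfs_punch_clean_delalloc(
	struct xfs_inode	*ip,
	loff_t			start_byte,
	loff_t			end_byte)
{
	struct xfs_mount	*mp = ip->i_mount;
	struct address_space	*mapping = VFS_I(ip)->i_mapping;
	loff_t			punch_start = start_byte;
	loff_t			pos = start_byte;

	while (pos < end_byte) {
		struct folio	*folio;

		/* unlocked lookup - a sketch only, real code must lock */
		folio = filemap_get_folio(mapping, pos >> PAGE_SHIFT);
		if (!folio) {
			pos += PAGE_SIZE;
			continue;
		}
		if (folio_test_dirty(folio) || folio_test_writeback(folio)) {
			/* punch the clean range before this folio... */
			if (punch_start < folio_pos(folio))
				xfs_bmap_punch_delalloc_range(ip,
					XFS_B_TO_FSBT(mp, punch_start),
					XFS_B_TO_FSB(mp, folio_pos(folio)) -
					XFS_B_TO_FSBT(mp, punch_start));
			/* ...and keep the blocks backing the dirty data */
			punch_start = folio_pos(folio) + folio_size(folio);
		}
		pos = folio_pos(folio) + folio_size(folio);
		folio_put(folio);
	}
	if (punch_start < end_byte)
		xfs_bmap_punch_delalloc_range(ip,
			XFS_B_TO_FSBT(mp, punch_start),
			XFS_B_TO_FSB(mp, end_byte) -
			XFS_B_TO_FSBT(mp, punch_start));
}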

--D

> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx


