On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
...
> > >
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 141 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> > >  			end_fsb - start_fsb);
> > >  }
> > >
...
> > > +/*
> > > + * Punch out all the delalloc blocks in the range given except for those that
> > > + * have dirty data still pending in the page cache - those are going to be
> > > + * written and so must still retain the delalloc backing for writeback.
> > > + *
> > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > + * byte range based iteration is especially convenient because it means we don't
> > > + * have to care about variable size folios, nor where the start or end of the
> > > + * data range lies within a folio, if they lie within the same folio or even if
> > > + * there are multiple discontiguous data ranges within the folio.
> > > + */
> > > +static int
> > > +xfs_buffered_write_delalloc_release(
> > > +	struct inode		*inode,
> > > +	loff_t			start_byte,
> > > +	loff_t			end_byte)
> > > +{
> > > +	loff_t			punch_start_byte = start_byte;
> > > +	int			error = 0;
> > > +
> > > +	/*
> > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > +	 * leave dirty pages with no space reservation in the cache.
> > > +	 */
> > > +	filemap_invalidate_lock(inode->i_mapping);
> > > +	while (start_byte < end_byte) {
> > > +		loff_t		data_end;
> > > +
> > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > +				start_byte, end_byte, SEEK_DATA);
> >
> > FWIW, the fact that mapping seek data is based on uptodate status means that seek behavior can change based on prior reads.
>
> Yup. It should be obvious that any page cache scan based algorithm will change based on changing page cache residency.
>
> > For example, see how seek hole/data presents reads of unwritten ranges as data [1]. The same thing isn't observable for holes because iomap doesn't check the mapping in that case, but underlying iop state is the same and that is what this code is looking at.
>
> Well, yes. That's the fundamental, underlying issue that this patchset is addressing for the write() operation: that the page cache contents and the underlying filesystem extent map are not guaranteed to be coherent and can be changed independently of each other.
>
> The whole problem with looking exclusively at filesystem level extent state (and hence FIEMAP) is that the extent state doesn't tell us whether there is uncommitted data over the range of the extent in the page cache. The filesystem extent state and page cache data *can't be coherent* in a writeback caching environment. This is the fundamental difference between what the filesystem extent map tells us (FIEMAP) and what querying the page cache tells us (SEEK_DATA/SEEK_HOLE).
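As an aside for anyone following along at home: the FIEMAP vs. SEEK_DATA/SEEK_HOLE distinction is easy to poke at from userspace. Something like the toy walker below (purely my own illustration, nothing from this series) prints whatever data ranges SEEK_DATA/SEEK_HOLE report, which reflect cached data as well as committed extent state, whereas FIEMAP only ever tells you about the extent map:

/* Toy SEEK_DATA/SEEK_HOLE walker - illustration only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	off_t pos = 0, data, hole;
	struct stat st;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[1]);
		return 1;
	}

	while (pos < st.st_size) {
		/* Where does the next run of data start... */
		data = lseek(fd, pos, SEEK_DATA);
		if (data < 0)
			break;			/* ENXIO: no more data */
		/* ...and where does that run of data end? */
		hole = lseek(fd, data, SEEK_HOLE);
		printf("data: [%lld, %lld)\n", (long long)data,
				(long long)hole);
		pos = hole;
	}
	close(fd);
	return 0;
}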
> This is also the underlying problem with iomap_truncate_page() - it fails to query the page cache for data over unwritten extents, so fails to zero the post-EOF part of dirty folios over unwritten extents and so it all goes wrong...
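To make sure I understand the iomap_truncate_page() complaint: is the gist that the zeroing path needs to do something along these lines before it trusts the unwritten extent state? This is only a sketch of how I read "query the page cache for data", with filemap_range_needs_writeback() standing in for whatever dirty-data check we'd actually want; it is not code from this series, and the srcmap/pos/length names are just placeholders for whatever the zeroing iterator has in hand:

	/*
	 * Hypothetical guard in the zeroing path, not from this patchset:
	 * only take the "unwritten extents read back as zeroes" shortcut
	 * when there is no dirty or writeback data cached over the range
	 * being zeroed; otherwise fall through and zero the cached folios.
	 */
	if (srcmap->type == IOMAP_UNWRITTEN &&
	    !filemap_range_needs_writeback(inode->i_mapping, pos,
					   pos + length - 1))
		return length;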
> > The filtering being done here means we essentially only care about dirty pages backed by delalloc blocks. That means if you get here with a dirty page and the portion of the page affected by this failed write is uptodate, this won't punch an underlying delalloc block even though nothing else may have written to it in the meantime.
>
> Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap will not span ranges that have pre-existing *dirty* data in the page cache. Those *must* already have (del)allocated extents, hence the iomap for the newly allocated delalloc extent will always end before pre-existing dirty data in the page cache starts.
>
> Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map precludes stepping into ranges that have pre-existing cached dirty data.
>
> We also can't get a racing write() to the same range right now because this is all under IOLOCK_EXCL, hence we only ever see dirty folios as a result of a race with page faults. Page faults zero the entire folio they insert into the page cache and iomap_folio_mkwrite_iter() asserts that the entire folio is marked up to date. Hence if we find a dirty folio outside the range the write() dirtied, we are guaranteed that the entire dirty folio is up to date....
>
> Yes, there can be pre-existing *clean* folios (and clean partially up to date folios) in the page cache, but we won't have dirty partially up to date pages in the middle of the range we are scanning. Hence we only need to care about the edge cases (folios that overlap the start and end). We skip the partially written start block, and we always punch up to the end block if it is different from the last block we punched up to. If the end of the data spans into a dirty folio, we know that dirty range is up to date because the seek scan only returns ranges that are up to date. Hence we don't punch those partial blocks out....
>
> Regardless, let's assume we have a racing write that has partially updated and dirtied a folio (because we've moved to XFS_IOLOCK_SHARED locking for writes). This case is already handled by the mapping_seek_hole_data() based iteration.
>
> That is, the mapping_seek_hole_data() loop provides us with *discrete ranges of up to date data* that are independent of folio size, up-to-date range granularity, dirty range tracking, filesystem block size, etc.
>
> Hence if the next discrete range we discover is in the same dirty folio as the previous discrete range of up to date data, we know we have a sub-folio sized hole in the data that is not up to date. Because there is no data over this range, we have to punch out the underlying delalloc extent over that range.
>
> IOWs, the dirty state of the folio and/or the granularity of the dirty range tracking is irrelevant here - we know there was no data in the cache (clean or dirty) over this range because it is discontiguous with the previous range of data returned.
>
> IOWs, if we have this "up to date" map on a dirty folio like this:
>
> Data          +-------+UUUUUUU+-------+UUUUUUU+-------+
> Extent map    +DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
>
> Then the unrolled iteration and punching we do would look like this:
>
> First iteration of the range:
>
> punch_start:
>               V
>               +-------+UUUUUUU+-------+UUUUUUU+-------+
>
> SEEK_DATA:            V
>               +-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:                    ^
> Data range:           +UUUUUUU+
> Punch range:  +-------+
> Extent map:   +-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
>
> Second iteration:
>
> punch_start                   V
>               +-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA:                            V
>               +-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:                                    ^
> Data range:                           +UUUUUUU+
> Punch range:                  +-------+
> Extent map:   +-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
>
> Third iteration:
>
> punch_start                                   V
>               +-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA:    - moves into next folio in cache
>               ....
> Punch range:                                  +-------+ ......
> Extent map:   +-------+DDDDDDD+-------+DDDDDDD+-------+ ......
>               (to end of scan range or start of next data)
>
> As you can see, this scan does not care about folio size, sub-folio range granularity or filesystem block sizes. It also matches exactly how writeback handles dirty, partially up to date folios, so there are no stray delalloc blocks left around to be tripped over after failed or short writes occur.

I had been wondering about the particular case of dealing with partially uptodate folios, so it was really helpful to watch this all in ASCIIVision(tm).
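Just to double check that I'm reading the iteration the same way, this is roughly the shape I have in my head, with the per-folio "does this range have dirty data over delalloc" decision elided, since that's exactly the part that would change with sub-folio dirty tracking. A sketch of my reading only, not the patch itself, and I'm assuming the byte-range form of the xfs_buffered_write_delalloc_punch() helper touched at the top of the diff:

	/*
	 * Sketch of the outer scan as I read it; not the actual patch code.
	 * punch_start_byte trails the scan, and everything between it and
	 * the start of the next range that must keep its delalloc backing
	 * gets punched out.
	 */
	filemap_invalidate_lock(inode->i_mapping);
	while (start_byte < end_byte) {
		loff_t		data_end;

		/* Skip forward to the next range of cached data... */
		start_byte = mapping_seek_hole_data(inode->i_mapping,
				start_byte, end_byte, SEEK_DATA);
		if (start_byte == -ENXIO || start_byte == end_byte)
			break;

		/* ...and find where that range of data ends. */
		data_end = mapping_seek_hole_data(inode->i_mapping,
				start_byte, end_byte, SEEK_HOLE);

		/*
		 * Decide what happens to [start_byte, data_end): ranges
		 * backed by dirty folios keep their delalloc blocks and
		 * advance punch_start_byte past themselves; everything
		 * else is left behind punch_start_byte to be punched.
		 * That per-folio walk is the part elided here.
		 */

		start_byte = data_end;
	}

	/* Punch out whatever is left beyond the last retained range. */
	if (!error && punch_start_byte < end_byte)
		error = xfs_buffered_write_delalloc_punch(inode,
				punch_start_byte, end_byte);
	filemap_invalidate_unlock(inode->i_mapping);

If that matches your mental model, then I agree that the dirty-range seek variant you mention below would just replace the elided part without disturbing the outer loop.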
> Indeed, if we move to sub-folio dirty range tracking, we can simply add a mapping_seek_hole_data() variant that walks dirty ranges in the page cache rather than up to date ranges. Then we can remove the inner loop from this code that looks up folios to determine dirty state. The above algorithm does not change - we just walk from discrete range to discrete range punching the gaps between them....
>
> IOWs, the algorithm is largely future proof - the only thing that needs to change if we change iomap to track sub-folio dirty ranges is how we check the data range for being dirty. That should be no surprise, really; the surprise should be that we can make some simple mods to page cache seek to remove the need for checking dirty state in this code altogether....
>
> > That sort of state can be created by a prior read of the range on a sub-page block size fs, or perhaps a racing async readahead (via read fault of a lower offset..?), etc.
>
> Yup, generic/346 exercises this racing unaligned, sub-folio mmap write vs write() case. This test, specifically, was the reason I moved to using mapping_seek_hole_data() - g/346 found an endless stream of bugs in the sub-multi-page-folio range iteration code I kept trying to write....
>
> > I suspect this is not a serious error because the page is dirty and writeback will thus convert the block. The only exception to that I can see is if the block is beyond EOF (consider a mapped read to a page that straddles EOF, followed by a post-eof write that fails), writeback won't actually map the block directly.
>
> I don't think that can happen. iomap_write_failed() calls truncate_pagecache_range() to remove any newly instantiated cached data beyond the original EOF. Hence the delalloc punch will remove everything beyond the original EOF that was allocated for the failed write. Hence when we get to writeback we're not going to find any up-to-date data beyond the EOF block in the page cache, nor any stray delalloc blocks way beyond EOF....
>
> > It may convert if contiguous with delalloc blocks inside EOF (and sufficiently sized physical extents exist), or even if not, should still otherwise be cleaned up by the various other means we already have to manage post-eof blocks.
> >
> > So IOW there's a tradeoff being made here for possible spurious allocation and I/O and a subtle dependency on writeback that should probably be documented somewhere.
>
> As per above, I don't think there is any spurious/stale allocation left behind by the punch code, nor is there any dependency on writeback to ignore such issues.
>
> > The larger concern is that if writeback eventually changes based on dirty range tracking in a way that breaks this dependency, that introduces yet another stale delalloc block landmine associated with this error handling code (regardless of whether you want to call that a bug in this code, seek data, whatever), and those problems are difficult enough to root cause as it is.
>
> If iomap changes how it tracks dirty ranges, this punch code only needs small changes to work with that correctly. There aren't any unknown landmines here - if we change dirty tracking, we know that we have to update the code that depends on the existing dirty tracking mechanisms to work correctly with the new infrastructure...

I hope that anyone trying to change the iomap dirty tracking code will find it easier to do with the iteration code in buffered-io.c.  Thanks for taking the time to rework that part for v3.

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx