Dave Chinner wrote: > On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote: > > In preparation for moving DAX pages to be 0-based rather than 1-based > > for the idle refcount, the fsdax core wants to have all mappings in a > > "zapped" state before truncate. For typical pages this happens naturally > > via unmap_mapping_range(), for DAX pages some help is needed to record > > this state in the 'struct address_space' of the inode(s) where the page > > is mapped. > > > > That "zapped" state is recorded in DAX entries as a side effect of > > xfs_break_layouts(). Arrange for it to be called before all truncation > > events which already happens for truncate() and PUNCH_HOLE, but not > > truncate_inode_pages_final(). Arrange for xfs_break_layouts() before > > truncate_inode_pages_final(). > > Ugh. That's nasty and awful. > > > > > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> > > Cc: Jan Kara <jack@xxxxxxx> > > Cc: "Darrick J. Wong" <djwong@xxxxxxxxxx> > > Cc: Jason Gunthorpe <jgg@xxxxxxxxxx> > > Cc: Christoph Hellwig <hch@xxxxxx> > > Cc: John Hubbard <jhubbard@xxxxxxxxxx> > > Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> > > --- > > fs/xfs/xfs_file.c | 13 +++++++++---- > > fs/xfs/xfs_inode.c | 3 ++- > > fs/xfs/xfs_inode.h | 6 ++++-- > > fs/xfs/xfs_super.c | 22 ++++++++++++++++++++++ > > 4 files changed, 37 insertions(+), 7 deletions(-) > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > index 556e28d06788..d3ff692d5546 100644 > > --- a/fs/xfs/xfs_file.c > > +++ b/fs/xfs/xfs_file.c > > @@ -816,7 +816,8 @@ xfs_wait_dax_page( > > int > > xfs_break_dax_layouts( > > struct inode *inode, > > - bool *retry) > > + bool *retry, > > + int state) > > { > > struct page *page; > > > > @@ -827,8 +828,8 @@ xfs_break_dax_layouts( > > return 0; > > > > *retry = true; > > - return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE, > > - 0, 0, xfs_wait_dax_page(inode)); > > + return ___wait_var_event(page, dax_page_idle(page), state, 0, 0, > > + xfs_wait_dax_page(inode)); > > } > > > > int > > @@ -839,14 +840,18 @@ xfs_break_layouts( > > { > > bool retry; > > int error; > > + int state = TASK_INTERRUPTIBLE; > > > > ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)); > > > > do { > > retry = false; > > switch (reason) { > > + case BREAK_UNMAP_FINAL: > > + state = TASK_UNINTERRUPTIBLE; > > + fallthrough; > > case BREAK_UNMAP: > > - error = xfs_break_dax_layouts(inode, &retry); > > + error = xfs_break_dax_layouts(inode, &retry, state); > > if (error || retry) > > break; > > fallthrough; > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > index 28493c8e9bb2..72ce1cb72736 100644 > > --- a/fs/xfs/xfs_inode.c > > +++ b/fs/xfs/xfs_inode.c > > @@ -3452,6 +3452,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout( > > struct xfs_inode *ip1, > > struct xfs_inode *ip2) > > { > > + int state = TASK_INTERRUPTIBLE; > > int error; > > bool retry; > > struct page *page; > > @@ -3463,7 +3464,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout( > > retry = false; > > /* Lock the first inode */ > > xfs_ilock(ip1, XFS_MMAPLOCK_EXCL); > > - error = xfs_break_dax_layouts(VFS_I(ip1), &retry); > > + error = xfs_break_dax_layouts(VFS_I(ip1), &retry, state); > > if (error || retry) { > > xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL); > > if (error == 0 && retry) > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > index fa780f08dc89..e4994eb6e521 100644 > > --- a/fs/xfs/xfs_inode.h > > +++ b/fs/xfs/xfs_inode.h > > @@ -454,11 +454,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip) > > * layout-holder has a consistent view of the file's extent map. While > > * BREAK_WRITE breaks can be satisfied by recalling FL_LAYOUT leases, > > * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to > > - * go idle. > > + * go idle. BREAK_UNMAP_FINAL is an uninterruptible version of > > + * BREAK_UNMAP. > > */ > > enum layout_break_reason { > > BREAK_WRITE, > > BREAK_UNMAP, > > + BREAK_UNMAP_FINAL, > > }; > > > > /* > > @@ -531,7 +533,7 @@ xfs_itruncate_extents( > > } > > > > /* from xfs_file.c */ > > -int xfs_break_dax_layouts(struct inode *inode, bool *retry); > > +int xfs_break_dax_layouts(struct inode *inode, bool *retry, int state); > > int xfs_break_layouts(struct inode *inode, uint *iolock, > > enum layout_break_reason reason); > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > > index 9ac59814bbb6..ebb4a6eba3fc 100644 > > --- a/fs/xfs/xfs_super.c > > +++ b/fs/xfs/xfs_super.c > > @@ -725,6 +725,27 @@ xfs_fs_drop_inode( > > return generic_drop_inode(inode); > > } > > > > +STATIC void > > +xfs_fs_evict_inode( > > + struct inode *inode) > > +{ > > + struct xfs_inode *ip = XFS_I(inode); > > + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; > > + long error; > > + > > + xfs_ilock(ip, iolock); > > I'm guessing you never ran this through lockdep. I always run with lockdep enabled in my development kernels, but maybe my testing was insufficient? Somewhat moot with your concerns below... > The general rule is that XFS should not take inode locks directly in > the inode eviction path because lockdep tends to throw all manner of > memory reclaim related false positives when we do this. We most > definitely don't want to be doing this for anything other than > regular files that are DAX enabled, yes? Guilty. I sought to satisfy the locking expectations of the break_layouts internals rather than drop the unnecessary locking. > > We also don't want to arbitrarily block memory reclaim for long > periods of time waiting on an inode lock. People seem to get very > upset when we introduce unbound latencies into the memory reclaim > path... > > Indeed, what are you actually trying to serialise against here? > Nothing should have a reference to the inode, nor should anything be > able to find and take a new reference to the inode while it is being > evicted.... Ok. > > > + error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP_FINAL); > > + > > + /* The final layout break is uninterruptible */ > > + ASSERT_ALWAYS(!error); > > We don't do error handling with BUG(). If xfs_break_layouts() truly > can't fail (what happens if the fs is shut down and some internal > call path now detects that and returns -EFSCORRUPTED?), theni > WARN_ON_ONCE() and continuing to tear down the inode so the system > is not immediately compromised is the appropriate action here. > > > + > > + truncate_inode_pages_final(&inode->i_data); > > + clear_inode(inode); > > + > > + xfs_iunlock(ip, iolock); > > +} > > That all said, this really looks like a bit of a band-aid. It definitely is since DAX is in this transitory state between doing some activities page-less and others with page metadata. If DAX was fully committed to behaving like a typical page then unmap_mapping_range() would have already satisfied this reference counting situation. > I can't work out why would we we ever have an actual layout lease > here that needs breaking given they are file based and active files > hold a reference to the inode. If we ever break that, then I suspect > this change will cause major problems for anyone using pNFS with XFS > as xfs_break_layouts() can end up waiting for NFS delegation > revocation. This is something we should never be doing in inode > eviction/memory reclaim. > > Hence I have to ask why this lease break is being done > unconditionally for all inodes, instead of only calling > xfs_break_dax_layouts() directly on DAX enabled regular files? I > also wonder what exciting new system deadlocks this will create > because BREAK_UNMAP_FINAL can essentially block forever waiting on > dax mappings going away. If that DAX mapping reclaim requires memory > allocations..... There should be no memory allocations in the DAX mapping reclaim path. Also, the page pins it waits for are precluded from being GUP_LONGTERM. > > /me looks deeper into the dax_layout_busy_page() stuff and realises > that both ext4 and XFS implementations of ext4_break_layouts() and > xfs_break_dax_layouts() are actually identical. > > That is, filemap_invalidate_unlock() and xfs_iunlock(ip, > XFS_MMAPLOCK_EXCL) operate on exactly the same > inode->i_mapping->invalidate_lock. Hence the implementations in ext4 > and XFS are both functionally identical. I assume you mean for the purposes of this "final" break since xfs_file_allocate() holds XFS_IOLOCK_EXCL over xfs_break_layouts(). > Further, when the inode is > in the eviction path there is no reason for needing to take that > mapping->invalidation_lock to invalidate remaining stale DAX > mappings before truncate blasts them away. > > IOWs, I don't see why fixing this problem needs to add new code to > XFS or ext4 at all. The DAX mapping invalidation and waiting can be > done enitrely within truncate_inode_pages_final() (conditional on > IS_DAX()) after mapping_set_exiting() has been set with generic code > and it should not require locking at all. I also think that > ext4_break_layouts() and xfs_break_dax_layouts() should be merged > into a generic dax infrastructure function so the filesystems don't > need to care about the internal details of DAX mappings at all... Yes, I think I can make that happen. Thanks Dave.