On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote: > In preparation for moving DAX pages to be 0-based rather than 1-based > for the idle refcount, the fsdax core wants to have all mappings in a > "zapped" state before truncate. For typical pages this happens naturally > via unmap_mapping_range(), for DAX pages some help is needed to record > this state in the 'struct address_space' of the inode(s) where the page > is mapped. > > That "zapped" state is recorded in DAX entries as a side effect of > xfs_break_layouts(). Arrange for it to be called before all truncation > events which already happens for truncate() and PUNCH_HOLE, but not > truncate_inode_pages_final(). Arrange for xfs_break_layouts() before > truncate_inode_pages_final(). Ugh. That's nasty and awful. > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> > Cc: Jan Kara <jack@xxxxxxx> > Cc: "Darrick J. Wong" <djwong@xxxxxxxxxx> > Cc: Jason Gunthorpe <jgg@xxxxxxxxxx> > Cc: Christoph Hellwig <hch@xxxxxx> > Cc: John Hubbard <jhubbard@xxxxxxxxxx> > Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> > --- > fs/xfs/xfs_file.c | 13 +++++++++---- > fs/xfs/xfs_inode.c | 3 ++- > fs/xfs/xfs_inode.h | 6 ++++-- > fs/xfs/xfs_super.c | 22 ++++++++++++++++++++++ > 4 files changed, 37 insertions(+), 7 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 556e28d06788..d3ff692d5546 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -816,7 +816,8 @@ xfs_wait_dax_page( > int > xfs_break_dax_layouts( > struct inode *inode, > - bool *retry) > + bool *retry, > + int state) > { > struct page *page; > > @@ -827,8 +828,8 @@ xfs_break_dax_layouts( > return 0; > > *retry = true; > - return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE, > - 0, 0, xfs_wait_dax_page(inode)); > + return ___wait_var_event(page, dax_page_idle(page), state, 0, 0, > + xfs_wait_dax_page(inode)); > } > > int > @@ -839,14 +840,18 @@ xfs_break_layouts( > { > bool retry; > int error; > + int state = TASK_INTERRUPTIBLE; > > ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)); > > do { > retry = false; > switch (reason) { > + case BREAK_UNMAP_FINAL: > + state = TASK_UNINTERRUPTIBLE; > + fallthrough; > case BREAK_UNMAP: > - error = xfs_break_dax_layouts(inode, &retry); > + error = xfs_break_dax_layouts(inode, &retry, state); > if (error || retry) > break; > fallthrough; > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 28493c8e9bb2..72ce1cb72736 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -3452,6 +3452,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout( > struct xfs_inode *ip1, > struct xfs_inode *ip2) > { > + int state = TASK_INTERRUPTIBLE; > int error; > bool retry; > struct page *page; > @@ -3463,7 +3464,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout( > retry = false; > /* Lock the first inode */ > xfs_ilock(ip1, XFS_MMAPLOCK_EXCL); > - error = xfs_break_dax_layouts(VFS_I(ip1), &retry); > + error = xfs_break_dax_layouts(VFS_I(ip1), &retry, state); > if (error || retry) { > xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL); > if (error == 0 && retry) > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index fa780f08dc89..e4994eb6e521 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -454,11 +454,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip) > * layout-holder has a consistent view of the file's extent map. While > * BREAK_WRITE breaks can be satisfied by recalling FL_LAYOUT leases, > * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to > - * go idle. > + * go idle. BREAK_UNMAP_FINAL is an uninterruptible version of > + * BREAK_UNMAP. > */ > enum layout_break_reason { > BREAK_WRITE, > BREAK_UNMAP, > + BREAK_UNMAP_FINAL, > }; > > /* > @@ -531,7 +533,7 @@ xfs_itruncate_extents( > } > > /* from xfs_file.c */ > -int xfs_break_dax_layouts(struct inode *inode, bool *retry); > +int xfs_break_dax_layouts(struct inode *inode, bool *retry, int state); > int xfs_break_layouts(struct inode *inode, uint *iolock, > enum layout_break_reason reason); > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 9ac59814bbb6..ebb4a6eba3fc 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -725,6 +725,27 @@ xfs_fs_drop_inode( > return generic_drop_inode(inode); > } > > +STATIC void > +xfs_fs_evict_inode( > + struct inode *inode) > +{ > + struct xfs_inode *ip = XFS_I(inode); > + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; > + long error; > + > + xfs_ilock(ip, iolock); I'm guessing you never ran this through lockdep. The general rule is that XFS should not take inode locks directly in the inode eviction path because lockdep tends to throw all manner of memory reclaim related false positives when we do this. We most definitely don't want to be doing this for anything other than regular files that are DAX enabled, yes? We also don't want to arbitrarily block memory reclaim for long periods of time waiting on an inode lock. People seem to get very upset when we introduce unbound latencies into the memory reclaim path... Indeed, what are you actually trying to serialise against here? Nothing should have a reference to the inode, nor should anything be able to find and take a new reference to the inode while it is being evicted.... > + error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP_FINAL); > + > + /* The final layout break is uninterruptible */ > + ASSERT_ALWAYS(!error); We don't do error handling with BUG(). If xfs_break_layouts() truly can't fail (what happens if the fs is shut down and some internal call path now detects that and returns -EFSCORRUPTED?), theni WARN_ON_ONCE() and continuing to tear down the inode so the system is not immediately compromised is the appropriate action here. > + > + truncate_inode_pages_final(&inode->i_data); > + clear_inode(inode); > + > + xfs_iunlock(ip, iolock); > +} That all said, this really looks like a bit of a band-aid. I can't work out why would we we ever have an actual layout lease here that needs breaking given they are file based and active files hold a reference to the inode. If we ever break that, then I suspect this change will cause major problems for anyone using pNFS with XFS as xfs_break_layouts() can end up waiting for NFS delegation revocation. This is something we should never be doing in inode eviction/memory reclaim. Hence I have to ask why this lease break is being done unconditionally for all inodes, instead of only calling xfs_break_dax_layouts() directly on DAX enabled regular files? I also wonder what exciting new system deadlocks this will create because BREAK_UNMAP_FINAL can essentially block forever waiting on dax mappings going away. If that DAX mapping reclaim requires memory allocations..... /me looks deeper into the dax_layout_busy_page() stuff and realises that both ext4 and XFS implementations of ext4_break_layouts() and xfs_break_dax_layouts() are actually identical. That is, filemap_invalidate_unlock() and xfs_iunlock(ip, XFS_MMAPLOCK_EXCL) operate on exactly the same inode->i_mapping->invalidate_lock. Hence the implementations in ext4 and XFS are both functionally identical. Further, when the inode is in the eviction path there is no reason for needing to take that mapping->invalidation_lock to invalidate remaining stale DAX mappings before truncate blasts them away. IOWs, I don't see why fixing this problem needs to add new code to XFS or ext4 at all. The DAX mapping invalidation and waiting can be done enitrely within truncate_inode_pages_final() (conditional on IS_DAX()) after mapping_set_exiting() has been set with generic code and it should not require locking at all. I also think that ext4_break_layouts() and xfs_break_dax_layouts() should be merged into a generic dax infrastructure function so the filesystems don't need to care about the internal details of DAX mappings at all... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx