On Fri, Jan 14, 2022 at 01:30:43PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 14, 2022 at 02:45:10PM -0500, Brian Foster wrote:
> > On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > > >  STATIC void
> > > > >  xfs_fs_destroy_inode(
> > > > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > > > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > > >  	 */
> > > > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > > +		struct xfs_icwalk	icw = {0};
> > > > > +
> > > > > +		/*
> > > > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > > > +		 * doesn't leave them sitting in the inactivation queue where
> > > > > +		 * they cannot be processed.
> > > > > +		 */
> > > > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > > > +		xfs_blockgc_free_space(mp, &icw);
> > > >
> > > > Is a SYNC walk safe to run here? I know we run
> > > > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > > > SB_FREEZE_WRITE protection, but here we have both frozen writes and
> > > > page faults, so we're running in a much more constrained freeze
> > > > context here.
> > > >
> > > > i.e. the SYNC walk will keep busy looping if it can't get the
> > > > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > > > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > > > for whatever reason this will never return....
> > >
> > > Are you referring to the case where one could be read()ing from a file
> > > into a buffer that's really a mmap'd page from another file while the
> > > underlying fs is being frozen?
> > >
> > I thought this was generally safe as freeze protection sits outside of
> > the locks, but I'm not terribly sure about the read to a mapped buffer
> > case. If that allows an iolock holder to block on a pagefault due to
> > freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
> > scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
> > target buffer)..?
>
> I think so. xfs_file_buffered_read takes IOLOCK_SHARED and calls
> filemap_read, which calls copy_page_to_iter. I /think/ the messy iovec
> code calls copyout, which can then hit a write page fault, which takes
> us to __xfs_filemap_fault. That takes SB_PAGEFAULT, which is the
> opposite order of what now goes on during a freeze.

Yeah, so this was the sort of thing I was concerned about. I couldn't
put my finger on an actual case, but the general locking situation felt
wrong, which is why I asked the question. That is, the normal case for
a -write- operation is:

	Prevent write freeze (shared lock!)
	  take IO lock (shared or exclusive)
	    <page fault>
	      Prevent fault freeze (shared)
	        take MMAP lock
	        do allocation
	          Prevent internal freeze (shared)
	            run transaction

Which means this stack cannot happen unless a freeze is completely
locked out at the first level, and the freeze has to wait for the write
operation to complete before it can proceed.

So for modification operations, doing:

	lock out write freeze (excl lock)
	  lock out fault freeze (excl lock)
	    take IOLOCK excl

would be safe.

However, in the case of a read operation, we don't have that initial
FREEZE_WRITE protection, so the first level an operation might hit is
the page fault freeze protection.
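Written out as code, that read path looks something like this (purely
schematic - the helpers named here are the real VFS/XFS ones, but the
function itself is made up to show the nesting, not actual kernel
code):

	/*
	 * Schematic of a buffered read() whose destination buffer is an
	 * mmap'd page of another file, as described above. Only the
	 * freeze protection / lock nesting matters here; error handling
	 * and the actual data copy are omitted.
	 */
	static void
	read_into_mapped_buffer(
		struct xfs_inode	*ip,
		struct super_block	*sb)
	{
		/* xfs_file_buffered_read(): no sb_start_write() at this level */
		xfs_ilock(ip, XFS_IOLOCK_SHARED);

		/*
		 * filemap_read() -> copy_page_to_iter() -> copyout() takes a
		 * write fault on the destination page, which lands in
		 * __xfs_filemap_fault():
		 */
		sb_start_pagefault(sb);		/* SB_FREEZE_PAGEFAULT, shared */

		/* ... fault is serviced ... */

		sb_end_pagefault(sb);
		xfs_iunlock(ip, XFS_IOLOCK_SHARED);
	}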
Hence we now get an inversion on the write freeze/IOLOCK/fault freeze
ordering, because we take the IOLOCK and fault freeze protection
without any write freeze protection, so this can happen:

task 1					freeze
					freeze write
.....
					freeze page fault
read()
IOLOCK_SHARED
<page fault>
sb_start_pagefault()
<blocks on fault freeze>
					xfs_blockgc_free_space(SYNC)
					try IOLOCK_EXCL
					<busy loops on failed IOLOCK_EXCL>

So the sync walk here doesn't seem safe. A non-blocking, best-effort
walk would be fine (rough sketch of what I mean appended below), but
busy-looping on inodes we can't lock looks to be a deadlock vector.

ISTR that we recently went through a similar readdir vs page fault
locking discussion, so it's not just read() file IO that could trigger
the page fault freeze first with an unprotected IOLOCK already held.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
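The best-effort variant referred to above would look something like the
following - completely untested, it's just the hunk from the patch with
icw_flags left at zero instead of XFS_ICWALK_FLAG_SYNC, so the blockgc
walk trylocks the IOLOCK and skips inodes it can't get rather than
retrying them:

	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
		struct xfs_icwalk	icw = {0};

		/*
		 * Best-effort flush: no XFS_ICWALK_FLAG_SYNC, so inodes
		 * whose IOLOCK is held (e.g. a read() blocked on the
		 * page fault freeze level) are skipped, not waited on.
		 */
		xfs_blockgc_free_space(mp, &icw);
	}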