On Wed, Jan 31, 2024 at 05:20:10AM +0000, Matthew Wilcox wrote: > On Tue, Jan 30, 2024 at 11:58:22PM -0500, Theodore Ts'o wrote: > > Hmm, could XFS pre-fault target memory buffer for the bulkstat output > > before starting its transaction? Alternatively, ext4 could do a save > > of current->journal_info before starting to process the page fault, > > and restore it when it is done. Both of these seem a bit hacky, and > > the question is indeed, are there other avenues that might cause the > > transaction context nesting, such that a more general solution is > > called for? > > I'd suggest that saving off current->journal_info is risky because > it might cover a real problem where you've taken a pagefault inside > a transaction (eg ext4 faulting while in the middle of a transaction on > the same filesystem that contains the faulting file). Depends. Look at it from the POV of XFS: 1. we can identify current->journal_info as belonging to XFS because of the TRAN magic number in the first 32 bits of the structure. 2. the struct xfs_trans has a pointer to the xfs_mount which points to the VFS superblock, which means we can identify exactly which filesystem instance the transaction belongs to. 3. We can determine if the transaction has a journal reservation from the log ticket the transaction holds. >From these three things, we can identify if we are about to recurse into the same filesystem, and if so, determine if it is safe to run new transactions from within that context. > Seems to me that we shouldn't be writing to userspace while in the > middle of a transaction. We could even assert that in copy_to_user()? On further thinking I don't think the problem is taking page faults within a filesystem transaction context is the problem here. We're using the transaction context for automated garbage collection so we never have to care about leaking object references in the code, not because we are running a modification and holding a journal reservation. It's the journals reservations that cannot be allowed to nest in XFS, not the transactions themselves. IOWs, the real problem is that multiple filesystems use the same task field for their own individual purposes, and that then leads these sorts of recursion problems when a task jumps from one filesystem context to another and the task field is then mis-interpretted. I've had a look at the XFS usage of current->journal_info, and I think we can just remove it. It's used for warning that we are running a writepages operations from within transaction context (which should never happen from XFS nor memory reclaim). it's also used in the ->destroy_inode path to determine if we are running in a context where we cannot block on filesystem locks or transaction reservations. The former we can remove with no loss, the latter we can simply check PF_MEMALLOC_NOFS as that is set by transaction contexts. IOWs, I think that, in general, page faults in user buffers are fine in empty XFS transaction contexts as long as we aren't using some structure that some other filesystem might then to process the page fault.... This may not be true for other filesystems, but I don't think we can really say "page faults in any filesystem transaction are unsafe and should be banned".... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx