Re: [PATCH] xfs: don't use current->journal_info

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Wed, 21 Feb 2024 15:25:36 -0800

On Thu, Feb 22, 2024 at 09:47:23AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> syzbot reported an ext4 panic during a page fault where found a
> journal handle when it didn't expect to find one. The structure
> it tripped over had a value of 'TRAN' in the first entry in the
> structure, and that indicates it tripped over a struct xfs_trans
> instead of a jbd2 handle.
> 
> The reason for this is that the page fault was taken during a
> copy-out to a user buffer from an xfs bulkstat operation. XFS uses
> an "empty" transaction context for bulkstat to do automated metadata
> buffer cleanup, and so the transaction context is valid across the
> copyout of the bulkstat info into the user buffer.
> 
> We are using empty transaction contexts like this in XFS in more
> places to reduce the risk of failing to release objects we reference
> during the operation, especially during error handling. Hence we
> really need to ensure that we can take page faults from these
> contexts without leaving landmines for the code processing the page
> fault to trip over.
> 
> We really only use current->journal_info for a single warning check
> in xfs_vm_writepages() to ensure we aren't doing writeback from a
> transaction context. Writeback might need to do allocation, so it
> can need to run transactions itself. Hence it's a debug check to
> warn us that we've done something silly, and largely it is not all
> that useful.
> 
> So let's just remove all the use of current->journal_info in XFS and
> get rid of all the potential issues from nested contexts where
> current->journal_info might get misused by another filesytsem
> context.

I wonder if this is too much, though?

Conceptually, I'd rather we set current->journal_info to some random
number whenever we start a !NO_WRITECOUNT (aka a non-empty) transaction
and clear it during transaction commit/cancel.  That way, *we* can catch
the case where some other filesystem starts a transaction and
accidentally bounces into an updating write fault inside XFS.

That might be outweighed by the weird semantics of ext4 where the
zeroness of journal_info changes its behavior in ways I don't want to
understand.  Thoughts?

--D

> Reported-by: syzbot+cdee56dbcdf0096ef605@xxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/scrub/common.c | 4 +---
>  fs/xfs/xfs_aops.c     | 7 -------
>  fs/xfs/xfs_icache.c   | 8 +++++---
>  fs/xfs/xfs_trans.h    | 9 +--------
>  4 files changed, 7 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index 81f2b96bb5a7..d853348a48c8 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -1000,9 +1000,7 @@ xchk_irele(
>  	struct xfs_scrub	*sc,
>  	struct xfs_inode	*ip)
>  {
> -	if (current->journal_info != NULL) {
> -		ASSERT(current->journal_info == sc->tp);
> -
> +	if (sc->tp) {
>  		/*
>  		 * If we are in a transaction, we /cannot/ drop the inode
>  		 * ourselves, because the VFS will trigger writeback, which
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 813f85156b0c..bc3b649d29c4 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -502,13 +502,6 @@ xfs_vm_writepages(
>  {
>  	struct xfs_writepage_ctx wpc = { };
>  
> -	/*
> -	 * Writing back data in a transaction context can result in recursive
> -	 * transactions. This is bad, so issue a warning and get out of here.
> -	 */
> -	if (WARN_ON_ONCE(current->journal_info))
> -		return 0;
> -
>  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
>  	return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
>  }
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 06046827b5fe..9b966af7d55c 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -2030,8 +2030,10 @@ xfs_inodegc_want_queue_work(
>   *  - Memory shrinkers queued the inactivation worker and it hasn't finished.
>   *  - The queue depth exceeds the maximum allowable percpu backlog.
>   *
> - * Note: If the current thread is running a transaction, we don't ever want to
> - * wait for other transactions because that could introduce a deadlock.
> + * Note: If we are in a NOFS context here (e.g. current thread is running a
> + * transaction) the we don't want to block here as inodegc progress may require
> + * filesystem resources we hold to make progress and that could result in a
> + * deadlock. Hence we skip out of here if we are in a scoped NOFS context.
>   */
>  static inline bool
>  xfs_inodegc_want_flush_work(
> @@ -2039,7 +2041,7 @@ xfs_inodegc_want_flush_work(
>  	unsigned int		items,
>  	unsigned int		shrinker_hits)
>  {
> -	if (current->journal_info)
> +	if (current->flags & PF_MEMALLOC_NOFS)
>  		return false;
>  
>  	if (shrinker_hits > 0)
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index c6d0795085a3..2bd673715ace 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -269,19 +269,14 @@ static inline void
>  xfs_trans_set_context(
>  	struct xfs_trans	*tp)
>  {
> -	ASSERT(current->journal_info == NULL);
>  	tp->t_pflags = memalloc_nofs_save();
> -	current->journal_info = tp;
>  }
>  
>  static inline void
>  xfs_trans_clear_context(
>  	struct xfs_trans	*tp)
>  {
> -	if (current->journal_info == tp) {
> -		memalloc_nofs_restore(tp->t_pflags);
> -		current->journal_info = NULL;
> -	}
> +	memalloc_nofs_restore(tp->t_pflags);
>  }
>  
>  static inline void
> @@ -289,10 +284,8 @@ xfs_trans_switch_context(
>  	struct xfs_trans	*old_tp,
>  	struct xfs_trans	*new_tp)
>  {
> -	ASSERT(current->journal_info == old_tp);
>  	new_tp->t_pflags = old_tp->t_pflags;
>  	old_tp->t_pflags = 0;
> -	current->journal_info = new_tp;
>  }
>  
>  #endif	/* __XFS_TRANS_H__ */
> -- 
> 2.43.0
> 
>