On Fri, Jun 28, 2019 at 07:37:17AM +0200, Christoph Hellwig wrote: > On Thu, Jun 27, 2019 at 03:30:30PM -0700, Darrick J. Wong wrote: > > I think the wording of this is too indirect. The reason we need to set > > NOFS is because we could be doing writeback as part of reclaiming > > memory, which means that we cannot recurse back into filesystems to > > satisfy the memory allocation needed to create a transaction. The NOFS > > part applies to any memory allocation, of course. > > > > If you're fine with the wording below I'll just edit that into the > > patch: > > > > /* > > * We can allocate memory here while doing writeback on behalf of > > * memory reclaim. To avoid memory allocation deadlocks set the > > * task-wide nofs context for the following operations. > > */ > > nofs_flag = memalloc_nofs_save(); > > Fine with me. > > > > trace_xfs_end_io_direct_write(ip, offset, size); > > > @@ -395,10 +396,11 @@ xfs_dio_write_end_io( > > > */ > > > XFS_STATS_ADD(ip->i_mount, xs_write_bytes, size); > > > > > > + nofs_flag = memalloc_nofs_save(); > > > > Hmm, do we need this here? I can't remember if there was a need for > > setting NOFS under xfs_reflink_end_cow from a dio completion or if that > > was purely the buffered writeback case... > > We certainly had to add it for the unwritten extent conversion, maybe > the corner case just didn't manage to show up for COW yet: AFAICT the referenced patch solved the problem of writeback ioend completion deadlocking with memory reclaim by changing the transaction allocation call in the xfs_iomap_write_unwritten function, which is called by writeback ioend completion. However, the directio endio code also calls xfs_iomap_write_unwritten. I can't tell if NOFS is actually needed in that context, or if we've just been operating like that for a decade because that's the method that was chosen to solve the deadlock. I think this boils down to -- does writeback induced by memory reclaim ever block on directio? I don't think it does, but as a counterpoint xfs has been like this for 10 years and there don't seem to be many complaints about directio endio pushing memory reclaim too hard... --D > > commit 80641dc66a2d6dfb22af4413227a92b8ab84c7bb > Author: Christoph Hellwig <hch@xxxxxxxxxxxxx> > Date: Mon Oct 19 04:00:03 2009 +0000 > > xfs: I/O completion handlers must use NOFS allocations > > When completing I/O requests we must not allow the memory allocator to > recurse into the filesystem, as we might deadlock on waiting for the > I/O completion otherwise. The only thing currently allocating normal > GFP_KERNEL memory is the allocation of the transaction structure for > the unwritten extent conversion. Add a memflags argument to > _xfs_trans_alloc to allow controlling the allocator behaviour. > > Signed-off-by: Christoph Hellwig <hch@xxxxxx> > Reported-by: Thomas Neumann <tneumann@xxxxxxxxxxxxxxxxxxxxx> > Tested-by: Thomas Neumann <tneumann@xxxxxxxxxxxxxxxxxxxxx> > Reviewed-by: Alex Elder <aelder@xxxxxxx> > Signed-off-by: Alex Elder <aelder@xxxxxxx> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c > index 2d0b3e1da9e6..6f83f58c099f 100644 > --- a/fs/xfs/xfs_fsops.c > +++ b/fs/xfs/xfs_fsops.c > @@ -611,7 +611,7 @@ xfs_fs_log_dummy( > xfs_inode_t *ip; > int error; > > - tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1); > + tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP); > error = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0); > if (error) { > xfs_trans_cancel(tp, 0); > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 67ae5555a30a..7294abce6ef2 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -860,8 +860,15 @@ xfs_iomap_write_unwritten( > * set up a transaction to convert the range of extents > * from unwritten to real. Do allocations in a loop until > * we have covered the range passed in. > + * > + * Note that we open code the transaction allocation here > + * to pass KM_NOFS--we can't risk to recursing back into > + * the filesystem here as we might be asked to write out > + * the same inode that we complete here and might deadlock > + * on the iolock. > */ > - tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE); > + xfs_wait_for_freeze(mp, SB_FREEZE_TRANS); > + tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS); > tp->t_flags |= XFS_TRANS_RESERVE; > error = xfs_trans_reserve(tp, resblks, > XFS_WRITE_LOG_RES(mp), 0, > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > index 8b6c9e807efb..4d509f742bd2 100644 > --- a/fs/xfs/xfs_mount.c > +++ b/fs/xfs/xfs_mount.c > @@ -1471,7 +1471,7 @@ xfs_log_sbcount( > if (!xfs_sb_version_haslazysbcount(&mp->m_sb)) > return 0; > > - tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT); > + tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP); > error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0, > XFS_DEFAULT_LOG_COUNT); > if (error) { > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 66b849358e62..237badcbac3b 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -236,19 +236,20 @@ xfs_trans_alloc( > uint type) > { > xfs_wait_for_freeze(mp, SB_FREEZE_TRANS); > - return _xfs_trans_alloc(mp, type); > + return _xfs_trans_alloc(mp, type, KM_SLEEP); > } > > xfs_trans_t * > _xfs_trans_alloc( > xfs_mount_t *mp, > - uint type) > + uint type, > + uint memflags) > { > xfs_trans_t *tp; > > atomic_inc(&mp->m_active_trans); > > - tp = kmem_zone_zalloc(xfs_trans_zone, KM_SLEEP); > + tp = kmem_zone_zalloc(xfs_trans_zone, memflags); > tp->t_magic = XFS_TRANS_MAGIC; > tp->t_type = type; > tp->t_mountp = mp; > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > index ed47fc77759c..a0574f593f52 100644 > --- a/fs/xfs/xfs_trans.h > +++ b/fs/xfs/xfs_trans.h > @@ -924,7 +924,7 @@ typedef struct xfs_trans { > * XFS transaction mechanism exported interfaces. > */ > xfs_trans_t *xfs_trans_alloc(struct xfs_mount *, uint); > -xfs_trans_t *_xfs_trans_alloc(struct xfs_mount *, uint); > +xfs_trans_t *_xfs_trans_alloc(struct xfs_mount *, uint, uint); > xfs_trans_t *xfs_trans_dup(xfs_trans_t *); > int xfs_trans_reserve(xfs_trans_t *, uint, uint, uint, > uint, uint);