On Fri, Apr 09, 2021 at 09:47:24AM -0400, Brian Foster wrote: > On Wed, Apr 07, 2021 at 08:31:52AM -0700, Darrick J. Wong wrote: > > On Mon, Apr 05, 2021 at 10:59:00AM -0400, Brian Foster wrote: > > > Per-inode ioend completion batching has a log reservation deadlock > > > vector between preallocated append transactions and transactions > > > that are acquired at completion time for other purposes (i.e., > > > unwritten extent conversion or COW fork remaps). For example, if the > > > ioend completion workqueue task executes on a batch of ioends that > > > are sorted such that an append ioend sits at the tail, it's possible > > > for the outstanding append transaction reservation to block > > > allocation of transactions required to process preceding ioends in > > > the list. > > > > > > Append ioend completion is historically the common path for on-disk > > > inode size updates. While file extending writes may have completed > > > sometime earlier, the on-disk inode size is only updated after > > > successful writeback completion. These transactions are preallocated > > > serially from writeback context to mitigate concurrency and > > > associated log reservation pressure across completions processed by > > > multi-threaded workqueue tasks. > > > > > > However, now that delalloc blocks unconditionally map to unwritten > > > extents at physical block allocation time, size updates via append > > > ioends are relatively rare. This means that inode size updates most > > > commonly occur as part of the preexisting completion time > > > transaction to convert unwritten extents. As a result, there is no > > > longer a strong need to preallocate size update transactions. > > > > > > Remove the preallocation of inode size update transactions to avoid > > > the ioend completion processing log reservation deadlock. Instead, > > > continue to send all potential size extending ioends to workqueue > > > context for completion and allocate the transaction from that > > > context. This ensures that no outstanding log reservation is owned > > > by the ioend completion worker task when it begins to process > > > ioends. > > > > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> > > > --- > > > fs/xfs/xfs_aops.c | 45 +++------------------------------------------ > > > 1 file changed, 3 insertions(+), 42 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > > > index 1cc7c36d98e9..c1951975bd6a 100644 > > > --- a/fs/xfs/xfs_aops.c > > > +++ b/fs/xfs/xfs_aops.c > > > @@ -39,33 +39,6 @@ static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend) > > > XFS_I(ioend->io_inode)->i_d.di_size; > > > } > > > > > > -STATIC int > > > -xfs_setfilesize_trans_alloc( > > > - struct iomap_ioend *ioend) > > > -{ > > > - struct xfs_mount *mp = XFS_I(ioend->io_inode)->i_mount; > > > - struct xfs_trans *tp; > > > - int error; > > > - > > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp); > > > - if (error) > > > - return error; > > > - > > > - ioend->io_private = tp; > > > - > > > - /* > > > - * We may pass freeze protection with a transaction. So tell lockdep > > > - * we released it. > > > - */ > > > - __sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS); > > > - /* > > > - * We hand off the transaction to the completion thread now, so > > > - * clear the flag here. > > > - */ > > > - xfs_trans_clear_context(tp); > > > - return 0; > > > -} > > > - > > > /* > > > * Update on-disk file size now that data has been written to disk. > > > */ > > > @@ -182,12 +155,10 @@ xfs_end_ioend( > > > error = xfs_reflink_end_cow(ip, offset, size); > > > > Seems reasonable to me. xfs_reflink_end_cow_extent should probably > > learn how to extend the ondisk EOF as patch 6/4. > > > > After looking into this since multiple people asked for it, I'm not as > convinced it's the right approach as I was on first thought. The reason > being that the reflink completion code actually remaps in reverse order > (I suspect due to depending on the same behavior of the bmap unmap code, > which is used to pull extents from the COW fork..). Yes. There's no algorithmic reason why _end_cow has to call bunmapi directly (and thereby remap in reverse order) -- it could log a bunmap intent item for the data fork before logging the refcount btree intent and the bmap remap intent. With such a rework in place, we would effectively be doing the remap in order of increasing file offset, instead of forwards-backwards like we do now. If you want to chase that, you might also want to look at this patch[1] that eliminates the need to requeue bunmap operations. [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=reflink-speedups&id=323b0acb563ea50bfff79801083f5ba8544c7985 > That means that while inter-ioend completion takes some care (even if > not necessarily guaranteed) to order on-disk size updates against > completion within the associated i_size, a size update along with the > COW fork remap transaction could result in the opposite in some cases > (i.e. if the cow fork is fragmented relative to the data fork). I > suspect this is a rare situation and not a critical problem, but at the > same time I find it a bit odd to implement intra-ioend ordering > inconsistent with the broader inter-ioend logic. > > I could probably be convinced otherwise, but I think this warrants more > thought and is not necessarily a straightforward cleanup. I'm also not > sure how common disk size updates via COW fork I/O completion really are > in the first place, so I'm not sure it's worth the oddness to save an > extra transaction. In any event, I'll probably leave this off from this > series for now because it doesn't have much to do with fixing the > deadlock problem and can revisit it if anybody has thoughts on any of > the above.. The only case I can think of is an extending write to a reflinked file if (say) you snapshot /var on a live system. --D > > Brian > > > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > --D > > > > > else if (ioend->io_type == IOMAP_UNWRITTEN) > > > error = xfs_iomap_write_unwritten(ip, offset, size, false); > > > - else > > > - ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private); > > > > > > done: > > > - if (ioend->io_private) > > > - error = xfs_setfilesize_ioend(ioend, error); > > > + if (!error && xfs_ioend_is_append(ioend)) > > > + error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size); > > > iomap_finish_ioends(ioend, error); > > > memalloc_nofs_restore(nofs_flag); > > > } > > > @@ -237,7 +208,7 @@ xfs_end_io( > > > > > > static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend) > > > { > > > - return ioend->io_private || > > > + return xfs_ioend_is_append(ioend) || > > > ioend->io_type == IOMAP_UNWRITTEN || > > > (ioend->io_flags & IOMAP_F_SHARED); > > > } > > > @@ -250,8 +221,6 @@ xfs_end_bio( > > > struct xfs_inode *ip = XFS_I(ioend->io_inode); > > > unsigned long flags; > > > > > > - ASSERT(xfs_ioend_needs_workqueue(ioend)); > > > - > > > spin_lock_irqsave(&ip->i_ioend_lock, flags); > > > if (list_empty(&ip->i_ioend_list)) > > > WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue, > > > @@ -501,14 +470,6 @@ xfs_prepare_ioend( > > > ioend->io_offset, ioend->io_size); > > > } > > > > > > - /* Reserve log space if we might write beyond the on-disk inode size. */ > > > - if (!status && > > > - ((ioend->io_flags & IOMAP_F_SHARED) || > > > - ioend->io_type != IOMAP_UNWRITTEN) && > > > - xfs_ioend_is_append(ioend) && > > > - !ioend->io_private) > > > - status = xfs_setfilesize_trans_alloc(ioend); > > > - > > > memalloc_nofs_restore(nofs_flag); > > > > > > if (xfs_ioend_needs_workqueue(ioend)) > > > -- > > > 2.26.3 > > > > > >