On Thu, Jan 20, 2022 at 02:47:33PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Trond Myklebust reported soft lockups in XFS IO completion such as
> this:
>
> watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
> CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
> Workqueue: xfs-conv/md127 xfs_end_io [xfs]
> RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
> Call Trace:
>  wake_up_page_bit+0x8a/0x110
>  iomap_finish_ioend+0xd7/0x1c0
>  iomap_finish_ioends+0x7f/0xb0
>  xfs_end_ioend+0x6b/0x100 [xfs]
>  xfs_end_io+0xb9/0xe0 [xfs]
>  process_one_work+0x1a7/0x360
>  worker_thread+0x1fa/0x390
>  kthread+0x116/0x130
>  ret_from_fork+0x35/0x40
>
> Ioends are processed as an atomic completion unit when all the
> chained bios in the ioend have completed their IO. Logically
> contiguous ioends can also be merged and completed as a single,
> larger unit. Both of these things can be problematic as both the
> bio chains per ioend and the size of the merged ioends processed
> as a single completion are unbound.
>
> If we have a large sequential dirty region in the page cache,
> write_cache_pages() will keep feeding us sequential pages and we
> will keep mapping them into ioends and bios until we get a dirty
> page at a non-sequential file offset. These large sequential runs
> will result in bio and ioend chaining to optimise the IO
> patterns. The pages under writeback are pinned within these chains
> until the submission chaining is broken, allowing the entire chain
> to be completed. This can result in huge chains being processed
> in IO completion context.
>
> We get deep bio chaining if we have large contiguous physical
> extents. We will keep adding pages to the current bio until it is
> full, then we'll chain a new bio to keep adding pages for writeback.
> Hence we can build bio chains that map millions of pages and tens of
> gigabytes of RAM if the page cache contains big enough contiguous
> dirty file regions. This long bio chain pins those pages until the
> final bio in the chain completes and the ioend can iterate all the
> chained bios and complete them.
>
> OTOH, if we have a physically fragmented file, we end up submitting
> one ioend per physical fragment, each with a small bio or bio
> chain attached to it. We do not chain these at IO submission time,
> but instead we chain them at completion time based on file
> offset via iomap_ioend_try_merge(). Hence we can end up with unbound
> ioend chains being built via completion merging.
>
> XFS can then do COW remapping or unwritten extent conversion on that
> merged chain, which involves walking an extent fragment at a time
> and running a transaction to modify the physical extent information.
> IOWs, we merge all the discontiguous ioends together into a
> contiguous file range, only to then process them individually as
> discontiguous extents.
>
> This extent manipulation is computationally expensive and can run in
> a tight loop, so merging logically contiguous but physically
> discontiguous ioends gains us nothing except for hiding the fact
> that we broke the ioends up into individual physical extents at
> submission and then need to loop over those individual physical
> extents at completion.
>
> Hence we need to have mechanisms to limit ioend sizes and
> to break up completion processing of large merged ioend chains:
>
> 1. bio chains per ioend need to be bound in length.
> Pure overwrites go straight to iomap_finish_ioend() in softirq
> context with the exact bio chain attached to the ioend by
> submission. Hence the only way to prevent long holdoffs here is to
> bound ioend submission sizes because we can't reschedule in softirq
> context.
>
> 2. iomap_finish_ioends() has to handle unbound merged ioend chains
> correctly. This relies on any one call to iomap_finish_ioend() being
> bound in runtime so that cond_resched() can be issued regularly as
> the long ioend chain is processed. i.e. this relies on mechanism #1
> to limit individual ioend sizes to work correctly.
>
> 3. filesystems have to loop over the merged ioends to process
> physical extent manipulations. This means they can loop internally,
> and so we break merging at physical extent boundaries so the
> filesystem can easily insert reschedule points between individual
> extent manipulations.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Reported-and-tested-by: Trond Myklebust <trondmy@xxxxxxxxxxxxxxx>

Looks good, sorry I didn't get to this earlier...

Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> ---
>  fs/iomap/buffered-io.c | 52 ++++++++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_aops.c      | 16 ++++++++++++-
>  include/linux/iomap.h  |  2 ++
>  3 files changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c938bbad075e..6c51a75d0be6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -21,6 +21,8 @@
>
>  #include "../internal.h"
>
> +#define IOEND_BATCH_SIZE	4096
> +
>  /*
>   * Structure allocated for each folio when block size < folio size
>   * to track sub-folio uptodate status and I/O completions.
> @@ -1039,7 +1041,7 @@ static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
>   * state, release holds on bios, and finally free up memory.  Do not use the
>   * ioend after this.
>   */
> -static void
> +static u32
>  iomap_finish_ioend(struct iomap_ioend *ioend, int error)
>  {
>  	struct inode *inode = ioend->io_inode;
> @@ -1048,6 +1050,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
>  	u64 start = bio->bi_iter.bi_sector;
>  	loff_t offset = ioend->io_offset;
>  	bool quiet = bio_flagged(bio, BIO_QUIET);
> +	u32 folio_count = 0;
>
>  	for (bio = &ioend->io_inline_bio; bio; bio = next) {
>  		struct folio_iter fi;
> @@ -1062,9 +1065,11 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
>  		next = bio->bi_private;
>
>  		/* walk all folios in bio, ending page IO on them */
> -		bio_for_each_folio_all(fi, bio)
> +		bio_for_each_folio_all(fi, bio) {
>  			iomap_finish_folio_write(inode, fi.folio, fi.length,
>  					error);
> +			folio_count++;
> +		}
>  		bio_put(bio);
>  	}
>  	/* The ioend has been freed by bio_put() */
> @@ -1074,20 +1079,36 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
>  			"%s: writeback error on inode %lu, offset %lld, sector %llu",
>  			inode->i_sb->s_id, inode->i_ino, offset, start);
>  	}
> +	return folio_count;
>  }
>
> +/*
> + * Ioend completion routine for merged bios. This can only be called from task
> + * contexts as merged ioends can be of unbound length. Hence we have to break up
> + * the writeback completions into manageable chunks to avoid long scheduler
> + * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
> + * good batch processing throughput without creating adverse scheduler latency
> + * conditions.
> + */
>  void
>  iomap_finish_ioends(struct iomap_ioend *ioend, int error)
>  {
>  	struct list_head tmp;
> +	u32 completions;
> +
> +	might_sleep();
>
>  	list_replace_init(&ioend->io_list, &tmp);
> -	iomap_finish_ioend(ioend, error);
> +	completions = iomap_finish_ioend(ioend, error);
>
>  	while (!list_empty(&tmp)) {
> +		if (completions > IOEND_BATCH_SIZE * 8) {
> +			cond_resched();
> +			completions = 0;
> +		}
>  		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
>  		list_del_init(&ioend->io_list);
> -		iomap_finish_ioend(ioend, error);
> +		completions += iomap_finish_ioend(ioend, error);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(iomap_finish_ioends);
> @@ -1108,6 +1129,18 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
>  		return false;
>  	if (ioend->io_offset + ioend->io_size != next->io_offset)
>  		return false;
> +	/*
> +	 * Do not merge physically discontiguous ioends. The filesystem
> +	 * completion functions will have to iterate the physical
> +	 * discontiguities even if we merge the ioends at a logical level, so
> +	 * we don't gain anything by merging physical discontiguities here.
> +	 *
> +	 * We cannot use bio->bi_iter.bi_sector here as it is modified during
> +	 * submission so does not point to the start sector of the bio at
> +	 * completion.
> +	 */
> +	if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector)
> +		return false;
>  	return true;
>  }
>
> @@ -1209,8 +1242,10 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
>  	ioend->io_flags = wpc->iomap.flags;
>  	ioend->io_inode = inode;
>  	ioend->io_size = 0;
> +	ioend->io_folios = 0;
>  	ioend->io_offset = offset;
>  	ioend->io_bio = bio;
> +	ioend->io_sector = sector;
>  	return ioend;
>  }
>
> @@ -1251,6 +1286,13 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
>  		return false;
>  	if (sector != bio_end_sector(wpc->ioend->io_bio))
>  		return false;
> +	/*
> +	 * Limit ioend bio chain lengths to minimise IO completion latency. This
> +	 * also prevents long tight loops ending page writeback on all the
> +	 * folios in the ioend.
> +	 */
> +	if (wpc->ioend->io_folios >= IOEND_BATCH_SIZE)
> +		return false;
>  	return true;
>  }
>
> @@ -1335,6 +1377,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  				&submit_list);
>  		count++;
>  	}
> +	if (count)
> +		wpc->ioend->io_folios++;
>
>  	WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
>  	WARN_ON_ONCE(!folio_test_locked(folio));
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 2705f91bdd0d..9d6a67c7d227 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -136,7 +136,20 @@ xfs_end_ioend(
>  	memalloc_nofs_restore(nofs_flag);
>  }
>
> -/* Finish all pending io completions. */
> +/*
> + * Finish all pending IO completions that require transactional modifications.
> + *
> + * We try to merge physically and logically contiguous ioends before completion
> + * to minimise the number of transactions we need to perform during IO
> + * completion. Both unwritten extent conversion and COW remapping need to
> + * iterate and modify one physical extent at a time, so we gain nothing by
> + * merging physically discontiguous extents here.
> + *
> + * The ioend chain we can be processing here is largely unbound in length
> + * and we may have to perform significant amounts of work on each ioend
> + * to complete it. Hence we have to be careful about holding the CPU for too
> + * long in this loop.
> + */
>  void
>  xfs_end_io(
>  	struct work_struct *work)
> @@ -157,6 +170,7 @@ xfs_end_io(
>  		list_del_init(&ioend->io_list);
>  		iomap_ioend_try_merge(ioend, &tmp);
>  		xfs_end_ioend(ioend);
> +		cond_resched();
>  	}
>  }
>
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index b55bd49e55f5..97a3a2edb585 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -263,9 +263,11 @@ struct iomap_ioend {
>  	struct list_head	io_list;	/* next ioend in chain */
>  	u16			io_type;
>  	u16			io_flags;	/* IOMAP_F_* */
> +	u32			io_folios;	/* folios added to ioend */
>  	struct inode		*io_inode;	/* file being written to */
>  	size_t			io_size;	/* size of the extent */
>  	loff_t			io_offset;	/* offset in the file */
> +	sector_t		io_sector;	/* start sector of ioend */
>  	struct bio		*io_bio;	/* bio being built */
>  	struct bio		io_inline_bio;	/* MUST BE LAST! */
>  };
> --
> 2.33.0
>
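
As an aside for anyone reading along: stripped of the iomap specifics, the
completion batching above boils down to the pattern sketched below. This is
only a hedged, userspace-runnable illustration, not kernel code -- work_item,
finish_one() and BATCH_LIMIT are hypothetical names, and sched_yield() merely
stands in for cond_resched():

/*
 * Illustrative sketch of the batched completion pattern the patch applies
 * in iomap_finish_ioends(). All names here are hypothetical.
 */
#include <sched.h>
#include <stdio.h>

#define BATCH_LIMIT	4096	/* analogue of IOEND_BATCH_SIZE * 8 */

struct work_item {
	struct work_item *next;		/* singly linked chain, like io_list */
	unsigned int nr_folios;		/* work units this item completes */
};

/* Complete one item and report how much work it represented. */
static unsigned int finish_one(struct work_item *item)
{
	return item->nr_folios;
}

/*
 * Walk an arbitrarily long chain, but never do more than BATCH_LIMIT units
 * of work without offering to reschedule. This bounds the holdoff any single
 * call can impose on the scheduler.
 */
static void finish_chain(struct work_item *head)
{
	unsigned int completions = 0;

	for (struct work_item *item = head; item; item = item->next) {
		if (completions > BATCH_LIMIT) {
			sched_yield();		/* cond_resched() analogue */
			completions = 0;
		}
		completions += finish_one(item);
	}
}

int main(void)
{
	struct work_item b = { .next = NULL, .nr_folios = 3000 };
	struct work_item a = { .next = &b, .nr_folios = 2000 };

	finish_chain(&a);
	printf("chain processed in bounded batches\n");
	return 0;
}

The point is simply that any one call does a bounded amount of work between
yield points, which is what keeps the soft lockup watchdog quiet.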
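
Similarly, the physical-contiguity test added to iomap_ioend_can_merge() can
be read in isolation as the sketch below (again with hypothetical stand-alone
types; the byte-to-sector conversion is the same >> 9 used in the patch):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two ioend fields the merge check uses. */
struct ioend_key {
	uint64_t io_sector;	/* start sector (512 byte units) on disk */
	uint64_t io_size;	/* extent length in bytes */
};

/* next may only be merged if it starts exactly where prev ends on disk. */
static bool physically_contiguous(const struct ioend_key *prev,
				  const struct ioend_key *next)
{
	return prev->io_sector + (prev->io_size >> 9) == next->io_sector;
}

int main(void)
{
	struct ioend_key a = { .io_sector = 1000, .io_size = 8192 };
	struct ioend_key b = { .io_sector = 1016, .io_size = 4096 };
	struct ioend_key c = { .io_sector = 2048, .io_size = 4096 };

	assert(physically_contiguous(&a, &b));	/* 1000 + 16 == 1016 */
	assert(!physically_contiguous(&a, &c));	/* hole between them */
	return 0;
}

Logically contiguous but physically discontiguous ioends fail this check and
stay separate, so the filesystem only ever sees one physical extent's worth of
transactional work per completion call.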