Re: [GIT PULL] xfs: xlog_write rework and CIL scalability

On Tue, Jan 11, 2022 at 04:04:37PM +1100, Dave Chinner wrote:
> On Thu, Jan 06, 2022 at 01:40:33PM -0800, Darrick J. Wong wrote:
> > On Fri, Dec 10, 2021 at 11:09:56AM +1100, Dave Chinner wrote:
> > > Hi Darrick,
> > > 
> > > Can you please pull the following changes from the tag listed below
> > > for the XFS dev tree?
> > 
> > Hi Dave,
> > 
> > I tried, but the regressions with generic/017 persist.  It trips the
> > ticket reservation pretty consistently within 45-60 seconds of starting,
> > at least on the OCI VM that I created.  /dev/sd[ab] are (software
> > defined) disks that can sustain reads of ~50MB/s and ~5000iops; and
> > writes of about half those numbers.
> > 
> >  run fstests generic/017 at 2022-01-06 13:18:59
> >  XFS (sda4): Mounting V5 Filesystem
> >  XFS (sda4): Ending clean mount
> >  XFS (sda4): Quotacheck needed: Please wait.
> >  XFS (sda4): Quotacheck: Done.
> >  XFS (sda4): ctx ticket reservation ran out. Need to up reservation
> >  XFS (sda4): ticket reservation summary:
> >  XFS (sda4):   unit res    = 548636 bytes
> >  XFS (sda4):   current res = -76116 bytes
> >  XFS (sda4):   original count  = 1
> >  XFS (sda4):   remaining count = 1
> >  XFS (sda4): Log I/O Error (0x2) detected at xlog_write+0x5ee/0x660 [xfs] (fs/xfs/xfs_log.c:2499).  Shutting down filesystem.
> >  XFS (sda4): Please unmount the filesystem and rectify the problem(s)
> >  XFS (sda3): Unmounting Filesystem
> >  XFS (sda4): Unmounting Filesystem
> 
> Ok, I *think* I've worked out what was going on here. The patch
> below has run several hundred iterations of g/017 with an external
> log on two different fs/log size configurations that typically
> reproduced the problem within 10 cycles.
> 
> Essentially, the problem is largely caused by using
> XLOG_CIL_BLOCKING_SPACE_LIMIT() instead of XLOG_CIL_SPACE_LIMIT()
> when determining how much used space we can allow the percpu
> counters to accumulate before aggregating them back into the global
> counter. Using the hard limit meant that we could accumulate almost
> the entire hard limit before we aggregated even a single percpu
> value back into the global counter, and so we failed to trigger
> either condition for aggregation until we'd effectively blown
> through the hard limit.
> 
> This then meant the extra reservations that need to be taken for
> space used beyond the hard limit didn't get stolen for the ctx
> ticket, and so the ctx ticket then overran.
> 
> It also means that we could overrun the hard limit substantially
> before throttling kicked in. With the percpu aggregation threshold
> brought back down to (soft limit / num online cpus), we are
> guaranteed to always start aggregation back into the global counter
> before or at the point in time the soft limit should be hit, meaning
> that we start updating the global counter much sooner and so it
> tracks actual space used much more closely once we are over the
> soft limit.
> 
> Darrick, can you rerun the branch with the patch below also included, and
> see if it reproduces on your setup? If it does, can you grab a trace
> of the trace_printk() calls I left in the patch?

Ok, I'll do that and report back.

> Note that this change does not make the algorithm fully correct - we
> can still have accumulation on other CPUs that isn't folded back
> into the global value. What I want is feedback on whether it makes
> the problem largely go away on configs other than my own before
> spending more time coming up with a better lockless aggregation
> algorithm...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> ---
>  fs/xfs/xfs_log_cil.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 366c0aaad640..47d46d6e15b3 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -96,6 +96,9 @@ xlog_cil_pcp_aggregate(
>  		ctx->ticket->t_curr_res += cilpcp->space_reserved;
>  		ctx->ticket->t_unit_res += cilpcp->space_reserved;
>  		cilpcp->space_reserved = 0;
> +	trace_printk("cilpcp space used %d, reserved %d unit-res %d cur-res %d",
> +			cilpcp->space_used, cilpcp->space_reserved,
> +			ctx->ticket->t_unit_res, ctx->ticket->t_curr_res);
>  
>  		if (!list_empty(&cilpcp->busy_extents)) {
>  			list_splice_init(&cilpcp->busy_extents,
> @@ -515,11 +518,16 @@ xlog_cil_insert_items(
>  	 *
>  	 * This can steal more than we need, but that's OK.
>  	 */
> -	space_used = atomic_read(&ctx->space_used);
> +	space_used = atomic_read(&ctx->space_used) + len;
>  	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
> -	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +	    space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
>  		int	split_res = log->l_iclog_hsize +
>  					sizeof(struct xlog_op_header);
> +
> +	trace_printk("space used %d, len %d iclog hdrs %d, slim %d, hlim %d",
> +			space_used, len, atomic_read(&cil->xc_iclog_hdrs),
> +			XLOG_CIL_SPACE_LIMIT(log),
> +			XLOG_CIL_BLOCKING_SPACE_LIMIT(log));
>  		if (ctx_res)
>  			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
>  		else
> @@ -540,8 +548,9 @@ xlog_cil_insert_items(
>  	cilpcp->space_used += len;
>  	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
>  	    cilpcp->space_used >
> -			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> -					num_online_cpus())) {
> +			(XLOG_CIL_SPACE_LIMIT(log) / num_online_cpus())) {
> +	trace_printk("cilpcp space used %d, reserved %d ctxres %d",
> +			cilpcp->space_used, cilpcp->space_reserved, ctx_res);
>  		atomic_add(cilpcp->space_used, &ctx->space_used);
>  		cilpcp->space_used = 0;
>  	}
> @@ -1331,6 +1340,7 @@ xlog_cil_push_background(
>  
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
> +		trace_printk("push space used %d", space_used);
>  		cil->xc_push_seq = cil->xc_current_sequence;
>  		queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
>  	}
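
To make the aggregation threshold change above easier to follow outside
the kernel tree, here is a minimal single-threaded userspace sketch of
the folding rule. All names (cil_sim, pcp_add, SOFT_LIMIT, NCPUS) are
hypothetical stand-ins, not the XFS implementation: each per-CPU counter
is folded into the global counter once the global counter already shows
the soft limit has been reached or the per-CPU counter exceeds its
1/NCPUS share of the soft limit. As noted above, other CPUs may still
hold unfolded amounts, so this is an approximation rather than a hard
guarantee.

/*
 * Sketch of the per-CPU CIL space folding rule (hypothetical names,
 * not the kernel code). Build with any C11 compiler.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NCPUS		4
#define SOFT_LIMIT	4096	/* stand-in for XLOG_CIL_SPACE_LIMIT() */

struct cil_sim {
	atomic_int	global_used;		/* analogue of ctx->space_used */
	int		pcp_used[NCPUS];	/* analogue of cilpcp->space_used */
};

/* Account @len bytes of CIL space on @cpu. */
static void pcp_add(struct cil_sim *s, int cpu, int len)
{
	int space_used = atomic_load(&s->global_used) + len;

	s->pcp_used[cpu] += len;
	if (space_used >= SOFT_LIMIT ||
	    s->pcp_used[cpu] > SOFT_LIMIT / NCPUS) {
		/* Fold this CPU's accumulation into the global counter. */
		atomic_fetch_add(&s->global_used, s->pcp_used[cpu]);
		s->pcp_used[cpu] = 0;
	}
}

int main(void)
{
	struct cil_sim s = { .global_used = 0 };
	int i;

	/* Spread small insertions across the simulated CPUs. */
	for (i = 0; i < 64; i++)
		pcp_add(&s, i % NCPUS, 128);

	printf("global_used = %d (soft limit %d)\n",
	       atomic_load(&s.global_used), SOFT_LIMIT);
	return 0;
}

With the per-CPU threshold at SOFT_LIMIT / NCPUS rather than the blocking
limit, no single CPU can sit on more than a 1/NCPUS share of the soft
limit before folding, which is the property the fix above relies on.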


