On Thu, Jan 06, 2022 at 01:40:33PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 10, 2021 at 11:09:56AM +1100, Dave Chinner wrote:
> > Hi Darrick,
> >
> > Can you please pull the following changes from the tag listed
> > below for the XFS dev tree?
>
> Hi Dave,
>
> I tried, but the regressions with generic/017 persist. It trips the
> ticket reservation pretty consistently within 45-60 seconds of
> starting, at least on the OCI VM that I created. /dev/sd[ab] are
> (software defined) disks that can sustain reads of ~50MB/s and
> ~5000iops; and writes of about half those numbers.
>
> run fstests generic/017 at 2022-01-06 13:18:59
> XFS (sda4): Mounting V5 Filesystem
> XFS (sda4): Ending clean mount
> XFS (sda4): Quotacheck needed: Please wait.
> XFS (sda4): Quotacheck: Done.
> XFS (sda4): ctx ticket reservation ran out. Need to up reservation
> XFS (sda4): ticket reservation summary:
> XFS (sda4):   unit res        = 548636 bytes
> XFS (sda4):   current res     = -76116 bytes
> XFS (sda4):   original count  = 1
> XFS (sda4):   remaining count = 1
> XFS (sda4): Log I/O Error (0x2) detected at xlog_write+0x5ee/0x660 [xfs] (fs/xfs/xfs_log.c:2499). Shutting down filesystem.
> XFS (sda4): Please unmount the filesystem and rectify the problem(s)
> XFS (sda3): Unmounting Filesystem
> XFS (sda4): Unmounting Filesystem

Ok, I *think* I've worked out what was going on here. The patch below
has survived several hundred iterations of g/017 with an external log
on two different fs/log size configurations that previously
reproduced the failure within 10 cycles.

Essentially, the problem is largely caused by using
XLOG_CIL_BLOCKING_SPACE_LIMIT() instead of XLOG_CIL_SPACE_LIMIT()
when determining how much used space we can allow the percpu
counters to accumulate before aggregating them back into the global
counter.

Using the hard limit meant that we could accumulate almost the
entire hard limit before we aggregated even a single percpu value
back into the global counter, so neither condition for aggregation
triggered until we had effectively blown through the hard limit.
That meant the extra reservations that need to be taken for space
used beyond the hard limit didn't get stolen for the ctx ticket, and
the ticket then overran. It also meant that we could overrun the
hard limit substantially before throttling kicked in.

With the percpu aggregation threshold brought back down to
(soft limit / num online cpus), we are guaranteed to start
aggregating back into the global counter at or before the point in
time the soft limit should be hit. Hence we start updating the
global counter much sooner, and it tracks the actual space used much
more closely once we are over the soft limit (rough numbers in the
PS below).

Darrick, can you rerun the branch with the patch below also included
and see if the problem reproduces on your setup? If it does, can you
grab a trace of the trace_printk() calls I left in the patch? (The
output lands in the ftrace buffer.)

Note that this change does not make the algorithm fully correct - we
can still have accumulation on other CPUs that isn't folded back
into the global value. What I want is feedback on whether it makes
the problem largely go away on configs other than my own before I
spend more time coming up with a better lockless aggregation
algorithm...

Cheers,

Dave.
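PS: to put rough numbers on the aggregation threshold change, here
is a quick userspace sketch of the worst case. It is only a sketch -
the CPU count and the soft/hard limit values are made-up
illustrative numbers, not what any real log would be using:

#include <stdio.h>

#define NCPUS		8
#define SOFT_LIMIT	(32 * 1024 * 1024)	/* stand-in for XLOG_CIL_SPACE_LIMIT() */
#define HARD_LIMIT	(64 * 1024 * 1024)	/* stand-in for XLOG_CIL_BLOCKING_SPACE_LIMIT() */

int main(void)
{
	/* ctx->space_used stand-in: still zero before any percpu folding */
	int global_space_used = 0;

	/*
	 * Old threshold: each CPU could accumulate up to
	 * (hard limit - global) / ncpus before folding back, so with
	 * the global counter still at zero the CPUs collectively hide
	 * almost the entire hard limit from it.
	 */
	int old_thresh = (HARD_LIMIT - global_space_used) / NCPUS;
	printf("old: fold at %d, worst case unfolded %d, hard limit %d\n",
			old_thresh, old_thresh * NCPUS, HARD_LIMIT);

	/*
	 * New threshold: each CPU folds back at soft limit / ncpus,
	 * so aggregation into the global counter must start by the
	 * time a soft limit worth of space has been consumed across
	 * all CPUs.
	 */
	int new_thresh = SOFT_LIMIT / NCPUS;
	printf("new: fold at %d, worst case unfolded %d, soft limit %d\n",
			new_thresh, new_thresh * NCPUS, SOFT_LIMIT);
	return 0;
}

Running that shows the old threshold letting ~64MB of unaggregated
percpu space build up against a 64MB hard limit, while the new one
caps the unaggregated space at the 32MB soft limit.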
--
Dave Chinner
david@xxxxxxxxxxxxx

---
 fs/xfs/xfs_log_cil.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 366c0aaad640..47d46d6e15b3 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -96,6 +96,9 @@ xlog_cil_pcp_aggregate(
 		ctx->ticket->t_curr_res += cilpcp->space_reserved;
 		ctx->ticket->t_unit_res += cilpcp->space_reserved;
 		cilpcp->space_reserved = 0;
+		trace_printk("cilpcp space used %d, reserved %d unit-res %d cur-res %d",
+			cilpcp->space_used, cilpcp->space_reserved,
+			ctx->ticket->t_unit_res, ctx->ticket->t_curr_res);
 
 		if (!list_empty(&cilpcp->busy_extents)) {
 			list_splice_init(&cilpcp->busy_extents,
@@ -515,11 +518,16 @@ xlog_cil_insert_items(
 	 *
 	 * This can steal more than we need, but that's OK.
 	 */
-	space_used = atomic_read(&ctx->space_used);
+	space_used = atomic_read(&ctx->space_used) + len;
 	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
-	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	    space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
 		int	split_res = log->l_iclog_hsize +
 					sizeof(struct xlog_op_header);
+
+		trace_printk("space used %d, len %d iclog hdrs %d, slim %d, hlim %d",
+			space_used, len, atomic_read(&cil->xc_iclog_hdrs),
+			XLOG_CIL_SPACE_LIMIT(log),
+			XLOG_CIL_BLOCKING_SPACE_LIMIT(log));
 		if (ctx_res)
 			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
 		else
@@ -540,8 +548,9 @@ xlog_cil_insert_items(
 		cilpcp->space_used += len;
 		if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
 		    cilpcp->space_used >
-			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
-					num_online_cpus())) {
+			(XLOG_CIL_SPACE_LIMIT(log) / num_online_cpus())) {
+			trace_printk("cilpcp space used %d, reserved %d ctxres %d",
+				cilpcp->space_used, cilpcp->space_reserved, ctx_res);
 			atomic_add(cilpcp->space_used, &ctx->space_used);
 			cilpcp->space_used = 0;
 		}
@@ -1331,6 +1340,7 @@ xlog_cil_push_background(
 
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
+		trace_printk("push space used %d", space_used);
 		cil->xc_push_seq = cil->xc_current_sequence;
 		queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
 	}