On Thu, Oct 03, 2019 at 10:39:34AM -0400, Brian Foster wrote:
> On Thu, Oct 03, 2019 at 10:53:10AM +1000, Dave Chinner wrote:
> > On Wed, Oct 02, 2019 at 08:40:56AM -0400, Brian Foster wrote:
> > > Ok, that's not quite what I was concerned about when you mentioned
> > > waiting on the CIL to flush to disk. No matter, the important bit here
> > > is the performance cost of including the extra blocking on log I/O (to
> > > cycle iclogs back to active for reuse) in the throttle.
> > >
> > > I'm curious about how noticeable this extra blocking would be because
> > > it's one likely cause of the CIL pressure buildup in the first place. My
> > > original tests reproduced huge CIL checkpoints purely based on one CIL
> > > push being blocked behind the processing of another, the latter taking
> > > relatively more time due to log I/O.
> >
> > It's very noticeable. I dropped that as a throttle mechanism because
> > even on an SSD at 200us per write, a 32MB CIL flush takes 20-30ms.
> > And it stops everything dead for that time because it stalls all
> > async transaction commits while the log flushes.
> >
> > When several CIL flushes a second occur (pushing 100-150MB/s to the
> > log), I start seeing a substantial amount of additional time
> > (15-20% of total CPU time) being spent idle just waiting for CIL
> > flush completion. And it only gets worse as the storage gets
> > slower...
> >
> > The throttle that these patches implement is typically only
> > stalling incoming transactions for a couple of hundred microseconds.
> > The timestamps from the groups of log events show that all the
> > threads block very close together in time, and the push work to
> > switch over to the new context to allow them to unblock and continue
> > happens within another 200-300 microseconds. And in the worst case
> > I've seen, this is happening once or twice a second. IOWs, the
> > blocking time of the throttle is very short and largely
> > unnoticeable, and the difference it makes to performance is far,
> > far less than the noise threshold of the benchmarks.
> >
>
> Ok, thanks for the info. Is that 200-300us delay reliable on a sustained
> workload (where the throttle triggers multiple times)? If so, I suppose
> that longer delay I was wondering about is not as likely in practice.

The more threads there are, the longer the capture and delay time,
but I'm not seeing delays long enough to measurably reduce the CPU
consumption of the workloads or the throughput of the benchmark
outside of the normal noise floor.

> > > This is not to say there aren't other causes of excessively sized
> > > checkpoints. Rather, if we're at a point where we've blocked
> > > transactions on this new threshold, that means we've doubled the
> > > background threshold in the time between when we first triggered a
> > > background CIL push and when the push actually started. From that,
> > > it seems fairly likely that we could replenish the CIL to the
> > > background threshold once threads are unblocked but before the
> > > previous push completes.
> >
> > That's just fine. The CIL is actually designed to allow that sort of
> > overlap between multiple committing background contexts and the
> > current aggregating context. As long as each individual CIL commit
> > size doesn't go over half the log and there's always a complete
> > commit in the log (which there should be via head/tail limits),
> > the only limit on the number of concurrent committing CIL contexts
> > on the fly at once is the amount of log space we have available....
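[To make the throttle behaviour discussed above a bit more concrete,
here is a minimal sketch of the shape of that mechanism, not the
actual patch: committers sleep on a wait queue once the aggregating
CIL context passes a hard limit, and the push work wakes them all
when it switches over to a new, empty context. The struct and field
names (example_cil, push_wait, hard_limit, etc.) are invented for
illustration; only the waitqueue/spinlock calls are standard kernel
APIs, and waitqueue/lock initialisation is omitted.]

#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Illustrative structure only - not the real struct xfs_cil. */
struct example_cil {
	unsigned long		space_used;	/* bytes in aggregating context */
	unsigned long		hard_limit;	/* blocking threshold, well above
						 * the background push threshold */
	wait_queue_head_t	push_wait;	/* throttled committers sleep here */
	spinlock_t		push_lock;
};

/*
 * Commit/insert side: if the aggregating context has grown past the
 * hard limit, sleep until the push work switches in a new context.
 * In the behaviour described above this is typically a 200-300us stall.
 */
static void example_cil_insert_throttle(struct example_cil *cil)
{
	DEFINE_WAIT(wait);

	spin_lock(&cil->push_lock);
	while (cil->space_used >= cil->hard_limit) {
		prepare_to_wait(&cil->push_wait, &wait, TASK_UNINTERRUPTIBLE);
		spin_unlock(&cil->push_lock);
		schedule();
		spin_lock(&cil->push_lock);
	}
	spin_unlock(&cil->push_lock);
	finish_wait(&cil->push_wait, &wait);
}

/*
 * Push work side: once a new, empty context has been switched in,
 * reset the accounting and unblock all throttled committers at once.
 */
static void example_cil_push_switch_ctx(struct example_cil *cil)
{
	spin_lock(&cil->push_lock);
	cil->space_used = 0;
	spin_unlock(&cil->push_lock);
	wake_up_all(&cil->push_wait);
}

[The point of waking everything in one go is that the stall only lasts
as long as the context switch-over, not the whole CIL flush to disk,
which is where the 200-300 microsecond number above comes from.]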
> >
>
> Right, but if we're blocking transactions on the start of a CIL push and
> the start of a CIL push blocks on completion of the previous push, it's
> possible to indirectly block transactions on completion of a CIL push.
> There is some staggering there compared to blocking a transaction
> directly on a CIL push start to finish, however, so perhaps that keeps
> the delay down in practice.

*nod*

The idea is that this all works as a multi-stage pipeline so there is
always overlap and work being done, even if there are short stalls in
individual work units in the pipeline:

  reservation
  modification
  commit
  CIL insert
  CIL push
  iclog write
  iclog completion
  AIL insert
  AIL push
  metadata IO completion
  AIL removal
  log tail updates

Given that CPUs are generally much faster than the IO part of the
pipeline, short CPU scheduling delays don't introduce stalls in the IO
parts of the pipeline and so largely don't affect throughput. It's
when delays become long enough to fill the buffers of the previous
stage and stall multiple stages that performance goes down.

Ideally, there should be no stalls or pipeline delays after the
transaction has a reservation, as those are bound by the final stage
throughput (log tail updates). What we are doing here is adding more
feedback between smaller parts of the pipeline to manage the flow of
modifications within the pipeline. In this case, it is between the
iclog writes and the CIL insert stages, so that we don't buffer the
entire pipeline's work on the CIL insert stage when CIL pushes stall
on iclog writes due to slowness in the iclog completion and (perhaps)
AIL insertion stages.

So, yeah, we can commit and insert into the CIL while a push is
occurring, and mostly they will happen at the same time, and the CIL
is large enough that a typical push will complete at the same time the
CIL inserts fill a new queue to be pushed....

> > > That sounds more involved than what I was thinking. My thought is that
> > > this throttle is already not predictable or deterministic (i.e. we're
> > > essentially waiting on a scheduler event) and so might not require the
> > > extra complexity of a new waitqueue. It certainly could be the case that
> > > blocking on the entire push is just too long in practice, but since this
> >
> > I've made that sort of change here and measured the regression.
> > Then I discarded it as an unworkable option and looked for other
> > solutions.
>
> Ok. What was the regression that resulted from the longer throttle
> delay, out of curiosity?

Well, I tried two things. The first was throttling by waiting for the
entire CIL push by issuing a flush_work() similar to the
xlog_cil_push_now() code. That introduced substantial pipeline stalls
at the CIL insert point, and performance went way down.

The second was moving the up_write() in xlog_cil_push() down to after
the CIL had been formatted and written to the iclogs entirely. This
didn't wait on the commit record write, so it removed that part of the
serialisation from the throttling. i.e. this stalls incoming inserts
(commits) on the down_read() of the context lock until the push has
processed the CIL. This didn't make much difference, because all the
work is in formatting the thousands of CIL logvecs into the iclogs and
writing them to disk.

Basically, blocking the CIL insert on CIL pushes and iclog IO creates
pipeline stalls and prevents us from committing new changes while the
CIL is flushing. Hence we reduce concurrency in the transaction commit
path, and performance goes down accordingly...

Rough sketches of what those two variants boil down to are appended
below.

Cheers,

Dave.
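[For reference, a rough sketch of what the two rejected variants above
amount to. The names (example_cil, push_work, ctx_lock) are invented
for illustration and this is not the real xlog_cil_push() /
xlog_cil_push_now() code; only flush_work(), queue_work() and the
rwsem calls are standard kernel APIs.]

#include <linux/rwsem.h>
#include <linux/workqueue.h>

/* Illustrative structure only - not the real struct xfs_cil. */
struct example_cil {
	struct work_struct	push_work;	/* background CIL push work */
	struct rw_semaphore	ctx_lock;	/* committers take this for read,
						 * the push takes it for write */
};

/*
 * Rejected variant 1: make the over-limit committer wait for the whole
 * CIL push (format + iclog writes) to complete.  On a 32MB CIL this is
 * tens of milliseconds of dead time for every thread that hits it.
 */
static void example_throttle_by_flush(struct example_cil *cil)
{
	queue_work(system_unbound_wq, &cil->push_work);
	flush_work(&cil->push_work);	/* sleeps until the push work returns */
}

/*
 * Rejected variant 2: keep holding the context lock for write across CIL
 * formatting, so new commits stall on down_read() until the CIL has been
 * written into the iclogs.
 */
static void example_push_hold_ctx_lock(struct example_cil *cil)
{
	down_write(&cil->ctx_lock);
	/*
	 * ... switch to a new context, format the old CIL into the iclogs
	 * and issue the writes while still holding the lock ...
	 */
	up_write(&cil->ctx_lock);	/* only now can new commits proceed */
}

[Both variants block the commit path on work that is bound by log I/O,
which is exactly the pipeline stall described above.]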
--
Dave Chinner
david@xxxxxxxxxxxxx