On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> Hello,
> 
> I was tracking down a regression in the dbench workload on XFS that we
> identified during our performance testing. These are results from one of
> our test machines (a server with 64GB of RAM, 48 CPUs, SATA SSD for the
> test disk):
> 
>                        good                  bad
> Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> Amean    16       454.79 (   0.00%)      456.70 *  -0.42%*
> Amean    32       858.84 (   0.00%)      857.74 (   0.13%)
> Amean    64      1828.72 (   0.00%)     1865.99 *  -2.04%*
> 
> Note that the numbers are actually times to complete the workload, not
> traditional dbench throughput numbers, so lower is better.

How does this work? Dbench is a fixed-time workload - the only
variability from run to run is the time it takes to run the cleanup
phase. Is this some hacked-around version of dbench?

> Eventually I have tracked down the problem to commit bad77c375e8d
> ("xfs: CIL checkpoint flushes caches unconditionally"). Before this
> commit we submit ~63k cache flush requests during the dbench run, after
> this commit we submit ~150k cache flush requests. And the additional
> cache flushes are coming from xlog_cil_push_work().

Yup, that was a known trade-off with this series of changes to the
REQ_FLUSH/REQ_FUA behaviour of the CIL - fsync-heavy workloads could see
higher cache flush counts, and low concurrency rates would see that as
increased runtime because there isn't the journal write load needed to
completely mitigate the impact of more frequent cache flushes.

The trade-off is, OTOH, that other workloads see 10-1000x lower cache
flush rates when streaming large CIL checkpoints through the log as fast
as it can issue sequential IO to the disk. e.g. a 32MB CIL threshold
(>256MB journal) with a 32KB iclogbuf size (the default) on an async
workload will write ~1000 iclogs to flush a 32MB CIL checkpoint. Prior
to this series, every one of those iclog writes would be issued as
REQ_FLUSH|REQ_FUA. After this patch series, we do a REQ_FLUSH from
xfs_flush_bdev_async(), the first iclog is REQ_FUA, and the last iclog
is REQ_FLUSH|REQ_FUA. IOW, we reduce the number of FUA + cache flushes
by 3 orders of magnitude for these cases....

There's a heap more detail on the changes and the complexity in the
commit message for the commit two patches further along in that series,
where these benefits are realised - see commit eef983ffeae7 ("xfs:
journal IO cache flush reductions") for more details.

When doing this work, I didn't count cache flushes. What I looked at was
the number of log forces vs the number of sleeps waiting on log forces
vs log writes vs the number of stalls waiting for log writes. These
numbers showed improvements across the board, so any increase in
overhead from physical cache flushes was not reflected in the throughput
increases I was measuring at the "fsync drives log forces" level.

> The reason, as far as I understand it, is that xlog_cil_push_work()
> never actually ends up writing the iclog (I can see this in the traces)
> because it is writing just very small amounts (my debugging shows
> xlog_cil_push_work() tends to add 300-1000 bytes to the iclog, 4000
> bytes is the largest number I've seen) and very frequent fsync(2) calls
> from dbench always end up forcing the iclog before it gets filled. So
> the cache flushes issued by xlog_cil_push_work() are just pointless
> overhead for this workload AFAIU.

It's not quite that straightforward.
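To put a number on that flag pattern for the large-checkpoint case,
here's a toy userspace model - not the actual XFS code; the flag values
and helper functions below are made up purely for illustration - of how
the flush/FUA flags land on the ~1000 iclog writes of a single 32MB
checkpoint before and after the series:

/*
 * Toy model of the iclog flush/FUA flag selection described above.
 * The flag values and helpers are stand-ins, not the kernel's.
 */
#include <stdio.h>

#define REQ_FLUSH	(1 << 0)
#define REQ_FUA		(1 << 1)

/* Old behaviour: every iclog in the checkpoint is FLUSH|FUA. */
static unsigned int old_flags(int idx, int count)
{
	(void)idx; (void)count;
	return REQ_FLUSH | REQ_FUA;
}

/*
 * New behaviour as described above: one cache flush is issued up front
 * (async, from xfs_flush_bdev_async()), the first iclog only needs FUA,
 * intermediate iclogs need neither, and the commit record (last iclog)
 * is FLUSH|FUA.
 */
static unsigned int new_flags(int idx, int count)
{
	if (idx == count - 1)
		return REQ_FLUSH | REQ_FUA;
	if (idx == 0)
		return REQ_FUA;
	return 0;
}

int main(void)
{
	int count = 1000;		/* ~32MB CIL / 32KB iclogs */
	int old_io = 0, new_io = 1;	/* +1 for the async bdev flush */

	for (int i = 0; i < count; i++) {
		if (old_flags(i, count))
			old_io++;
		if (new_flags(i, count))
			new_io++;
	}
	printf("flush/FUA tagged IOs: old=%d new=%d\n", old_io, new_io);
	return 0;
}

That's roughly the 1000 -> 3 reduction for streaming workloads. The
dbench case is the opposite extreme - one partly filled iclog per
checkpoint - and that's where the block layer behaviour matters.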
Keep in mind that the block layer is supposed to merge new flush
requests that occur while there is still a flush in progress. Hence the
only time this async flush should cause extra flush requests to
physically occur is if you have storage that either ignores flush
requests (in which case we don't care because bio_submit() aborts real
quick) or is really, really fast, so cache flush requests complete
before we start hitting the block layer merge case or slowing down
other IO. If storage is slow and there's any amount of concurrency,
then we're going to be waiting on merged flush requests in the block
layer, so the impact is fairly well bound there, too.

Hence cache flush latency is only going to impact very low concurrency
workloads where any additional wait time directly translates to reduced
throughput. That's pretty much what your numbers indicate, too.

> Is there a way we could help this? I had some idea like calling
> xfs_flush_bdev_async() only once we find enough items while walking
> the cil->xc_cil list that we think an iclog write is likely.

That might be possible, but I don't see that a list walk can determine
this effectively. I've been putting off trying to optimise this stuff
because the infrastructure needed to make decisions like this
efficiently is still backed up waiting on merge. We need the
xlog_write() rework merged, because this patch:

https://lore.kernel.org/linux-xfs/20220309052937.2696447-1-david@xxxxxxxxxxxxx/T/#mf335766c6c17dbf9c438ed30fa0b7e15d355a6be

provides the information we need to determine how many iclogs we are
likely to need, and hence whether to make a sync or async flush. The
resultant rewrite of xlog_write() also gives us a callback for the
first iclog in any write to set state on that iclog appropriate for the
current context. That can easily be used to set the XLOG_ICL_NEED_FLUSH
flag if an async flush has not been completed.

> This should still submit it rather early to provide the latency
> advantage. Otherwise postpone the flush to the moment we know we are
> going to flush the iclog to save pointless flushes. But we would have
> to record whether the flush happened or not in the iclog and it would
> all get a bit hairy...

I think we can just set the NEED_FLUSH flag appropriately.

However, given all this, I'm wondering if the async cache flush was
really a case of premature optimisation. That is, we don't really gain
anything by reducing the flush latency of the first iclog write when we
are writing 100-1000 iclogs before the commit record, and it can be
harmful to some workloads by issuing more flushes than we need to.

So perhaps the right thing to do is just get rid of it and always mark
the first iclog in a checkpoint as NEED_FLUSH....

> I'm definitely not an expert in the XFS logging code, so I'm just
> writing up my current findings here in case people have some ideas.

Thanks for running the tests and doing some investigation, Jan. We
should be able to mitigate some of this impact, so let me run a couple
of experiments here and I'll get back to you.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
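To make the sync-vs-async flush decision discussed above concrete,
here's a minimal sketch of gating the early flush on the expected
checkpoint size. It's plain userspace C with made-up names (struct
checkpoint, cil_bytes, start_checkpoint()), not the real
xlog_cil_push_work() logic, and a real version would use the iclog
count that the reworked xlog_write() can provide:

/*
 * Sketch: only issue the early async cache flush when the checkpoint
 * will span more than one iclog; otherwise just mark the first iclog
 * as needing a flush. All names and the flag value are stand-ins.
 */
#include <stdio.h>

#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* stand-in value */

struct checkpoint {
	unsigned long	cil_bytes;	/* bytes queued on the CIL */
	unsigned long	iclog_size;	/* iclog buffer size, e.g. 32KB */
	unsigned int	first_iclog_flags;
};

static void start_checkpoint(struct checkpoint *cp)
{
	unsigned long iclogs = (cp->cil_bytes + cp->iclog_size - 1) /
				cp->iclog_size;

	if (iclogs > 1) {
		/*
		 * Large checkpoint: issue the cache flush early and
		 * asynchronously so its latency overlaps with formatting
		 * and writing the first iclogs.
		 */
		printf("%lu iclogs: issue async bdev flush up front\n",
		       iclogs);
	} else {
		/*
		 * Small checkpoint (the fsync-driven dbench case): skip
		 * the early flush and let the single iclog carry it.
		 */
		cp->first_iclog_flags |= XLOG_ICL_NEED_FLUSH;
		printf("%lu iclog: no early flush, mark it NEED_FLUSH\n",
		       iclogs);
	}
}

int main(void)
{
	struct checkpoint small = { .cil_bytes = 1000,       .iclog_size = 32768 };
	struct checkpoint large = { .cil_bytes = 32UL << 20, .iclog_size = 32768 };

	start_checkpoint(&small);
	start_checkpoint(&large);
	return 0;
}

Whether the cutover should be "more than one iclog" or something larger
is an open question; the point is only that the decision needs an
estimate of the checkpoint's iclog count, which is exactly what the
xlog_write() rework provides.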