On Wed, 2022-03-30 at 12:10 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Jan Kara reported a performance regression in dbench that he
> bisected down to commit bad77c375e8d ("xfs: CIL checkpoint
> flushes caches unconditionally").
>
> Whilst developing the journal flush/fua optimisations this cache was
> part of, it appeared to make a significant difference to
> performance. However, now that this patchset has settled and all the
> correctness issues fixed, there does not appear to be any
> significant performance benefit to asynchronous cache flushes.
>
> In fact, the opposite is true on some storage types and workloads,
> where additional cache flushes that can occur from fsync heavy
> workloads have measurable and significant impact on overall
> throughput.
>
> Local dbench testing shows little difference on dbench runs with
> sync vs async cache flushes on either fast or slow SSD storage, and
> no difference in streaming concurrent async transaction workloads
> like fs-mark.
>
> Fast NVMe storage:
>
> From `dbench -t 30`, CIL scale:
>
> clients         async                   sync
>                 BW      Latency         BW      Latency
> 1               935.18  0.855           915.64  0.903
> 8               2404.51 6.873           2341.77 6.511
> 16              3003.42 6.460           2931.57 6.529
> 32              3697.23 7.939           3596.28 7.894
> 128             7237.43 15.495          7217.74 11.588
> 512             5079.24 90.587          5167.08 95.822
>
> fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
>
>           create          chown           unlink
> async     1m41s           1m16s           2m03s
> sync      1m40s           1m19s           1m54s
>
> Slower SATA SSD storage:
>
> From `dbench -t 30`, CIL scale:
>
> clients         async                   sync
>                 BW      Latency         BW      Latency
> 1               78.59   15.792          83.78   10.729
> 8               367.88  92.067          404.63  59.943
> 16              564.51  72.524          602.71  76.089
> 32              831.66  105.984         870.26  110.482
> 128             1659.76 102.969         1624.73 91.356
> 512             2135.91 223.054         2603.07 161.160
>
> fsmark, 16 threads, create w/32k logbsize
>
>           create          unlink
> async     5m06s           4m15s
> sync      5m00s           4m22s
>
> And on Jan's test machine:
>
>                 5.18-rc8-vanilla        5.18-rc8-patched
> Amean     1       71.22 (   0.00%)        64.94 *   8.81%*
> Amean     2       93.03 (   0.00%)        84.80 *   8.85%*
> Amean     4      150.54 (   0.00%)       137.51 *   8.66%*
> Amean     8      252.53 (   0.00%)       242.24 *   4.08%*
> Amean     16     454.13 (   0.00%)       439.08 *   3.31%*
> Amean     32     835.24 (   0.00%)       829.74 *   0.66%*
> Amean     64    1740.59 (   0.00%)      1686.73 *   3.09%*
>
> Performance and cache flush behaviour is restored to pre-regression
> levels.
>
> As such, we can now consider the async cache flush mechanism an
> unnecessary exercise in premature optimisation and hence we can
> now remove it and the infrastructure it requires completely.

It turns out this regresses umount latency on DAX filesystems. Sravani
reached out to me after noticing that v5.18 takes up to 10 minutes to
complete umount. On a whim I guessed this patch, and after reverting
commit 919edbadebe1 ("xfs: drop async cache flushes from CIL commits")
on top of v5.18, umount time goes back down to v5.17 levels. Perf
confirms that all of that CPU time is being spent in
arch_wb_cache_pmem().

This likely means that rather than amortizing that same latency
periodically throughout the workload run, it is all being delayed
until umount. I assume this latency would also show up without DAX if
the page cache is now allowed to keep growing, or is there some other
signal that triggers async flushes in that case?

On the one hand it makes sense that if a workload dirties data and
delays syncing, it is explicitly asking to pay that cost later.
For DAX it is unfortunate that, by the time this late flush comes
around, the dirty data has likely long since left the CPU cache. It is
even worse if the workload had been diligently cleaning caches itself,
but the fs cannot trust that userspace is doing that correctly.
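
For anyone wondering why that umount-time writeback costs minutes of
CPU, here is a minimal user-space sketch of a per-cacheline writeback
loop over a pmem-style range. This is purely illustrative and assumes
an x86-64 CPU/assembler with CLWB support; it is not the kernel's
arch_wb_cache_pmem() implementation, and the function names are made
up for the example. The point is only that the work is proportional to
the size of the dirty range, one 64-byte line at a time:

	/*
	 * Illustrative sketch only -- not the kernel code. Shows why
	 * writing back a large dirty pmem range is O(size): every
	 * 64-byte cacheline must be pushed out individually.
	 */
	#include <stddef.h>
	#include <stdint.h>
	#include <stdlib.h>

	#define CACHELINE_SIZE	64

	static inline void writeback_cacheline(void *addr)
	{
		/* CLWB writes the line back without necessarily evicting it. */
		asm volatile("clwb %0" : "+m" (*(volatile char *)addr));
	}

	static void writeback_range(void *addr, size_t size)
	{
		uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE_SIZE - 1);
		uintptr_t end = (uintptr_t)addr + size;

		for (; p < end; p += CACHELINE_SIZE)
			writeback_cacheline((void *)p);

		/* Order the writebacks before anything that depends on them. */
		asm volatile("sfence" ::: "memory");
	}

	int main(void)
	{
		size_t size = 1UL << 30;	/* 1GiB => ~16 million CLWBs */
		char *buf = malloc(size);

		if (!buf)
			return 1;
		for (size_t i = 0; i < size; i += CACHELINE_SIZE)
			buf[i] = 1;		/* dirty every line */
		writeback_range(buf, size);	/* umount-style bulk flush */
		free(buf);
		return 0;
	}

Multiply that loop by however much dirty DAX data has accumulated since
the last flush and a multi-minute umount does not seem surprising.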