On Wed, 2022-03-30 at 12:10 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Jan Kara reported a performance regression in dbench that he
> bisected down to commit bad77c375e8d ("xfs: CIL checkpoint
> flushes caches unconditionally").
>
> Whilst developing the journal flush/fua optimisations this cache was
> part of, it appeared to make a significant difference to
> performance. However, now that this patchset has settled and all the
> correctness issues fixed, there does not appear to be any
> significant performance benefit to asynchronous cache flushes.
>
> In fact, the opposite is true on some storage types and workloads,
> where additional cache flushes that can occur from fsync heavy
> workloads have measurable and significant impact on overall
> throughput.
>
> Local dbench testing shows little difference on dbench runs with
> sync vs async cache flushes on either fast or slow SSD storage, and
> no difference in streaming concurrent async transaction workloads
> like fs-mark.
>
> Fast NVMe storage:
>
> From `dbench -t 30`, CIL scale:
>
> clients         async                   sync
>                 BW      Latency         BW      Latency
> 1               935.18  0.855           915.64  0.903
> 8               2404.51 6.873           2341.77 6.511
> 16              3003.42 6.460           2931.57 6.529
> 32              3697.23 7.939           3596.28 7.894
> 128             7237.43 15.495          7217.74 11.588
> 512             5079.24 90.587          5167.08 95.822
>
> fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
>
>           create          chown           unlink
> async     1m41s           1m16s           2m03s
> sync      1m40s           1m19s           1m54s
>
> Slower SATA SSD storage:
>
> From `dbench -t 30`, CIL scale:
>
> clients         async                   sync
>                 BW      Latency         BW      Latency
> 1               78.59   15.792          83.78   10.729
> 8               367.88  92.067          404.63  59.943
> 16              564.51  72.524          602.71  76.089
> 32              831.66  105.984         870.26  110.482
> 128             1659.76 102.969         1624.73 91.356
> 512             2135.91 223.054         2603.07 161.160
>
> fsmark, 16 threads, create w/32k logbsize
>
>           create          unlink
> async     5m06s           4m15s
> sync      5m00s           4m22s
>
> And on Jan's test machine:
>
>                 5.18-rc8-vanilla        5.18-rc8-patched
> Amean     1       71.22 (   0.00%)        64.94 *   8.81%*
> Amean     2       93.03 (   0.00%)        84.80 *   8.85%*
> Amean     4      150.54 (   0.00%)       137.51 *   8.66%*
> Amean     8      252.53 (   0.00%)       242.24 *   4.08%*
> Amean     16     454.13 (   0.00%)       439.08 *   3.31%*
> Amean     32     835.24 (   0.00%)       829.74 *   0.66%*
> Amean     64    1740.59 (   0.00%)      1686.73 *   3.09%*
>
> Performance and cache flush behaviour is restored to pre-regression
> levels.
>
> As such, we can now consider the async cache flush mechanism an
> unnecessary exercise in premature optimisation and hence we can
> now remove it and the infrastructure it requires completely.

It turns out this regresses umount latency on DAX filesystems. Sravani
reached out to me after noticing that v5.18 takes up to 10 minutes to
complete umount. On a whim I guessed this patch, and after reverting
commit 919edbadebe1 ("xfs: drop async cache flushes from CIL commits")
on top of v5.18, umount time goes back down to v5.17 levels. Perf
confirms that all of that CPU time is being spent in
arch_wb_cache_pmem().

This likely means that rather than amortizing that same latency
periodically throughout the workload run, it is all being delayed
until umount. I assume this latency would also show up without DAX if
the page cache is now allowed to keep growing, or is there some other
signal that triggers async flushes in that case?

On the one hand it makes sense that if a workload dirties data and
delays syncing, it is explicitly asking to pay that cost later.
For DAX it is unfortunate that, by the time this late flush comes
around, the dirty data has likely long since left the CPU cache. It is
even worse if the workload had been diligently cleaning caches itself,
but the fs cannot trust that userspace is doing that correctly.
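
For anyone wondering why that umount-time writeback costs minutes of
CPU, here is a minimal user-space sketch of a per-cacheline writeback
loop over a pmem-style range. This is purely illustrative and assumes
an x86-64 CPU/assembler with CLWB support; it is not the kernel's
arch_wb_cache_pmem() implementation, and the function names are made
up for the example. The point is only that the work is proportional to
the size of the dirty range, one 64-byte line at a time:

	/*
	 * Illustrative sketch only -- not the kernel code. Shows why
	 * writing back a large dirty pmem range is O(size): every
	 * 64-byte cacheline must be pushed out individually.
	 */
	#include <stddef.h>
	#include <stdint.h>
	#include <stdlib.h>

	#define CACHELINE_SIZE	64

	static inline void writeback_cacheline(void *addr)
	{
		/* CLWB writes the line back without necessarily evicting it. */
		asm volatile("clwb %0" : "+m" (*(volatile char *)addr));
	}

	static void writeback_range(void *addr, size_t size)
	{
		uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE_SIZE - 1);
		uintptr_t end = (uintptr_t)addr + size;

		for (; p < end; p += CACHELINE_SIZE)
			writeback_cacheline((void *)p);

		/* Order the writebacks before anything that depends on them. */
		asm volatile("sfence" ::: "memory");
	}

	int main(void)
	{
		size_t size = 1UL << 30;	/* 1GiB => ~16 million CLWBs */
		char *buf = malloc(size);

		if (!buf)
			return 1;
		for (size_t i = 0; i < size; i += CACHELINE_SIZE)
			buf[i] = 1;		/* dirty every line */
		writeback_range(buf, size);	/* umount-style bulk flush */
		free(buf);
		return 0;
	}

Multiply that loop by however much dirty DAX data has accumulated since
the last flush and a multi-minute umount does not seem surprising.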