Hello,

I was tracking down a regression in the dbench workload on XFS that we
identified during our performance testing. These are results from one of
our test machines (a server with 64GB of RAM, 48 CPUs, and a SATA SSD for
the test disk):

                         good                   bad
Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
Amean    16       454.79 (   0.00%)      456.70 *  -0.42%*
Amean    32       858.84 (   0.00%)      857.74 (   0.13%)
Amean    64      1828.72 (   0.00%)     1865.99 *  -2.04%*

Note that the numbers are actually times to complete the workload, not the
traditional dbench throughput numbers, so lower is better. Eventually I
tracked the problem down to commit bad77c375e8d ("xfs: CIL checkpoint
flushes caches unconditionally"). Before this commit we submit ~63k cache
flush requests during the dbench run; after this commit we submit ~150k
cache flush requests, and the additional cache flushes are coming from
xlog_cil_push_work(). As far as I understand it, the reason is that
xlog_cil_push_work() never actually ends up writing the iclog (I can see
this in the traces) because it writes only very small amounts (my
debugging shows xlog_cil_push_work() tends to add 300-1000 bytes to the
iclog, 4000 bytes being the largest number I've seen), and the very
frequent fsync(2) calls from dbench always end up forcing the iclog out
before it gets filled. So AFAIU the cache flushes issued by
xlog_cil_push_work() are just pointless overhead for this workload.

Is there a way we could avoid this? One idea I had: call
xfs_flush_bdev_async() only once we have found enough items while walking
the cil->xc_cil list that an iclog write looks likely. That would still
submit the flush rather early and preserve the latency advantage.
Otherwise, postpone the flush to the moment we know we are going to write
out the iclog, to save the pointless flushes. But we would have to record
in the iclog whether the flush has happened or not, and it would all get a
bit hairy... (A rough sketch of the idea is appended at the end of this
mail.) I'm definitely not an expert in the XFS logging code, so I'm just
writing up my current findings here in case people have some ideas.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
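
To make the idea above a bit more concrete, here is a small self-contained
user-space model of the heuristic I have in mind. This is emphatically not
XFS code: the iclog_model struct, the 50% threshold, and all the *_model()
helpers are invented purely for illustration; only xlog_cil_push_work(),
xfs_flush_bdev_async() and cil->xc_cil are real kernel names, and the real
code paths have different signatures and semantics than what is modeled
here.

/*
 * User-space model of the proposed heuristic -- NOT actual XFS code.
 * Build with: cc -o flush_model flush_model.c
 */
#include <stdbool.h>
#include <stdio.h>

#define ICLOG_SIZE      32768                   /* assumed iclog buffer size */
#define EARLY_FLUSH_MIN (ICLOG_SIZE / 2)        /* hypothetical threshold */

struct iclog_model {
        size_t bytes;           /* bytes accumulated in the iclog so far */
        bool flush_issued;      /* was a cache flush submitted already? */
};

/* Stand-in for xfs_flush_bdev_async(): just report the flush. */
static void flush_bdev_async_model(const char *why)
{
        printf("cache flush submitted (%s)\n", why);
}

/*
 * Model of the CIL push path: account one checkpoint's worth of data.
 * Issue the flush early only when the pending bytes make an iclog write
 * likely; otherwise leave flush_issued false so the flush is done (once)
 * when the iclog is actually written out.
 */
static void cil_push_model(struct iclog_model *ic, size_t checkpoint_bytes)
{
        ic->bytes += checkpoint_bytes;
        if (!ic->flush_issued && ic->bytes >= EARLY_FLUSH_MIN) {
                flush_bdev_async_model("early, iclog write likely");
                ic->flush_issued = true;
        }
}

/* Model of forcing the iclog out, e.g. on behalf of fsync(2). */
static void iclog_write_model(struct iclog_model *ic)
{
        if (!ic->flush_issued)
                flush_bdev_async_model("deferred to iclog write");
        printf("iclog written: %zu bytes\n", ic->bytes);
        ic->bytes = 0;
        ic->flush_issued = false;
}

int main(void)
{
        struct iclog_model ic = { 0 };

        /* Tiny CIL pushes (like dbench's 300-1000 byte checkpoints) no
         * longer trigger a cache flush of their own... */
        cil_push_model(&ic, 600);
        cil_push_model(&ic, 400);
        /* ...the flush happens once, when fsync(2) forces the iclog. */
        iclog_write_model(&ic);

        /* A large push still flushes early to keep the latency win. */
        cil_push_model(&ic, 20000);
        iclog_write_model(&ic);
        return 0;
}

With something like this, the dbench pattern of many tiny checkpoints
would collapse to one flush per iclog write, while a large checkpoint
would still get its flush submitted early. Whether the threshold should
be a fraction of the iclog size or something smarter, I don't know.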