On Tue, Jan 28, 2020 at 01:19:36AM +0000, Mel Gorman wrote: > > <SNIP> > > After all this, I have two questions that would help me understand > > if this is what you are seeing: > > > > 1. to confirm: does removing just the WQ_UNBOUND from the CIL push > > workqueue (as added in 8ab39f11d974) make the regression go away? > > > > I'll have to check in the morning. Around the v5.4 development timeframe, > I'm definite that reverting the patch helped but that was not an option > given that it's fixing a correctness issue. > This is a comparison of the baseline kernel (tip at the time I started), the proposed fix and a revert. The revert was not clean but I do not believe it matters dbench4 Loadfile Execution Time 5.5.0-rc7 5.5.0-rc7 5.5.0-rc7 tipsched-20200124 kworkerstack-v1r2 revert-XFS-wq-v1r2 Amean 1 58.69 ( 0.00%) 30.21 * 48.53%* 47.48 * 19.10%* Amean 2 60.90 ( 0.00%) 35.29 * 42.05%* 51.13 * 16.04%* Amean 4 66.77 ( 0.00%) 46.55 * 30.28%* 59.54 * 10.82%* Amean 8 81.41 ( 0.00%) 68.46 * 15.91%* 77.25 * 5.11%* Amean 16 113.29 ( 0.00%) 107.79 * 4.85%* 112.33 * 0.85%* Amean 32 199.10 ( 0.00%) 198.22 * 0.44%* 200.31 * -0.61%* Amean 64 478.99 ( 0.00%) 477.06 * 0.40%* 482.17 * -0.66%* Amean 128 1345.26 ( 0.00%) 1372.64 * -2.04%* 1368.94 * -1.76%* Stddev 1 2.64 ( 0.00%) 4.17 ( -58.08%) 5.01 ( -89.89%) Stddev 2 4.35 ( 0.00%) 5.38 ( -23.73%) 4.48 ( -2.90%) Stddev 4 6.77 ( 0.00%) 6.56 ( 3.00%) 7.40 ( -9.40%) Stddev 8 11.61 ( 0.00%) 10.91 ( 6.04%) 11.62 ( -0.05%) Stddev 16 18.63 ( 0.00%) 19.19 ( -3.01%) 19.12 ( -2.66%) Stddev 32 38.71 ( 0.00%) 38.30 ( 1.06%) 38.82 ( -0.28%) Stddev 64 100.28 ( 0.00%) 91.24 ( 9.02%) 95.68 ( 4.59%) Stddev 128 186.87 ( 0.00%) 160.34 ( 14.20%) 170.85 ( 8.57%) According to this, commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log recovery") did introduce some unintended behaviour. The fix actually performs better than a revert with the obvious benefit that it does not reintroduce the functional breakage (log starvation) that the commit originally fixed. I still think that XFS is not the problem here, it's just the messenger. The functional fix, delegating work to kworkers running on the same CPU and blk-mq delivering IO completions to the same CPU as the IO issuer are all sane decisions IMO. I do not think that adjusting any of them to wakeup the task on a new CPU is sensible due to the loss of data cache locality and potential snags with power management when waking a CPU from idle state. Peter, Ingo and Vincent -- I know the timing is bad due to the merge window but do you have any thoughts on allowing select_idle_sibling to stack a wakee task on the same CPU as a waker in this specific case? -- Mel Gorman SUSE Labs