Re: [PATCH] [Regression, v5.0] mm: boosted kswapd reclaim b0rks system cache balance

On Thu, Aug 08, 2019 at 08:32:41AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 09:56:15PM +0100, Mel Gorman wrote:
> > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > <SNIP>
> > >
> > > On that basis, it may justify ripping out the may_shrinkslab logic
> > > everywhere. The downside is that some microbenchmarks will notice.
> > > Specifically IO benchmarks that fill memory and reread (particularly
> > > rereading the metadata via any inode operation) may show reduced
> > > results. Such benchmarks can be strongly affected by whether the inode
> > > information is still memory resident and watermark boosting reduces
> > > the chances the data is still resident in memory. Technically still a
> > > regression but a tunable one.
> > > 
> > > Hence the following "it builds" patch that has zero supporting data on
> > > whether it's a good idea or not.
> > > 
> > 
> > This is a more complete version of the same patch that summarises the
> > problem and includes data from my own testing
> ....
> > A fsmark benchmark configuration was constructed similar to
> > what Dave reported and is codified by the mmtest configuration
> > config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket machine
> > to avoid dealing with NUMA-related issues and the timing of reclaim. The
> > storage was an SSD Samsung Evo and a fresh XFS filesystem was used for
> > the test data.
> 
> Have you run fstrim on that drive recently? I'm running these tests
> on a 960 EVO ssd, and when I started looking at shrinkers 3 weeks
> ago I had all sorts of whacky performance problems and inconsistent
> results. Turned out there were all sorts of random long IO latencies
> occurring (in the hundreds of milliseconds) because the drive was
> constantly running garbage collection to free up space. As a result
> it was both blocking on GC and thermal throttling under these fsmark
> workloads.
> 

No, I was under the impression that making a new filesystem typically
trims it as well. Maybe that's only true for some filesystems
(e.g. ext4), or maybe it's completely wrong.

> I made a new XFS filesystem on it (lazy man's rm -rf *),

Ah, all IO tests I do make a new filesystem. I know there is the whole
problem of filesystem aging but I've yet to come across a sensible,
reproducible aging methodology that two people can agree on.

> then ran
> fstrim on it to tell the drive all the space is free. Drive temps
> dropped 30C immediately, and all of the whacky performance anomalies
> went away. I now fstrim the drive in my vm startup scripts before
> each test run, and it's giving consistent results again.
> 

I'll replicate that if making a new filesystem is not guaranteed to
trim. It'll muck up historical data but that happens to me every so
often anyway.
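
For reference, fstrim boils down to the FITRIM ioctl issued against the
mountpoint, so it's easy to fold into a test harness if mkfs turns out
not to discard. A minimal standalone version, roughly equivalent to
"fstrim <mountpoint>" with the error handling pared back, would look
something like this (untested sketch):

/* trim.c -- roughly what fstrim does: issue FITRIM over the whole fs.
 * Build: gcc -o trim trim.c, run as root: ./trim <mountpoint>
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

int main(int argc, char **argv)
{
	struct fstrim_range range = {
		.start = 0,
		.len = UINT64_MAX,	/* cover the whole filesystem */
		.minlen = 0,		/* let the fs pick a minimum extent */
	};
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		close(fd);
		return 1;
	}

	/* the kernel writes back the number of bytes actually trimmed */
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}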

> > It is likely that the test configuration is not a proper match for Dave's
> > test as the results are different in terms of performance. However, my
> > configuration reports fsmark performance every 10% of memory worth of
> > files and I suspect Dave's configuration reported Files/sec when memory
> > was already full. THP was enabled for mine, disabled for Dave's and
> > probably a whole load of other methodology differences that rarely
> > get recorded properly.
> 
> Yup, like I forgot to mention that my test system is using a 4-node
> fakenuma setup (i.e. 4 nodes, 4GB RAM and 4 CPUs per node, so
> there are 4 separate kswapd's doing concurrent reclaim). That
> changes reclaim patterns as well.
> 

Good to know. In this particular case, I don't think I need to exactly
replicate what you have, given that the slab reclaim behaviour is
definitely more consistent and the slab/pagecache ratios are
predictable.
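
(For completeness: the ratio I'm referring to is just Slab versus
Cached as reported by /proc/meminfo, sampled over the length of the
run. The mmtests monitors capture it for me, but something as simple
as the sketch below -- not the mmtests monitor, just an illustration --
shows the same thing.)

/* slabratio.c -- print the Slab:Cached ratio from /proc/meminfo once a
 * second. Build: gcc -o slabratio slabratio.c
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the value in kB of a /proc/meminfo field, or 0 if not found. */
static unsigned long meminfo_kb(const char *field)
{
	char line[256];
	unsigned long val = 0;
	size_t len = strlen(field);
	FILE *fp = fopen("/proc/meminfo", "r");

	if (!fp)
		return 0;
	while (fgets(line, sizeof(line), fp)) {
		if (!strncmp(line, field, len) && line[len] == ':') {
			sscanf(line + len + 1, "%lu", &val);
			break;
		}
	}
	fclose(fp);
	return val;
}

int main(void)
{
	for (;;) {
		unsigned long slab = meminfo_kb("Slab");
		unsigned long cache = meminfo_kb("Cached");

		if (cache)
			printf("Slab %8lu kB  Cached %8lu kB  ratio %.2f\n",
			       slab, cache, (double)slab / cache);
		sleep(1);
	}
	return 0;
}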

> 
> > fsmark
> >                                    5.3.0-rc3              5.3.0-rc3
> >                                      vanilla          shrinker-v1r1
> > Min       1-files/sec     5181.70 (   0.00%)     3204.20 ( -38.16%)
> > 1st-qrtle 1-files/sec    14877.10 (   0.00%)     6596.90 ( -55.66%)
> > 2nd-qrtle 1-files/sec     6521.30 (   0.00%)     5707.80 ( -12.47%)
> > 3rd-qrtle 1-files/sec     5614.30 (   0.00%)     5363.80 (  -4.46%)
> > Max-1     1-files/sec    18463.00 (   0.00%)    18479.90 (   0.09%)
> > Max-5     1-files/sec    18028.40 (   0.00%)    17829.00 (  -1.11%)
> > Max-10    1-files/sec    17502.70 (   0.00%)    17080.90 (  -2.41%)
> > Max-90    1-files/sec     5438.80 (   0.00%)     5106.60 (  -6.11%)
> > Max-95    1-files/sec     5390.30 (   0.00%)     5020.40 (  -6.86%)
> > Max-99    1-files/sec     5271.20 (   0.00%)     3376.20 ( -35.95%)
> > Max       1-files/sec    18463.00 (   0.00%)    18479.90 (   0.09%)
> > Hmean     1-files/sec     7459.11 (   0.00%)     6249.49 ( -16.22%)
> > Stddev    1-files/sec     4733.16 (   0.00%)     4362.10 (   7.84%)
> > CoeffVar  1-files/sec       51.66 (   0.00%)       57.49 ( -11.29%)
> > BHmean-99 1-files/sec     7515.09 (   0.00%)     6351.81 ( -15.48%)
> > BHmean-95 1-files/sec     7625.39 (   0.00%)     6486.09 ( -14.94%)
> > BHmean-90 1-files/sec     7803.19 (   0.00%)     6588.61 ( -15.57%)
> > BHmean-75 1-files/sec     8518.74 (   0.00%)     6954.25 ( -18.37%)
> > BHmean-50 1-files/sec    10953.31 (   0.00%)     8017.89 ( -26.80%)
> > BHmean-25 1-files/sec    16732.38 (   0.00%)    11739.65 ( -29.84%)
> > 
> >                    5.3.0-rc3   5.3.0-rc3
> >                      vanilla  shrinker-v1r1
> > Duration User          77.29       89.09
> > Duration System      1097.13     1332.86
> > Duration Elapsed     2014.14     2596.39
> 
> I'm not sure we are testing or measuring exactly the same things :)
> 

Probably not.

> > This is showing that fsmark runs slower as a result of this patch but
> > there are other important observations that justify the patch.
> > 
> > 1. With the vanilla kernel, the number of dirty pages in the system
> >    is very low for much of the test. With this patch, dirty pages
> >    are generally kept at 10%, which matches vm.dirty_background_ratio
> >    which is normal expected historical behaviour.
> > 
> > 2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
> >    0.95 for much of the test i.e. Slab is being left alone and dominating
> >    memory consumption. With the patch applied, the ratio varies between
> >    0.35 and 0.45 with the bulk of the measured ratios roughly half way
> >    between those values. This is a different balance to what Dave reported
> >    but it was at least consistent.
> 
> Yeah, the balance is typically a bit different for different configs
> and storage. The trick is getting the balance to be roughly
> consistent across a range of different configs. The fakenuma setup
> also has a significant impact on where the balance is found. And I
> can't remember if the "fixed" memory usage numbers I quoted came
> from a run with my "make XFS inode reclaim nonblocking" patchset or
> not.
> 

Again, I wouldn't sweat too much about it. The generated graphs
definitely showed more consistent behaviour even if the headline
performance was not improved.

> > 3. Slabs are scanned throughout the entire test with the patch applied.
> >    The vanilla kernel has long periods with no scan activity and then
> >    relatively massive spikes.
> > 
> > 4. Overall vmstats are closer to normal expectations
> > 
> > 	                                5.3.0-rc3      5.3.0-rc3
> > 	                                  vanilla  shrinker-v1r1
> > 	Direct pages scanned             60308.00        5226.00
> > 	Kswapd pages scanned          18316110.00    12295574.00
> > 	Kswapd pages reclaimed        13121037.00     7280152.00
> > 	Direct pages reclaimed           11817.00        5226.00
> > 	Kswapd efficiency %                 71.64          59.21
> > 	Kswapd velocity                   9093.76        4735.64
> > 	Direct efficiency %                 19.59         100.00
> > 	Direct velocity                     29.94           2.01
> > 	Page reclaim immediate          247921.00           0.00
> > 	Slabs scanned                 16602344.00    29369536.00
> > 	Direct inode steals               1574.00         800.00
> > 	Kswapd inode steals             130033.00     3968788.00
> > 	Kswapd skipped wait                  0.00           0.00
> 
> That looks a lot better. Patch looks reasonable, though I'm
> interested to know what impact it has on tests you ran in the
> original commit for the boosting.
> 

I'll find out soon enough, but I'm leaning towards the view that kswapd
reclaim should be predictable and that, even if some workloads see a
performance problem as a result, others will see a gain. It'll be a
case of "no matter which way you jump, someone shouts", but kswapd
having spiky, unpredictable behaviour is a recipe for "sometimes my
machine is crap and I've no idea why".
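
(As an aside, for anyone wanting to see the spiky-versus-steady
behaviour for themselves without running the full mmtests monitors:
the slab scan activity quoted above is the slabs_scanned counter in
/proc/vmstat, and sampling its per-second delta is enough to show
whether scanning is continuous or happening in bursts. A rough sketch,
not the mmtests monitor:)

/* scanspike.c -- print the per-second delta of slabs_scanned from
 * /proc/vmstat. Steady kswapd slab reclaim shows a fairly flat series;
 * the vanilla behaviour described above shows long runs of zeroes with
 * occasional large spikes. Build: gcc -o scanspike scanspike.c
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long vmstat_counter(const char *name)
{
	char key[64];
	unsigned long long val;
	unsigned long long found = 0;
	FILE *fp = fopen("/proc/vmstat", "r");

	if (!fp)
		return 0;
	while (fscanf(fp, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, name)) {
			found = val;
			break;
		}
	}
	fclose(fp);
	return found;
}

int main(void)
{
	unsigned long long prev = vmstat_counter("slabs_scanned");

	for (;;) {
		unsigned long long cur;

		sleep(1);
		cur = vmstat_counter("slabs_scanned");
		printf("slabs_scanned +%llu\n", cur - prev);
		prev = cur;
	}
	return 0;
}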

-- 
Mel Gorman
SUSE Labs


