Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

Mel Gorman <mel@xxxxxxxxx> · Tue, 31 Aug 2010 16:02:07 +0100

On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > > record that the wait was unnecessary.
> > > > > 
> > > > > I am not convinced that unnecessary is the right word.  On a workload
> > > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > > the VM regressing both in time and in reclaiming the right pages when
> > > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > > one in __alloc_pages_slowpath and the other one in
> > > > > do_try_to_free_pages).
> > > > > 
> > > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > > while kswapd can make progress seemed to do a better job for that
> > > > > load.
> > > > > 
> > > > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > > > nice if somebody had an idea.
> > > > > 
> > > > 
> > > > There is a possibility that the behaviour in that case was due to flusher
> > > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > > in an inefficient manner. So the stall is stupid but happens to work out
> > > > well because flusher threads get the chance to do work.
> > > 
> > > The workload was accessing a large sparse-file through mmap, so there
> > > wasn't much IO in the first place.
> > > 
> > 
> > Then waiting on congestion was the totally wrong thing to do. We were
> > effectively calling sleep(HZ/10) and magically this was helping in some
> > undefined manner. Do you know *which* called of congestion_wait() was
> > the most important to you?
> 
> Removing congestion_wait() in do_try_to_free_pages() definitely
> worsens reclaim behaviour for this workload:
> 
> 1. wallclock time of the testrun increases by 11%
> 
> 2. the scanners do a worse job and go for the wrong zone:
> 
> -pgalloc_dma 79597
> -pgalloc_dma32 134465902
> +pgalloc_dma 297089
> +pgalloc_dma32 134247237
> 
> -pgsteal_dma 77501
> -pgsteal_dma32 133939446
> +pgsteal_dma 294998
> +pgsteal_dma32 133722312
> 
> -pgscan_kswapd_dma 145897
> -pgscan_kswapd_dma32 266141381
> +pgscan_kswapd_dma 287981
> +pgscan_kswapd_dma32 186647637
> 
> -pgscan_direct_dma 9666
> -pgscan_direct_dma32 1758655
> +pgscan_direct_dma 302495
> +pgscan_direct_dma32 80947179
> 
> -pageoutrun 1768531
> -allocstall 614
> +pageoutrun 1927451
> +allocstall 8566
> 
> I attached the full vmstat contents below.  Also the test program,
> which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
> 

Excellent stuff. I didn't look at your vmstat output because it was for
an old patch and you have already highlighted the problems related to
the workload. Chances are, I'd just reach the same conclusions. What is
interesting is your workload.

> > > > > So personally I think it's a good idea to get an insight on the use of
> > > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > > correctly waits for bdi congestion.
> > > > > 
> > > > 
> > > > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > > > IO (one of which involved no writes) were due to calling congestion_wait
> > > > and waiting the full timeout where no writes are going on.
> > > 
> > > Oh, I am in full agreement here!  Removing those congestion_wait() as
> > > described above showed a reduction in peak latency.  The dilemma is
> > > only that it increased the overall walltime of the load.
> > > 
> > 
> > Do you know why because leaving in random sleeps() hardly seems to be
> > the right approach?
> 
> I am still trying to find out what's going wrong.
> 
> > > And the scanning behaviour deteriorated, as in having increased
> > > scanning pressure on other zones than the unpatched kernel did.
> > > 
> > 
> > Probably because it was scanning more but not finding what it needed.
> > There is a condition other than congestion it is having trouble with. In
> > some respects, I think if we change congestion_wait() as I propose,
> > we may see a case where CPU usage is higher because it's now
> > encountering the unspecified reclaim problem we have.
> 
> Exactly.
> 
> > > So I think very much that we need a fix.  congestion_wait() causes
> > > stalls and relying on random sleeps for the current reclaim behaviour
> > > can not be the solution, at all.
> > > 
> > > I just don't think we can remove it based on the argument that it
> > > doesn't do what it is supposed to do, when it does other things right
> > > that it is not supposed to do ;-)
> > > 
> > 
> > We are not removing it, we are just stopping it going to sleep for
> > stupid reasons. If we find that wall time is increasing as a result, we
> > have a path to figuring out what the real underlying problem is instead
> > of sweeping it under the rug.
> 
> Well, for that testcase it is in effect the same as a removal as
> there's never congestion.
> 
> But again: I agree with your changes per-se, I just don't think they
> should get merged as long as they knowingly catalyze a problem that
> has yet to be identified.

Ok, well there was some significant feedback on why wholesale changing of
congestion_wait() reached too far and I've incorporated that feedback. I
also integrated your workload into my testsuite (btw, because there is no
license the script has to download it from a google archive. I might get
back to you on licensing this so it can be made a permanent part of the suite).

These are the results just for your workload on the only machine I had
available with a lot of disk. There are a bunch of kernels because I'm testing
a superset of different series posted recently. The nocongest column is an
unreleased patch that has congestion_wait() and wait_iff_congested() that
only goes to sleep if there is real congestion or a lot of writeback going
on. Rather than worrying about the patch contents for now, lets consider
the results for just your workload.

The report is in 4 parts. The first is the vmstat counter differences as
a result of running your test. The exact interpretation of good and bad
here is open to interpretation. The second part is based on the vmscan
tracepoints. The third part is based on the congestion tracepoints and
the final part reports CPU usage and elapsed time.

MICRO
                                      traceonly-v1r4          nocongest-v1r4           lowlumpy-v1r4     nodirect-v1r4
pgalloc_dma                      89409.00 (   0.00%)     47750.00 ( -87.24%)     47430.00 ( -88.51%)     47246.00 ( -89.24%)
pgalloc_dma32                101407571.00 (   0.00%) 101518722.00 (   0.11%) 101502059.00 (   0.09%) 101511868.00 (   0.10%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                      74529.00 (   0.00%)     43386.00 ( -71.78%)     43213.00 ( -72.47%)     42691.00 ( -74.58%)
pgsteal_dma32                100666955.00 (   0.00%) 100712596.00 (   0.05%) 100712537.00 (   0.05%) 100713305.00 (   0.05%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma               118198.00 (   0.00%)     47370.00 (-149.52%)     49515.00 (-138.71%)     46134.00 (-156.21%)
pgscan_kswapd_dma32          177619794.00 (   0.00%) 161549938.00 (  -9.95%) 161679701.00 (  -9.86%) 156657926.00 ( -13.38%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma                27128.00 (   0.00%)     39215.00 (  30.82%)     36561.00 (  25.80%)     38806.00 (  30.09%)
pgscan_direct_dma32           23927492.00 (   0.00%)  40122173.00 (  40.36%)  39997463.00 (  40.18%)  45041626.00 (  46.88%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                      756020.00 (   0.00%)    903192.00 (  16.29%)    899965.00 (  15.99%)    868055.00 (  12.91%)
allocstall                        2722.00 (   0.00%)     70156.00 (  96.12%)     67554.00 (  95.97%)     87691.00 (  96.90%)

So, the allocstall counts go up of course because it is incremented
every time direct reclaim is entered and nocongest is only going to
sleep when there is congestion or significant writeback. I don't see
this as being nceessarily bad.

Direct scanning rates go up a bit as you'd expect - again because we are
sleeping less. It's interesting that the pages reclaimed is reduced implying
that despite higher scanning rates, there is less reclaim activity.

It's debatable if this is good or not because higher scanning rates in
themselves are not bad but fewer pages reclaimed seems positive so lets
see what the rest of the reports look like.

FTrace Reclaim Statistics: vmscan
            micro-traceonly-v1r4-micromicro-nocongest-v1r4-micromicro-lowlumpy-v1r4-micromicro-nodirect-v1r4-micro
                traceonly-v1r4    nocongest-v1r4     lowlumpy-v1r4     nodirect-v1r4
Direct reclaims                               2722      70156      67554      87691 
Direct reclaim pages scanned              23955333   40161426   40034132   45080524 
Direct reclaim write file async I/O              0          0          0          0 
Direct reclaim write anon async I/O              0          0          0          0 
Direct reclaim write file sync I/O               0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                       2718040   17801688   17622777   18379572 
Kswapd wakeups                                  24          1          1          1 
Kswapd pages scanned                     177738381  161597313  161729224  156704078 
Kswapd reclaim write file async I/O              0          0          0          0 
Kswapd reclaim write anon async I/O              0          0          0          0 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)       247.97      76.97      77.15      76.63 
Time kswapd awake (seconds)                 489.17     400.20     403.19     390.08 

Total pages scanned                      201693714 201758739 201763356 201784602
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        41.41%    16.03%    15.96%    16.32%
Percentage Time kswapd Awake                98.76%    98.94%    98.96%    98.87%

Interesting, kswapd is now staying awake (woke up only once) even though
the total time awake was reduced and it looks like because it was requested
to wake up a lot more that was keeping it awake. Despite the higher scan
rates from direct reclaim, the time actually spent direct reclaiming is
significantly reduced.

Scanning rates and times we direct reclaim go up but as we finish work a
lot faster, it would seem that we are doing less work overall.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited              3664          0          0          0 
Direct time   congest     waited          247636ms        0ms        0ms        0ms 
Direct full   congest     waited              3081          0          0          0 
Direct number conditional waited                 0      47587      45659      58779 
Direct time   conditional waited               0ms        0ms        0ms        0ms 
Direct full   conditional waited              3081          0          0          0 
KSwapd number congest     waited              1448        949        909        981 
KSwapd time   congest     waited          118552ms    31652ms    32780ms    38732ms 
KSwapd full   congest     waited              1056         90        115        147 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited              1056         90        115        147 

congest waited is congestion_wait() and conditional waited is
wait_iff_congested(). Look at what happens to the congest waited times
for direct reclaim - it disappeared and despite the number of times
wait_iff_congested() was called, it never actually decided it needed to
sleep. kswapd is still congestion waiting but the time it spent is
reduced.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)         350.9    403.27    406.12    393.02
Total Elapsed Time (seconds)                495.29    404.47    407.44    394.53

This is plain old time. The same test completes 91 seconds faster.
Ordinarily at this point I would be preparing to do a full series report
including the other benchmarks but I'm interested in seeing if there is
a significantly different reading of the above report as to whether it
is a "good" or "bad" result?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>