This is a follow-on series from "Avoid overflowing of stack during page reclaim". It eliminates writeback requiring a filesystem from direct reclaim and follows on by reducing the amount of IO required from page reclaim to mitigate any corner cases from the modification.

Changelog since V4
o Add patch to prioritise inodes for writeback
o Drop modifications to XFS and btrfs
o Correct units in post-processing script
o Add new patches from Wu related to writeback
o Only kick flusher threads when dirty file pages are encountered
o Increase size of writeback window when reclaim encounters dirty pages
o Remove looping logic from shrink_page_list and instead do it all from
  shrink_inactive_list
o Rebase to 2.6.35-rc6

Changelog since V3
o Distinguish between file and anon related IO from page reclaim
o Allow anon writeback from reclaim context
o Sync old inodes first in background writeback
o Pre-emptively clean pages when dirty pages are encountered on the LRU
o Rebase to 2.6.35-rc5

Changelog since V2
o Add acks and reviewed-bys
o Do not lock multiple pages at the same time for writeback as it's unsafe
o Drop the clean_page_list function. It alters timing with very little
  benefit. Without the contiguous writing, it doesn't do much to simplify
  the subsequent patches either
o Throttle processes that encounter dirty pages in direct reclaim. Instead,
  wake up flusher threads to clean the number of dirty pages encountered
o Rebase to 2.6.35-rc4 (implied by the V2 posting)

Changelog since V1
o Merge with series that reduces stack usage in page reclaim in general
o Allow memcg to writeback pages as they are not expected to overflow stack
o Drop the contiguous-write patch for the moment

There is a problem with the stack depth usage of page reclaim. Particularly during direct reclaim, it is possible to overflow the stack if it calls into the filesystem's writepage function. This patch series begins by preventing writeback from direct reclaim and allowing btrfs and xfs to write back from kswapd context. As this is a potentially large change, the remainder of the series aims to reduce any filesystem writeback from page reclaim and to depend more on background flush.

The first patch in the series is a roll-up of what is currently in mmotm. It is provided for convenience of testing.

Patches 2 and 3 note that it is important to distinguish between file and anon page writeback from page reclaim as they use stack to different depths. They update the tracepoints and post-processing script appropriately, noting which mmotm patch they should be merged with.

Patch 4 notes that the units in the report are incorrect and fixes them.

Patch 5 prevents direct reclaim writing out filesystem pages while still allowing writeback of anon pages, which are in less danger of overflowing the stack and do not have something like background flush to clean them. For filesystem pages, the flusher threads are asked to clean the number of pages encountered while the caller waits on congestion and puts the pages back on the LRU. For lumpy reclaim, the caller will wait for a time, calling the flusher multiple times and waiting on dirty pages to be written out, before trying to reclaim the dirty pages a second time.

This increases the responsibility of kswapd somewhat because it is now cleaning pages on behalf of direct reclaimers, but unlike the background flushers, kswapd knows which zone pages need to be cleaned from. As it is async IO, it should not cause kswapd to stall (at least until the queue is congested), but the order in which pages are reclaimed from the LRU is altered.
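To illustrate, the patch 5 decision looks roughly like the sketch below. This is not the actual code: page_is_file_cache(), current_is_kswapd(), wakeup_flusher_threads() and congestion_wait() are existing kernel interfaces, but the function names and nr_dirty_seen are made up for illustration and the placement within shrink_page_list()/shrink_inactive_list() is simplified.

/*
 * Sketch only: how direct reclaim avoids ->writepage for file pages.
 * reclaim_may_writepage(), reclaim_defer_to_flushers() and
 * nr_dirty_seen are illustrative names, not identifiers from the patch.
 */
#include <linux/mm.h>		/* struct page */
#include <linux/mm_inline.h>	/* page_is_file_cache() */
#include <linux/swap.h>		/* current_is_kswapd() */
#include <linux/writeback.h>	/* wakeup_flusher_threads() */
#include <linux/backing-dev.h>	/* congestion_wait() */

static bool reclaim_may_writepage(struct page *page,
				  unsigned long *nr_dirty_seen)
{
	/*
	 * Filesystem ->writepage can use a lot of stack; only kswapd
	 * is allowed to enter it.  Direct reclaim counts the dirty
	 * file page and leaves it on the LRU for the flusher threads.
	 */
	if (page_is_file_cache(page) && !current_is_kswapd()) {
		(*nr_dirty_seen)++;
		return false;
	}

	/* Anon pages have no background flusher, so reclaim writes them */
	return true;
}

/* After the LRU scan, the direct reclaimer hands the work over */
static void reclaim_defer_to_flushers(unsigned long nr_dirty_seen)
{
	if (!nr_dirty_seen)
		return;

	/* Ask background flush to clean as many pages as were seen... */
	wakeup_flusher_threads(nr_dirty_seen);
	/* ...and throttle briefly rather than writing the pages directly */
	congestion_wait(BLK_RW_ASYNC, HZ/10);
}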
Dirty pages that would have been reclaimed by direct reclaimers are getting another lap on the LRU. The dirty pages could have been put on a dedicated list, but this increased counter overhead and the number of lists, and it is unclear if it is necessary.

Patches 6 and 7 cover writeback-related changes: the first is a roll-up of what is in linux-next and the second is a roll-up of a series to write back older inodes first.

Patch 8 notes that dirty pages can still be found at the end of the LRU. If a number of them are encountered, it is reasonable to assume that a similar number of dirty pages will be discovered in the very near future as that was the dirtying pattern at the time. The patch pre-emptively kicks the background flusher to clean a number of pages, creating feedback from page reclaim to the background flusher that is based on scanning rates.

Patch 9 notes that patch 8 depends on a certain amount of luck that the right inodes are found. To improve the odds, inodes with pages at the end of the LRU are flagged. This is later picked up by the background flushers and those inodes are moved immediately to the dispatch queue.

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each machine had 3G of RAM and the CPUs were

X86:    Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64:  PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left at 20; tests on an earlier series indicated that moving to 40 did not make much difference. The filesystem used for all tests was XFS.

Five kernels are compared.

traceonly-v5    is the first 4 patches of this series
nodirect-v5     is the first 5 patches
flusholdest-v5  makes background flush behave like kupdated (patches 1-7)
flushforward-v5 pre-emptively cleans pages when encountered on the LRU (patches 1-8)
flushprio-v5    flags inodes with dirty pages at the end of the LRU (patches 1-9)

The results for each test are broken up into two parts. The first part is a report based on the ftrace post-processing script and covers direct reclaim and kswapd activity. The second part reports what percentage of time was spent in direct reclaim and what percentage of time kswapd was awake. To work out the percentage of time spent in direct reclaim, I used /usr/bin/time to get the User + Sys CPU time. The stalled time was taken from the post-processing script. The total time is (User + Sys + Stall) and the percentage is the stalled time over the total time.

I am omitting the actual performance results simply because they are not interesting, with very few significant changes.

kernbench
=========
No writeback from reclaim initiated and no performance change of significance.

IOzone
======
No writeback from reclaim initiated and no performance change of significance.

SysBench
========
The results were based on a read/write workload and, as the machine is under-provisioned for this type of test, the figures are very unstable with variances up to 15%, so they are not reported. Part of the problem is that larger thread counts push the test into swap, as the memory is insufficient, which destabilises the results further. I could tune for this, but it was reclaim that was important.
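To make the percentage calculation described above concrete, using figures from the X86-64 SysBench table below: the traceonly-v5 kernel stalled in direct reclaim for 28.84 seconds against 648.71 seconds of User+Sys CPU time, giving 28.84 / (648.71 + 28.84) = 4.26% of time spent in direct reclaim, and kswapd was awake for 2057.64 of the 6249.47 elapsed seconds, or 32.93%.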
X86                                      traceonly-v5    nodirect-v5  flusholdest-v5  flushforward-v5   flushprio-v5
Direct reclaims                                    18             15              31               22             34
Direct reclaim pages scanned                     1767           1885            3498             1666           2176
Direct reclaim write file async I/O               259              0               0                0              0
Direct reclaim write anon async I/O                28             26              86               32             32
Direct reclaim write file sync I/O                  0              0               0                0              0
Direct reclaim write anon sync I/O                  0              0               0                0              0
Wake kswapd requests                           266128         279738          365549           234752         352031
Kswapd wakeups                                    541            597             569              609            554
Kswapd pages scanned                         12485052       12620304        12713097         12302805       11592971
Kswapd reclaim write file async I/O            401149         374001          500370             5142           7271
Kswapd reclaim write anon async I/O            135967         130201          130628           129526         105762
Kswapd reclaim write file sync I/O                  0              0               0                0              0
Kswapd reclaim write anon sync I/O                  0              0               0                0              0
Time stalled direct reclaim (seconds)            0.00           0.04            0.58             0.59           0.40
Time kswapd awake (seconds)                   1040.47        1001.26         1057.79           884.41         909.75
User/Sys Time Running Test (seconds)          1258.12        1270.56         1285.73          1273.26        1270.09
Percentage Time Spent Direct Reclaim            0.00%          0.00%           0.05%            0.05%          0.03%
Total Elapsed Time (seconds)                  7622.12        7664.25         7730.25          7957.59        7661.06
Percentage Time kswapd Awake                   13.65%         13.06%          13.68%           11.11%         11.87%

Dirty file pages on the X86 test machine were not much of a problem to begin with and the patches eliminate them as expected, and the time to complete the test was not negatively impacted as a result. Pre-emptively writing back a window of dirty pages when they are encountered on the LRU makes a big difference - the number of dirty file pages encountered by kswapd was reduced by 99%. Prioritising inodes did not appear to help; it intuitively makes sense but probably requires a larger machine to illustrate the benefit.

X86-64                                   traceonly-v5    nodirect-v5  flusholdest-v5  flushforward-v5   flushprio-v5
Direct reclaims                                  1169            865             878              787            658
Direct reclaim pages scanned                   152716         187967          142147            89971          71595
Direct reclaim write file async I/O             19236              0               0                0              0
Direct reclaim write anon async I/O               222            288             778             4686           1810
Direct reclaim write file sync I/O                 10              0               0                0              0
Direct reclaim write anon sync I/O                  0              0               0                0              0
Wake kswapd requests                           989721         802358          792731           797486         718918
Kswapd wakeups                                   1478           1455            1554             1487           1551
Kswapd pages scanned                         16265687       16107419        17122329         15026395       15523377
Kswapd reclaim write file async I/O            542359         642927          722882           131580         110082
Kswapd reclaim write anon async I/O            220076         254827          250112           202714         210651
Kswapd reclaim write file sync I/O                  0              0               0                0              0
Kswapd reclaim write anon sync I/O                  0              0               0                0              0
Time stalled direct reclaim (seconds)           28.84          20.09           18.82             4.15           3.85
Time kswapd awake (seconds)                   2057.64        2039.19         2182.38          2102.69        2218.86
User/Sys Time Running Test (seconds)           648.71         641.01          666.85           658.78          661.4
Percentage Time Spent Direct Reclaim            4.26%          3.04%           2.74%            0.63%          0.58%
Total Elapsed Time (seconds)                  6249.47        6227.25         6586.07          6609.44        6779.25
Percentage Time kswapd Awake                   32.93%         32.75%          33.14%           31.81%         32.73%

Direct reclaim of filesystem pages is eliminated as expected without an impact on time, although kswapd had to write back more pages as a result. Flushing just the oldest inode was not much of a help in terms of how many pages needed to be written back from reclaim, but pre-emptively waking flusher threads helped a lot, with a reduction of 76% in the number of dirty pages written back by kswapd. Prioritising which inodes to write back further reduced the number of dirty pages written by kswapd.
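The feedback that produces these reductions (patch 8, with patch 9 described in the comment) can be pictured roughly as below. This is a sketch rather than the code in the series: wakeup_flusher_threads() is the existing kernel interface, while reclaim_kick_flushers() and RECLAIM_WRITE_WINDOW are made-up names standing in for whatever the patches actually use.

/*
 * Sketch only: the pre-emptive cleaning feedback of patch 8.
 * reclaim_kick_flushers() and RECLAIM_WRITE_WINDOW are illustrative
 * names, not identifiers from the series.
 */
#include <linux/writeback.h>	/* wakeup_flusher_threads() */

#define RECLAIM_WRITE_WINDOW	4	/* hypothetical scaling factor */

/*
 * Called after shrink_inactive_list() has found nr_dirty dirty file
 * pages at the tail of the LRU.  If pages were being dirtied in this
 * pattern, more dirty pages are probably right behind them, so ask
 * background flush for a window of writeback rather than the exact
 * number encountered.  Patch 9 additionally tags the inodes owning
 * those pages so the flusher threads service them first instead of
 * depending on luck to find the right ones.
 */
static void reclaim_kick_flushers(unsigned long nr_dirty)
{
	if (nr_dirty)
		wakeup_flusher_threads(nr_dirty * RECLAIM_WRITE_WINDOW);
}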
PPC64                                    traceonly-v5    nodirect-v5  flusholdest-v5  flushforward-v5   flushprio-v5
Direct reclaims                                  3768           4228            3941             3265           2397
Direct reclaim pages scanned                   245094         269801          332357           236762         158289
Direct reclaim write file async I/O             10616              0               0                0              0
Direct reclaim write anon async I/O             11546           1193           15267            21602          12538
Direct reclaim write file sync I/O                 23              0               0                0              0
Direct reclaim write anon sync I/O                  0              1               0                0              0
Wake kswapd requests                          1398632        1642783         1606559          1813376        1139228
Kswapd wakeups                                    476            583             520              560            472
Kswapd pages scanned                         14302902       17580548        16401067         16151135       12920691
Kswapd reclaim write file async I/O           1381817        1723621         1917396          1389232         909794
Kswapd reclaim write anon async I/O            157035         146377          130640           121417         125872
Kswapd reclaim write file sync I/O                  0              0               0                0              0
Kswapd reclaim write anon sync I/O                  0              0               0                0              0
Time stalled direct reclaim (seconds)           58.73          52.24           71.39            51.26          28.30
Time kswapd awake (seconds)                   1737.50        2024.64         1973.51          2098.99        1571.24
User/Sys Time Running Test (seconds)          1235.53        1326.49         1277.11          1320.22        1223.05
Percentage Time Spent Direct Reclaim            4.54%          3.79%           5.29%            3.74%          2.26%
Total Elapsed Time (seconds)                  8132.27        9184.86         8686.22          9133.44        7957.50
Percentage Time kswapd Awake                   21.37%         22.04%          22.72%           22.98%         19.75%

Direct reclaim filesystem writes are eliminated but the scan rates went way up. For some unknown reason, kswapd was scanning a lot harder, meaning that the total number of dirty pages encountered when pre-emptively waking the flusher threads was not reduced. The full series did reduce the amount of IO queued by page reclaim and is the strongest indicator that flushprio makes a positive difference.

Stress HighAlloc
================
This test builds a large number of kernels simultaneously so that the total workload is 1.5 times the size of RAM. It then attempts to allocate all of RAM as huge pages. The metric is the percentage of memory allocated under load (Pass 1), on a second attempt under load (Pass 2) and when the kernel compiles are finished and the system is quiet (At Rest). The patches have little impact on the success rates.

X86                                      traceonly-v5    nodirect-v5  flusholdest-v5  flushforward-v5   flushprio-v5
Direct reclaims                                   538            508             512              633            593
Direct reclaim pages scanned                   190812          70478           74946           127525          76617
Direct reclaim write file async I/O               628              0               0                0              0
Direct reclaim write anon async I/O             35379           7199            4746             7331           5738
Direct reclaim write file sync I/O                312              0               0                0              0
Direct reclaim write anon sync I/O              22652           3429             491               41              8
Wake kswapd requests                             1609           1586            1589             1750           1632
Kswapd wakeups                                    485            444             463              527            598
Kswapd pages scanned                         25022132        2664510         4084357          6451609        2512773
Kswapd reclaim write file async I/O            290963           7744           41175              145            261
Kswapd reclaim write anon async I/O           4170347         216140          140890           118612          91151
Kswapd reclaim write file sync I/O                  0              0               0                0              0
Kswapd reclaim write anon sync I/O                  0              0               0                0              0
Time stalled direct reclaim (seconds)        15440.79         703.88          402.93           329.52         172.65
Time kswapd awake (seconds)                  13618.08         611.77          464.31           394.77         250.73
User/Sys Time Running Test (seconds)          2724.57        1863.47         1844.16          1780.75         1834.6
Percentage Time Spent Direct Reclaim           85.00%         27.42%          17.93%           15.62%          8.60%
Total Elapsed Time (seconds)                 17607.54        2796.22         2522.39          2432.76        2285.82
Percentage Time kswapd Awake                   77.34%         21.88%          18.41%           16.23%         10.97%

Total time running the test was massively reduced by the series, and the full series reduces writeback from page reclaim to almost negligible levels.
X86-64                                   traceonly-v5    nodirect-v5  flusholdest-v5  flushforward-v5   flushprio-v5
Direct reclaims                                  1176           1099            1036             1163           1082
Direct reclaim pages scanned                   184337         122290          122255           143548         141703
Direct reclaim write file async I/O              2317              0               0                0              0
Direct reclaim write anon async I/O             35551          15499           18552             8564          16526
Direct reclaim write file sync I/O               1817              0               0                0              0
Direct reclaim write anon sync I/O              15920           6131            6516               55            473
Wake kswapd requests                             1175           4843            1032             8758           1082
Kswapd wakeups                                   1106            868             767              990            810
Kswapd pages scanned                         27016425        4411416         4602059          4541873        4748067
Kswapd reclaim write file async I/O            608823          58552           42697            95656           3039
Kswapd reclaim write anon async I/O           3221178         321297          324274           148098         240716
Kswapd reclaim write file sync I/O                  0              0               0                0              0
Kswapd reclaim write anon sync I/O                  0              0               0                0              0
Time stalled direct reclaim (seconds)        10163.96         808.97          977.48             0.00         442.69
Time kswapd awake (seconds)                   6530.96         637.66          796.94           344.83         462.47
User/Sys Time Running Test (seconds)          2824.92        2804.07         2774.22          2783.62        2794.37
Percentage Time Spent Direct Reclaim           78.25%         22.39%          26.05%            0.00%         13.68%
Total Elapsed Time (seconds)                 12509.94        3181.40         3362.38          2670.34        2835.91
Percentage Time kswapd Awake                   52.21%         20.04%          23.70%           12.91%         16.31%

Like X86, the total time spent on the test was significantly reduced and, like elsewhere, filesystem IO due to reclaim is way down.

PPC64                                    traceonly-v5    nodirect-v5  flusholdest-v5  flushforward-v5   flushprio-v5
Direct reclaims                                   557            703             750              671            777
Direct reclaim pages scanned                   139469         117372          126661           109564         117223
Direct reclaim write file async I/O               639              0               0                0              0
Direct reclaim write anon async I/O             28997          15147           17780            12098          17165
Direct reclaim write file sync I/O                276              0               0                0              0
Direct reclaim write anon sync I/O              11486          12128            8582             4071           1985
Wake kswapd requests                              278            261             295              300            284
Kswapd wakeups                                    217            177             201              161            179
Kswapd pages scanned                          8099598        4109065         6229604          4645288        4007059
Kswapd reclaim write file async I/O            213775          42499           43389             6484           2868
Kswapd reclaim write anon async I/O           1836546         503102          872233           316106         368779
Kswapd reclaim write file sync I/O                  0              0               0                0              0
Kswapd reclaim write anon sync I/O                  0              0               0                0              0
Time stalled direct reclaim (seconds)         4604.20        1247.14         2007.51           749.63         793.94
Time kswapd awake (seconds)                   4020.08        1211.26         1886.92           762.10         805.14
User/Sys Time Running Test (seconds)          3585.62        2607.25         2575.92           2219.6        2559.04
Percentage Time Spent Direct Reclaim           56.22%         32.36%          43.80%           25.25%         23.68%
Total Elapsed Time (seconds)                  6023.51        2765.10         3492.52          2245.45        2308.06
Percentage Time kswapd Awake                   66.74%         43.81%          54.03%           33.94%         34.88%

Similar story: the test completed faster and page reclaim IO is down.

Overall, I think this series is a step in the right direction, particularly with respect to reducing the number of pages written by page reclaim.

In terms of merging, I don't think it matters if Wu's patches go in before or after. If they go in after, the last patch in this series has a minor conflict but it is easily resolved. I included the roll-ups here to present the results but the rest of the series should stand on its own.

Any opinions on the suitability for merging of the following parts of the series?
vmscan: tracing: Roll up of patches currently in mmotm
vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages
vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim
vmscan: tracing: Correct units in post-processing script
vmscan: Do not writeback filesystem pages in direct reclaim
vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
writeback: Prioritise dirty inodes encountered by reclaim for background flushing

 .../trace/postprocess/trace-vmscan-postprocess.pl |  686 ++++++++++++++++++++
 fs/btrfs/inode.c                                  |    2 +-
 fs/drop_caches.c                                  |    2 +-
 fs/fs-writeback.c                                 |  225 +++++--
 fs/gfs2/inode.c                                   |    2 +-
 fs/inode.c                                        |   16 +-
 fs/nilfs2/gcdat.c                                 |    2 +-
 fs/notify/inode_mark.c                            |    6 +-
 fs/notify/inotify/inotify.c                       |    7 +-
 fs/quota/dquot.c                                  |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c                       |    4 +-
 include/linux/backing-dev.h                       |    7 +-
 include/linux/fs.h                                |    9 +-
 include/linux/memcontrol.h                        |    5 -
 include/linux/mmzone.h                            |   15 -
 include/linux/writeback.h                         |    5 +-
 include/trace/events/gfpflags.h                   |   37 +
 include/trace/events/kmem.h                       |   38 +-
 include/trace/events/vmscan.h                     |  202 ++++++
 include/trace/events/writeback.h                  |  148 +++++
 mm/backing-dev.c                                  |  115 +---
 mm/memcontrol.c                                   |   31 -
 mm/page-writeback.c                               |    5 +-
 mm/page_alloc.c                                   |    2 -
 mm/vmscan.c                                       |  483 ++++++------
 mm/vmstat.c                                       |    2 -
 26 files changed, 1584 insertions(+), 474 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h
 create mode 100644 include/trace/events/writeback.h