Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Kairui Song <ryncsn@xxxxxxxxx> · Wed, 20 Dec 2023 02:58:22 +0800

On Tue, Dec 19, 2023 at 11:45 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >
> > > On Fri, Dec 15, 2023 at 12:56 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Dec 14, 2023 at 11:09 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Dec 12, 2023 at 2:52 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Dec 12, 2023 at 6:07 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Dec 8, 2023 at 2:14 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > on:
> > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > >    rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > >    cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > >    client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > >    and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > >    it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > >    buffer pools (anon).
> > > > > > > > > > > > >
> > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > >    tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > >    descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > >    more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > >    to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > >    through file descriptors are worth protecting.)
> > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > >    that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > >    similar to the above.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > >                            Before     After      Change
> > > > > > > > > > > > >   workingset_refault_anon  25641024   25598972   0%
> > > > > > > > > > > > >   workingset_refault_file  115016834  106178438  -8%
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for your amazing work on MGLRU.
> > > > > > > > > > > >
> > > > > > > > > > > > I believe this is similar to the issue I was trying to resolve previously:
> > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > The idea is to use refault distance to decide whether a page should be
> > > > > > > > > > > > placed in the oldest generation or in some other generation. Per my tests
> > > > > > > > > > > > this worked very well, and we have been using refault distance for MGLRU
> > > > > > > > > > > > in multiple workloads.
> > > > > > > > > > > >
> > > > > > > > > > > > There are a few issues left in my previous RFC series, e.g. anon pages
> > > > > > > > > > > > in MGLRU shouldn't be considered. I wanted to collect feedback or test
> > > > > > > > > > > > cases, but unfortunately it didn't seem to get much attention
> > > > > > > > > > > > upstream.
> > > > > > > > > > > >
> > > > > > > > > > > > I think both this patch and my previous series aim to solve the
> > > > > > > > > > > > underprotected file page issue. I did a quick test using this
> > > > > > > > > > > > series, and for the MongoDB test refault distance still seems to be the
> > > > > > > > > > > > better solution (I'm not saying these two optimizations are mutually
> > > > > > > > > > > > exclusive, just that they conflict somewhat in implementation and solve
> > > > > > > > > > > > a similar problem):
> > > > > > > > > > > >
> > > > > > > > > > > > Previous result:
> > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > >                   Executed        Time (µs)       Rate
> > > > > > > > > > > >   STOCK_LEVEL     2542            27121571486.2   0.09 txn/s
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > >   TOTAL           2542            27121571486.2   0.09 txn/s
> > > > > > > > > > > >
> > > > > > > > > > > > This patch:
> > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > >                   Executed        Time (µs)       Rate
> > > > > > > > > > > >   STOCK_LEVEL     1594            27061522574.4   0.06 txn/s
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > >   TOTAL           1594            27061522574.4   0.06 txn/s
> > > > > > > > > > > >
> > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > >
> > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > - Refault distance makes use of page shadows, so it can better
> > > > > > > > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > > > > > > > distance).
> > > > > > > > > > > > - Throttled refault distance can help hold part of the workingset when
> > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > >
> > > > > > > > > > > > So maybe part of this patch and bits of the previous series can be
> > > > > > > > > > > > combined to work better on this issue. What do you think?
> > > > > > > > > > >
> > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > >
> > > > > > > > > Hi Yu,
> > > > > > > > >
> > > > > > > > > I'm working on V4 of the RFC now. It only updates some comments and
> > > > > > > > > skips anon page re-activation in the refault path for MGLRU, which was not
> > > > > > > > > very helpful; these are only tiny adjustments.
> > > > > > > > > I also found it easier to test with fio, using the following test script:
> > > > > > > > >
> > > > > > > > > #!/bin/bash
> > > > > > > > > swapoff -a
> > > > > > > > >
> > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > >
> > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > >
> > > > > > > > > echo 4G > memory.max
> > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > >
> > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > >           --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > >           --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > >           --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > >           --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > >
> > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > towards certain pages.
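
To get a feel for how mild that bias is, here is a minimal sketch in
Python. It assumes the textbook Zipf form, where the k-th most popular
page is picked with probability proportional to 1/k^theta; fio's own
generator may differ in details. The helper name zipf_top_share and the
page count are illustrative and not taken from the test setup.

```python
def zipf_top_share(n_pages: int, theta: float, top_fraction: float) -> float:
    """Share of accesses that land on the hottest `top_fraction` of pages,
    assuming P(rank k) is proportional to 1 / k**theta (textbook Zipf)."""
    weights = [1.0 / (k ** theta) for k in range(1, n_pages + 1)]
    top_n = max(1, int(n_pages * top_fraction))
    return sum(weights[:top_n]) / sum(weights)


if __name__ == "__main__":
    n = 262144  # hypothetical: 1 GiB worth of 4 KiB pages
    for theta in (0.0, 0.5, 1.2):
        share = zipf_top_share(n, theta, 0.10)
        print(f"theta={theta}: hottest 10% of pages get {share:.1%} of reads")
```

With theta = 0 the distribution is uniform, so the hottest 10% of pages
receive 10% of the reads; a larger theta concentrates reads on fewer
pages.
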
> > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > >    READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with RFC v4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > >    READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with this series:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > >    READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > MGLRU off:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > >    READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > >    READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > Patched with RFC v4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > >    READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with this series:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > >    READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > MGLRU off:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > >    READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > >
> > > > > > > > > fio uses a ramdisk, so LRU accuracy has a smaller impact. The MongoDB
> > > > > > > > > test I provided before uses a SATA SSD, so the impact is much higher.
> > > > > > > > > I'll provide a script to set up the test case and run it. It's
> > > > > > > > > more complex to set up than fio since it involves setting up multiple
> > > > > > > > > replicas, auth, and hundreds of GB of test fixtures. I'm currently
> > > > > > > > > occupied by some other tasks but will try my best to send them out as
> > > > > > > > > soon as possible.
> > > > > > > >
> > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > and usually higher refaults result in worse performance.
> > > >
> > > > And thanks for providing the refaults I requested -- your data
> > > > below confirms what I mentioned above:
> > > >
> > > > For fio:
> > > >                            Your RFC   This series   Change
> > > >   workingset_refault_file  628192729  596790506     -5%
> > > >   IOPS                     1862k      1830k         -2%
> > > >
> > > > For MongoDB:
> > > >                            Your RFC   This series   Change
> > > >   workingset_refault_anon  10512      35277         +236%
> > > >   workingset_refault_file  22751782   20335355      -11%
> > > >   total                    22762294   20370632      -11%
> > > >   TPS                      0.09       0.06          -33%
> > > >
> > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > especially when using zram, since an anon refault should be a lot
> > > > cheaper than a file refault.
> > > >
> > > > So, I'm baffled...
> > > >
> > > > One important detail I forgot to mention: based on your data from
> > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > >
> > > >                   Your Kconfig  My Kconfig  Max possible
> > > >   LRU_REFS_WIDTH  1             2           2
> > >
> > > Hi Yu,
> > >
> > > Thanks for the info, my fault, I forgot to update my config as I was
> > > testing some other features.
> > > But after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, things
> > > got much worse for the MongoDB test:
> > >
> > > With LRU_REFS_WIDTH == 2:
> > >
> > > This patch:
> > > ==================================================================
> > > Execution Results after 919 seconds
> > > ------------------------------------------------------------------
> > >                   Executed        Time (µs)       Rate
> > >   STOCK_LEVEL     488             27598136201.9   0.02 txn/s
> > > ------------------------------------------------------------------
> > >   TOTAL           488             27598136201.9   0.02 txn/s
> > >
> > > memcg    86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > >  node     0
> > >           1     948187          0x          0x
> > >                      0          0           0           0           0           0           0
> > >                      1          0           0           0           0           0           0
> > >                      2          0           0           0           0           0           0
> > >                      3          0           0           0           0           0           0
> > >                                 0           0           0           0           0           0
> > >           2     948187          0     6051788
> > >                      0          0r          0e          0p      11916r      66442e          0p
> > >                      1          0r          0e          0p        903r      16888e          0p
> > >                      2          0r          0e          0p        459r       9764e          0p
> > >                      3          0r          0e          0p          0r          0e       2874p
> > >                                 0           0           0           0           0           0
> > >           3     948187    1353160        6351
> > >                      0          0           0           0           0           0           0
> > >                      1          0           0           0           0           0           0
> > >                      2          0           0           0           0           0           0
> > >                      3          0           0           0           0           0           0
> > >                                 0           0           0           0           0           0
> > >           4      73045      23573          12
> > >                      0          0R          0T          0     3498607R    4868605T          0
> > >                      1          0R          0T          0     3012246R    3270261T          0
> > >                      2          0R          0T          0     2498608R    2839104T          0
> > >                      3          0R          0T          0           0R    1983947T          0
> > >                           1486579L          0O    1380614Y       2945N       2945F       2734A
> > >
> > > workingset_refault_anon 0
> > > workingset_refault_file 18130598
> > >
> > >               total        used        free      shared  buff/cache   available
> > > Mem:          31978        6705         312          20       24960       24786
> > > Swap:         31977           4       31973
> > >
> > > RFC:
> > > ==================================================================
> > > Execution Results after 908 seconds
> > > ------------------------------------------------------------------
> > >                   Executed        Time (µs)       Rate
> > >   STOCK_LEVEL     2252            27159962888.2   0.08 txn/s
> > > ------------------------------------------------------------------
> > >   TOTAL           2252            27159962888.2   0.08 txn/s
> > >
> > > workingset_refault_anon 22585
> > > workingset_refault_file 22715256
> > >
> > > memcg    66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > >  node     0
> > >          22     563007       2274     1198225
> > >                      0          0r          1e          0p          0r     697076e          0p
> > >                      1          0r          0e          0p          0r          0e     325661p
> > >                      2          0r          0e          0p          0r          0e     888728p
> > >                      3          0r          0e          0p          0r          0e    3602238p
> > >                                 0           0           0           0           0           0
> > >          23     532222       7525     4948747
> > >                      0          0           0           0           0           0           0
> > >                      1          0           0           0           0           0           0
> > >                      2          0           0           0           0           0           0
> > >                      3          0           0           0           0           0           0
> > >                                 0           0           0           0           0           0
> > >          24     500367    1214667        3292
> > >                      0          0           0           0           0           0           0
> > >                      1          0           0           0           0           0           0
> > >                      2          0           0           0           0           0           0
> > >                      3          0           0           0           0           0           0
> > >                                 0           0           0           0           0           0
> > >          25     469692      40797         466
> > >                      0          0R        271T          0           0R    1162165T          0
> > >                      1          0R          0T          0      774028R    1205332T          0
> > >                      2          0R          0T          0           0R     932484T          0
> > >                      3          0R          1T          0           0R    4252158T          0
> > >                          25178380L     156515O   23953602Y      59234N      49391F      48664A
> > >
> > >               total        used        free      shared  buff/cache   available
> > > Mem:          31978        6968         338           5       24671       24555
> > > Swap:         31977        1533       30444
> > >
> > > Using same mongodb config (a 3 replica cluster using the same config):
> > > {
> > >     "net": {
> > >         "bindIpAll": true,
> > >         "ipv6": false,
> > >         "maxIncomingConnections": 10000
> > >     },
> > >     "setParameter": {
> > >         "disabledSecureAllocatorDomains": "*"
> > >     },
> > >     "replication": {
> > >         "oplogSizeMB": 10480,
> > >         "replSetName": "issa-tpcc_0"
> > >     },
> > >     "security": {
> > >         "keyFile": "/data/db/keyfile"
> > >     },
> > >     "storage": {
> > >         "dbPath": "/data/db/",
> > >         "syncPeriodSecs": 60,
> > >         "directoryPerDB": true,
> > >         "wiredTiger": {
> > >             "engineConfig": {
> > >                 "cacheSizeGB": 5
> > >             }
> > >         }
> > >     },
> > >     "systemLog": {
> > >         "destination": "file",
> > >         "logAppend": true,
> > >         "logRotate": "rename",
> > >         "path": "/data/db/mongod.log",
> > >         "verbosity": 0
> > >     }
> > > }
> > >
> > > The test environment has 32G of memory and 16 cores.
> > >
> > > Per my analysis, the access pattern of the MongoDB test is that a page
> > > is re-accessed long after it's evicted, so the PID controller won't
> > > protect the higher tiers. The RFC makes use of the long-lived
> > > shadow entries to feed back into the PID controller and generations, so
> > > the result is much better. It still needs more adjusting though; I will
> > > try to rebase on top of mm-unstable, which includes your patch.
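
The refault-distance idea referred to above can be illustrated with a
toy model. This is only a sketch of the general concept (a shadow entry
records when a page was evicted, and the distance is measured at
refault time); it is not the RFC's code, and ShadowLRU, should_protect
and workingset_pages are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowLRU:
    """Toy model of refault-distance tracking, not kernel code."""
    evictions: int = 0                           # global eviction counter
    shadows: dict = field(default_factory=dict)  # page id -> counter at eviction

    def evict(self, page: int) -> None:
        # Bump the counter, then leave a shadow entry with its value.
        self.evictions += 1
        self.shadows[page] = self.evictions

    def refault_distance(self, page: int):
        # Number of evictions that happened after this page was evicted,
        # or None if there is no shadow entry (a first-time fault).
        snapshot = self.shadows.pop(page, None)
        if snapshot is None:
            return None
        return self.evictions - snapshot


def should_protect(distance, workingset_pages: int) -> bool:
    # Hypothetical policy: protect a page whose refault distance is no
    # larger than the assumed workingset size.
    return distance is not None and distance <= workingset_pages


if __name__ == "__main__":
    lru = ShadowLRU()
    for page in (1, 2, 3):
        lru.evict(page)
    d = lru.refault_distance(1)
    print(d, should_protect(d, workingset_pages=4))
```

A page that refaults long after eviction still has a shadow entry in
this model, so its distance can be measured even when tier counters
alone would no longer remember it.
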
> > >
> > > I have no idea why workingset_refault_* is higher in the better
> > > case. This is clearly an IO-bound workload: memory and IO are busy while
> > > the CPU is not fully used...
> > >
> > > I've uploaded my local reproducer here:
> > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > https://github.com/ryncsn/py-tpcc
> >
> > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > version did you use? setup.sh didn't seem to install it.
> >
> > Also do you have a QEMU image? It'd be a lot easier for me to
> > duplicate the exact environment by looking into it.
>
> I ended up using docker.io/mongodb/mongodb-community-server:latest,
> and it's not working:
>
> # docker exec -it mongo-r1 mongosh --eval \
> '"rs.initiate({
>     _id: "issa-tpcc_0",
>     members: [
>       {_id: 0, host: "mongo-r1"},
>       {_id: 1, host: "mongo-r2"},
>       {_id: 2, host: "mongo-r3"}
>     ]
> })"'
> Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> Error: can only create exec sessions on running containers: container
> state improper

Hi Yu,

I've updated the test repo:
https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster

I've tested it on top of the latest Fedora Cloud Image 39 and it worked
well for me. The README now contains detailed, easy-to-follow
steps to reproduce this test.

I've also updated the patch series. I plan to send it out as RFC v4
later, but I need another day or two to tidy up and collect test
results:
https://github.com/ryncsn/linux/commits/kasong/devel/refault-distance-v4/

You may want to test on top of it; I'd be very grateful for any
feedback.

It's on top of current mm-unstable to make it work well with your fix
too. I managed to tweak it to be compatible with this series, but it
seems it might cause over-protection of pages and so the performance
is slightly worse than RFC v3.

And this commit message contains my latest test results for the MongoDB case:
https://github.com/ryncsn/linux/commit/cd84e5c8e2449d33d411bce1d863bc391f36d7c8
You can see it's an IO-bound task (100% ioutil and low CPU usage)
and the anon pages are really idle; using ZRAM or the same disk as swap
results in similar performance on the patched kernel.

As for the aging overhead issue I suspected before (a regression in
fio due to more aging), I think it's real, and I added two patches:
https://github.com/ryncsn/linux/commit/f80cc280752da59272870378947aad6c822be2b4
https://github.com/ryncsn/linux/commit/01d091c98077a74bc70153cc7a0179a17da4f26f

In the test cases we talked about above, where more than ~100 generations
are created during the fio test, I suspected that the aging overhead is
large and drags down performance.
After these two patches, for a similar test case, fio improved from this:

Run status group 0 (all jobs):
   READ: bw=7593MiB/s (7962MB/s), 7593MiB/s-7593MiB/s
(7962MB/s-7962MB/s), io=2225GiB (2389GB), run=300002-300002msec
workingset_refault_anon 0
workingset_refault_file 641594126

To this:
Run status group 0 (all jobs):
   READ: bw=7747MiB/s (8124MB/s), 7747MiB/s-7747MiB/s
(8124MB/s-8124MB/s), io=2270GiB (2437GB), run=300001-300001msec
workingset_refault_anon 0
workingset_refault_file 641511205
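
For reference, the relative change between the two runs above can be
worked out with a few lines of Python. This is just arithmetic on the
figures already quoted; pct_change is an illustrative helper name.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the two fio runs above.
bw_before, bw_after = 7593, 7747                      # READ bandwidth, MiB/s
refault_before, refault_after = 641594126, 641511205  # workingset_refault_file

if __name__ == "__main__":
    print(f"bandwidth:     {pct_change(bw_before, bw_after):+.2f}%")
    print(f"file refaults: {pct_change(refault_before, refault_after):+.3f}%")
```

Bandwidth goes up by about 2% while file refaults stay nearly
unchanged, which would be consistent with the gain coming from lower
aging overhead and not from different eviction decisions.
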

The lru_gen stats are similar in both cases:
memcg    66 /benchmark
 node     0
        119     155874          0           0x
                     0          0r          0e          0p          0           0           0
                     1          0r          0e          0p          0           0           0
                     2          0r          0e          0p          0           0           0
                     3          0r          0e          0p          0           0           0
                                0           0           0           0           0           0

        120     151024          0       71410
                     0          0           0           0           0r     587382e          0p
                     1          0           0           0           0r          0e     117796p
                     2          0           0           0           0r          0e     193086p
                     3          0           0           0           0r          0e     371926p
                                0           0           0           0           0           0

        121     146375          0      682854
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0

        122     141469          0        1348
                     0          0R          0T          0           0R    5132602T          0
                     1          0R          0T          0       86010R     244504T          0
                     2          0R          0T          0           0R     196061T          0
                     3          0R          0T          0           0R     397253T          0
                           367101L      15850O      15820Y      93396N       1275F        459A
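
As a reading aid for the dump above: each generation row carries the
generation number, its age in milliseconds, and the anon and file page
counts. A minimal Python sketch that pulls out those rows is below. It
assumes the rows are not wrapped by the mail client and that pages are
4 KiB; parse_generations and SAMPLE are illustrative names, and SAMPLE
holds only the four generation rows copied from the dump.

```python
import re

# A generation row has exactly four numeric fields: sequence number, age in
# milliseconds, anon pages, file pages. A single trailing marker after a
# page count (such as "x") is tolerated and ignored.
GEN_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)[^\d\s]?\s+(\d+)[^\d\s]?\s*$")


def parse_generations(text: str):
    """Return (seq, age_ms, nr_anon, nr_file) for each generation row."""
    rows = []
    for line in text.splitlines():
        m = GEN_ROW.match(line)
        if m:
            rows.append(tuple(int(g) for g in m.groups()))
    return rows


SAMPLE = """\
        119     155874          0           0x
        120     151024          0       71410
        121     146375          0      682854
        122     141469          0        1348
"""

if __name__ == "__main__":
    gens = parse_generations(SAMPLE)
    file_pages = sum(nr_file for _, _, _, nr_file in gens)
    # 4 KiB pages are an assumption, not something stated in the dump.
    print(f"{len(gens)} generations, {file_pages} file pages, "
          f"{file_pages * 4096 / 2**30:.2f} GiB")
```

Under the 4 KiB assumption, the 755612 file pages add up to roughly
2.9 GiB, which fits under the 4G memory.max used in the fio script.
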

The overhead of cmpxchg on page flag updates is unavoidable though.
Could I send out the two bulk update patches first for a proper
review?