[RFC][PATCH v11 0/2] mm: Support for page hinting

Nitesh Narayan Lal <nitesh@xxxxxxxxxx> · Wed, 10 Jul 2019 15:51:56 -0400

This patch series proposes an efficient mechanism for reporting free memory
from a guest to its hypervisor. It especially enables guests with no page cache
(e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
rapidly hand back free memory to the hypervisor.
This approach has a minimal impact on the existing core-mm infrastructure.

Measurement results (measurement details appended to this email):
*Number of 5GB guests (each touching 4GB memory) that can be launched
without swap usage on a system with 15GB:
unmodified kernel - 2, 3rd with 2.5GB   
v11 page hinting - 6, 7th with 26MB    
v1 bubble hinting[1] - 6, 7th with 1.8GB (bubble hinting is another series
proposed to solve the same problems)

*Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           

This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
A new hook after buddy merging is used to set the bits in the bitmap.
Currently, the bits are only cleared when pages are hinted, not when pages are
re-allocated.

Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
threshold is met, trying to isolate and report pages that are still free.

The isolated pages are reported via virtio-balloon, which is responsible for
sending batched pages to the host synchronously. Once the hypervisor processed
the hinting request, the isolated pages are returned back to the buddy.

Changelog in v11:
* Added logic to take care of multiple NUMA nodes scenarios.
* Simplified the logic for reporting isolated pages to the host. (Eg. replaced
dynamically allocated arrays with static ones, introduced wait event instead of
the loop in order to wait for a response from the host)
* Added a mutex to prevent race condition when page hinting is enabled by
multiple drivers.
* Simplified the logic responsible for decrementing free page counter for each
zone.
* Simplified code structuring/naming.

Known work items for the future:
* Test device assigned guests to ensure that hinting doesn't break it.
* Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
* Decide between MADV_DONTNEED and MADV_FREE.
* Look into memory hotplug, more efficient locking, better naming conventions to
avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
* Come up with proper/traceable error-message/logs and look into other code
simplifications. (If necessary).

Benefit analysis:
1. Number of 5GB guests (each touching 4GB memory) that can be launched without
swap usage on a system with 15GB:
unmodified kernel - 2, 3rd with 2.5GB   
v11 page hinting - 6, 7th with 26MB    
v1 bubble hinting - 6, 7th with 1.8GB   

Conclusion - In this particular testcase on using v11 page hinting and
v1 bubble-hinting 4 more guests could be launched without swapping compared
to an unmodified kernel.
For the 7th guest launch, v11 page hinting is slightly better than v1 Bubble
hinting as it touches lesser swap space.

Setup & procedure - 
Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
Guest Memory = 5GB
Number of CPUs in the guest = 1
Host swap = 4GB
Workload = test allocation program that allocates 4GB memory, touches it via
memset and exits.
The first guest is launched and once its console is up, the test allocation
program is executed with 4 GB memory request (Due to this the guest occupies
almost 4-5 GB of memory in the host in a system without page hinting). Once
this program exits at that time another guest is launched in the host and the
same process is followed. It is continued until the swap is not used.

2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           

For this particular test-case in a guest which doesn't require swap access
"memhog 6G" execution time lies within a range of 15-30s.
Conclusion -
In the above test case for an unmodified kernel on executing memhog in the
third guest execution time rises to above 2minutes due to swap access.
Using either page-hinting or bubble hinting brings this execution time to a
a normal range of 15-30s.

Setup & procedure -
Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
Guest Memory = 6GB
Number of CPUs in the guest = 4
Process = 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
one after the other in each of them.
Host swap = 4GB

Performance Analysis:
1. will-it-scale's page_faul1
Setup -
Guest Memory = 6GB
Number of cores = 24

Unmodified kernel -
0,0,100,0,100,0
1,514453,95.84,519502,95.83,519502
2,991485,91.67,932268,91.68,1039004
3,1381237,87.36,1264214,87.64,1558506
4,1789116,83.36,1597767,83.88,2078008
5,2181552,79.20,1889489,80.08,2597510
6,2452416,75.05,2001879,77.10,3117012
7,2671047,70.90,2263866,73.22,3636514
8,2930081,66.75,2333813,70.60,4156016
9,3126431,62.60,2370108,68.28,4675518
10,3211937,58.44,2454093,65.74,5195020
11,3162172,54.32,2450822,63.21,5714522
12,3154261,50.14,2272290,58.98,6234024
13,3115174,46.02,2369679,57.74,6753526
14,3150511,41.86,2470837,54.02,7273028
15,3134158,37.71,2428129,51.98,7792530
16,3143067,33.57,2340469,49.54,8312032
17,3112457,29.43,2263627,44.81,8831534
18,3089724,25.29,2181879,38.69,9351036
19,3076878,21.15,2236505,40.01,9870538
20,3091978,16.95,2266327,35.00,10390040
21,3082927,12.84,2172578,28.12,10909542
22,3055282,8.73,2176269,29.14,11429044
23,3081144,4.56,2138442,24.87,11948546
24,3075509,0.45,2173753,21.62,12468048

page hinting -
0,0,100,0,100,0
1,491683,95.83,494366,95.82,494366
2,988415,91.67,919660,91.68,988732
3,1344829,87.52,1244608,87.69,1483098
4,1797933,83.37,1625797,83.70,1977464
5,2179009,79.21,1881534,80.13,2471830
6,2449858,75.07,2078137,76.82,2966196
7,2732122,70.90,2178105,73.75,3460562
8,2910965,66.75,2340901,70.28,3954928
9,3006665,62.61,2353748,67.91,4449294
10,3164752,58.46,2377936,65.08,4943660
11,3234846,54.32,2510149,63.14,5438026
12,3165477,50.17,2412007,59.91,5932392
13,3141457,46.05,2421548,57.85,6426758
14,3135839,41.90,2378021,53.81,6921124
15,3109113,37.75,2269290,51.76,7415490
16,3093613,33.62,2346185,48.73,7909856
17,3086542,29.49,2352140,46.19,8404222
18,3048991,25.36,2217144,41.52,8898588
19,2965500,21.18,2313614,38.18,9392954
20,2928977,17.05,2175316,35.67,9887320
21,2896667,12.91,2141311,28.90,10381686
22,3047782,8.76,2177664,28.24,10876052
23,2994503,4.58,2160976,22.97,11370418
24,3038762,0.47,2053533,22.39,11864784

bubble-hinting v1 - 
0,0,100,0,100,0
1,515272,95.83,492355,95.81,515272
2,985903,91.66,919653,91.68,1030544
3,1475300,87.51,1353723,87.65,1545816
4,1783938,83.36,1586307,83.78,2061088
5,2093307,79.20,1867395,79.95,2576360
6,2441370,75.05,2055421,76.65,3091632
7,2650471,70.89,2246014,72.93,3606904
8,2926782,66.75,2333601,70.41,4122176
9,3107617,62.60,2383112,68.46,4637448
10,3192332,58.44,2441626,65.84,5152720
11,3268043,54.32,2235964,62.92,5667992
12,3191105,50.18,2449045,60.49,6183264
13,3145317,46.05,2377317,57.80,6698536
14,3161552,41.91,2395814,53.26,7213808
15,3140443,37.77,2333200,51.42,7729080
16,3130866,33.65,2150967,46.11,8244352
17,3112894,29.52,2372068,45.93,8759624
18,3078424,25.39,2336211,39.85,9274896
19,3036457,21.27,2224821,35.25,9790168
20,3046330,17.13,2199755,37.43,10305440
21,2981130,12.98,2214862,28.67,10820712
22,3017481,8.84,2195996,29.69,11335984
23,2979906,4.68,2173395,25.90,11851256
24,2971170,0.52,2134311,21.89,12366528

Conclusion -
For an unmodified kernel, with every fresh boot, there is 3-4% delta observed
in the results wrt the numbers mentioned above. For both bubble-hinting and
page-hinting, there was no noticeable degradation observed other than the
expected variability mentioned earlier. 

Page hinting vs bubble hinting:
>From the benefits and performance perspective, both solutions look quite similar
so far. However, unlike bubble-hinting which is more invasive, the overall core
mm changes required for page hinting are minimal.

[1] https://lkml.org/lkml/2019/6/19/926

Nitesh Narayan Lal (2):
  mm: page_hinting: core infrastructure
  virtio-balloon: page_hinting: reporting to the host

 drivers/virtio/Kconfig              |   1 +
 drivers/virtio/virtio_balloon.c     |  91 +++++++++-
 include/linux/page_hinting.h        |  45 +++++
 include/uapi/linux/virtio_balloon.h |  11 ++
 mm/Kconfig                          |   6 +
 mm/Makefile                         |   1 +
 mm/page_alloc.c                     |  18 +-
 mm/page_hinting.c                   | 250 ++++++++++++++++++++++++++++
 8 files changed, 414 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/page_hinting.h
 create mode 100644 mm/page_hinting.c

--