On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for communicating free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
> Could you help us compare with Alex's series?
> What are the main differences?

Below are results comparing the benefits/performance of Alexander's
bubble-hinting v1 [1] and page-hinting (v10, including some of the changes
suggested upstream) against an unmodified kernel.

Test1 - Number of guests that can be launched without swap usage
Guest size: 5GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Process: Guests are launched sequentially; the next guest is started only
after an allocation program requesting 4GB has been run in the current one.
Results:
unmodified kernel: 2 guests without swap usage; the 3rd guest used 2.3GB of swap.
bubble-hinting v1: 4 guests without swap usage; the 5th guest used 1MB of swap.
page-hinting: 5 guests without swap usage; the 6th guest used 8MB of swap.

Test2 - Memhog execution time
Guest size: 6GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Process: 3 guests are launched and "time memhog 6G" is run in each of them
sequentially.
Results:
unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at the end: 3.6G)
bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the end: 0)
page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end: 0)

Test3 - Will-it-scale's page_fault1
Guest size: 6GB
Cores: 24
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)

unmodified kernel:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,459168,95.83,459315,95.83,459315
2,956272,91.68,884643,91.72,918630
3,1407811,87.53,1267948,87.69,1377945
4,1755744,83.39,1562471,83.73,1837260
5,2056741,79.24,1812309,80.00,2296575
6,2393759,75.09,2025719,77.02,2755890
7,2754403,70.95,2238180,73.72,3215205
8,2947493,66.81,2369686,70.37,3674520
9,3063579,62.68,2321148,68.84,4133835
10,3229023,58.54,2377596,65.84,4593150
11,3337665,54.40,2429818,64.01,5052465
12,3255140,50.28,2395070,61.63,5511780
13,3260721,46.11,2402644,59.77,5971095
14,3210590,42.02,2390806,57.46,6430410
15,3164811,37.88,2265352,51.39,6889725
16,3144764,33.77,2335028,54.07,7349040
17,3128839,29.63,2328662,49.52,7808355
18,3133344,25.50,2301181,48.01,8267670
19,3135979,21.38,2343003,43.66,8726985
20,3136448,17.27,2306109,40.81,9186300
21,3130324,13.16,2403688,35.84,9645615
22,3109883,9.04,2290808,36.24,10104930
23,3136805,4.94,2263818,35.43,10564245
24,3118949,0.78,2252891,31.03,11023560

bubble-hinting v1:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,292183,95.83,292428,95.83,292428
2,540606,91.67,501887,91.91,584856
3,821748,87.53,735244,88.31,877284
4,1033782,83.38,839925,85.59,1169712
5,1261352,79.25,896464,83.86,1462140
6,1459544,75.12,1050094,80.93,1754568
7,1686537,70.97,1112202,79.23,2046996
8,1866892,66.83,1083571,78.48,2339424
9,2056887,62.72,1101660,77.94,2631852
10,2252955,58.57,1097439,77.36,2924280
11,2413907,54.40,1088583,76.72,3216708
12,2596504,50.35,1117474,76.01,3509136
13,2715338,46.21,1087666,75.32,3801564
14,2861697,42.08,1084692,74.35,4093992
15,2964620,38.02,1087910,73.40,4386420
16,3065575,33.84,1099406,71.07,4678848
17,3107674,29.76,1056948,71.36,4971276
18,3144963,25.71,1094883,70.14,5263704
19,3173468,21.61,1073049,66.21,5556132
20,3173233,17.55,1072417,67.16,5848560
21,3209710,13.37,1079147,65.64,6140988
22,3182958,9.37,1085872,65.95,6433416
23,3200747,5.23,1076414,59.40,6725844
24,3181699,1.04,1051233,65.62,7018272

page-hinting:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,467693,95.83,467970,95.83,467970
2,967860,91.68,895883,91.70,935940
3,1408191,87.53,1279602,87.68,1403910
4,1766250,83.39,1557224,83.93,1871880
5,2124689,79.24,1834625,80.35,2339850
6,2413514,75.10,1989557,77.00,2807820
7,2644648,70.95,2158055,73.73,3275790
8,2896483,66.81,2305785,70.85,3743760
9,3157796,62.67,2304083,69.49,4211730
10,3251633,58.53,2379589,66.43,4679700
11,3313704,54.41,2349310,64.76,5147670
12,3285612,50.30,2362013,62.63,5615640
13,3207275,46.17,2377760,59.94,6083610
14,3221727,42.02,2416278,56.70,6551580
15,3194781,37.91,2334552,54.96,7019550
16,3211818,33.78,2399077,52.75,7487520
17,3172664,29.65,2337660,50.27,7955490
18,3177152,25.49,2349721,47.02,8423460
19,3149924,21.36,2319286,40.16,8891430
20,3166910,17.30,2279719,43.23,9359400
21,3159464,13.19,2342849,34.84,9827370
22,3167091,9.06,2285156,37.97,10295340
23,3174137,4.96,2365448,33.74,10763310
24,3161629,0.86,2253813,32.38,11231280

Test4 - Netperf
Guest size: 5GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Netserver: Running on core 0
Netperf: Running on core 1
Recv Socket Size bytes: 131072
Send Socket Size bytes: 16384
Send Message Size bytes: 1000000000
Time: 900s
Process: netperf is run 3 times sequentially in the same guest with the
inputs mentioned above, and the throughput (10^6 bits/sec) is observed.
Results:
unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07

Drawback with bubble-hinting: more invasive.
Drawback with page-hinting: an additional bitmap is required, including
growing/shrinking it on memory hotplug.

[1] https://lkml.org/lkml/2019/6/19/926

>> Measurement results (measurement details appended to this email):
>> * With active page hinting, 3 more guests of 5 GB each could be launched
>> (total 5 vs. 2) on a 15GB (single NUMA) system without swapping.
>> * With active page hinting, on a system with 15 GB of (single NUMA) memory
>> and 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially)
>> resulted in the last invocation needing only 37s, compared to 3m35s without
>> page hinting.
>>
>> This approach tracks all freed pages of order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages
>> are re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock.
>> A workqueue asynchronously processes the bitmaps as soon as a pre-defined
>> memory threshold is met, trying to isolate and report pages that are still
>> free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible
>> for sending batched pages to the host synchronously. Once the hypervisor
>> has processed the hinting request, the isolated pages are returned to the
>> buddy.
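To make the quoted description concrete, here is a minimal sketch of the
guest-side bookkeeping it describes. It is illustrative only: the layout and
every name in it (struct zone_hint_state, hint_mark_free(), hint_work_fn(),
and the threshold/batch constants) are hypothetical stand-ins, not the
identifiers used in the actual patches.

/*
 * Illustrative sketch only -- all names are hypothetical stand-ins,
 * not the actual patch code.
 */
#include <linux/bitmap.h>
#include <linux/mmzone.h>
#include <linux/workqueue.h>

#define HINT_ORDER	(MAX_ORDER - 2)	/* chunk size that keeps THPs intact */
#define HINT_BATCH	16		/* at most 16 pages isolated at once */

/* Hypothetical per-zone hinting state, protected by the zone lock. */
struct zone_hint_state {
	unsigned long *bitmap;		/* one bit per HINT_ORDER-sized chunk */
	unsigned long base_pfn;		/* first PFN covered by the bitmap */
	unsigned long free_chunks;	/* chunks marked since the last scan */
	unsigned long threshold;	/* pre-defined wakeup threshold */
	struct work_struct work;	/* async bitmap processing */
};

/* Hook called after buddy merging, with the zone lock already held. */
static void hint_mark_free(struct zone_hint_state *hs, unsigned long pfn)
{
	__set_bit((pfn - hs->base_pfn) >> HINT_ORDER, hs->bitmap);

	/* Kick the async worker once enough free memory has accumulated. */
	if (++hs->free_chunks >= hs->threshold)
		schedule_work(&hs->work);
}

static void hint_work_fn(struct work_struct *work)
{
	/*
	 * Scan the bitmap, isolate up to HINT_BATCH chunks that are still
	 * free, report them to the host synchronously via virtio-balloon,
	 * then return them to the buddy and clear their bits.
	 */
}

The drawbacks listed above fall out of this shape: the bitmap is extra
state that has to be sized per zone and grown/shrunk on memory hotplug,
while bubble-hinting avoids the bitmap by (roughly speaking) tracking
hinted pages inside the allocator itself, at the cost of being more
invasive to core mm.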
>>
>> The key changes made in this series compared to v9 [1] are:
>> * Pages are reported to the hypervisor only in chunks of "MAX_ORDER - 2"
>> so as not to break up THPs.
>> * Only a set of 16 pages can be isolated and reported to the host at a
>> time, to avoid any false OOMs.
>> * page_hinting.c is moved from virt/kvm/ to mm/, as the feature depends
>> on virtio and not on KVM itself. This enables any other hypervisor to use
>> this feature by implementing virtio devices.
>> * The sysctl variable is replaced with a virtio-balloon parameter to
>> enable/disable page-hinting.
>>
>> Pending items:
>> * Test device-assigned guests to ensure that hinting doesn't break them.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
>> * Compare reporting free pages via vring with vhost.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, and possible races
>> when disabling.
>> * Come up with proper/traceable error messages/logs.
>> * Minor reworks and simplifications (e.g., the virtio protocol).
>>
>> Benefit analysis:
>> 1. Use-case - Number of guests that can be launched without swap usage
>> NUMA Nodes = 1 with 15 GB memory
>> Guest Memory = 5 GB
>> Number of cores in guest = 1
>> Workload = test allocation program allocates 4GB memory, touches it via
>> memset, and exits.
>> Procedure =
>> The first guest is launched and, once its console is up, the test
>> allocation program is executed with a 4 GB memory request (due to this,
>> the guest occupies almost 4-5 GB of memory on the host in a system
>> without page hinting). Once this program exits, another guest is launched
>> on the host and the same process is followed. This is repeated until swap
>> comes into use.
>>
>> Results:
>> Without hinting = 3, swap usage at the end 1.1GB.
>> With hinting = 5, swap usage at the end 0.
>>
>> 2. Use-case - memhog execution time
>> Guest Memory = 6GB
>> Number of cores = 4
>> NUMA Nodes = 1 with 15 GB memory
>> Process: 3 guests are launched and the 'memhog 6G' execution time is
>> monitored one after the other in each of them.
>> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
>> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
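As an aside, the "test allocation program" in use-case 1 above is described
but not included in the posting; a minimal stand-in matching that
description (allocate 4GB, touch it via memset, exit) could look like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t size = 4UL << 30;	/* 4 GB */
	char *buf = malloc(size);

	if (!buf) {
		perror("malloc");
		return 1;
	}
	memset(buf, 1, size);		/* fault in every page */
	free(buf);			/* freed memory becomes hintable */
	return 0;
}

The memset forces the host to back the full 4GB; whether that memory is
handed back to the host once the program exits and the guest frees it is
what the guest counts above effectively measure.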
>>
>> Performance analysis:
>> 1. will-it-scale's page_fault1:
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Without Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,315890,95.82,317633,95.83,317633
>> 2,570810,91.67,531147,91.94,635266
>> 3,826491,87.54,713545,88.53,952899
>> 4,1087434,83.40,901215,85.30,1270532
>> 5,1277137,79.26,916442,83.74,1588165
>> 6,1503611,75.12,1113832,79.89,1905798
>> 7,1683750,70.99,1140629,78.33,2223431
>> 8,1893105,66.85,1157028,77.40,2541064
>> 9,2046516,62.50,1179445,76.48,2858697
>> 10,2291171,58.57,1209247,74.99,3176330
>> 11,2486198,54.47,1217265,75.13,3493963
>> 12,2656533,50.36,1193392,74.42,3811596
>> 13,2747951,46.21,1185540,73.45,4129229
>> 14,2965757,42.09,1161862,72.20,4446862
>> 15,3049128,37.97,1185923,72.12,4764495
>> 16,3150692,33.83,1163789,70.70,5082128
>> 17,3206023,29.70,1174217,70.11,5399761
>> 18,3211380,25.62,1179660,69.40,5717394
>> 19,3202031,21.44,1181259,67.28,6035027
>> 20,3218245,17.35,1196367,66.75,6352660
>> 21,3228576,13.26,1129561,66.74,6670293
>> 22,3207452,9.15,1166517,66.47,6987926
>> 23,3153800,5.09,1172877,61.57,7305559
>> 24,3184542,0.99,1186244,58.36,7623192
>>
>> With Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,306737,95.82,305130,95.78,306737
>> 2,573207,91.68,530453,91.92,613474
>> 3,810319,87.53,695281,88.58,920211
>> 4,1074116,83.40,880602,85.48,1226948
>> 5,1308283,79.26,1109257,81.23,1533685
>> 6,1501987,75.12,1093661,80.19,1840422
>> 7,1695300,70.99,1104207,79.03,2147159
>> 8,1901523,66.85,1193613,76.90,2453896
>> 9,2051288,62.73,1200913,76.22,2760633
>> 10,2275771,58.60,1192992,75.66,3067370
>> 11,2435016,54.48,1191472,74.66,3374107
>> 12,2623114,50.35,1196911,74.02,3680844
>> 13,2766071,46.22,1178589,73.02,3987581
>> 14,2932163,42.10,1166414,72.96,4294318
>> 15,3000853,37.96,1177177,72.62,4601055
>> 16,3113738,33.85,1165444,70.54,4907792
>> 17,3132135,29.77,1165055,68.51,5214529
>> 18,3175121,25.69,1166969,69.27,5521266
>> 19,3205490,21.61,1159310,65.65,5828003
>> 20,3220855,17.52,1171827,62.04,6134740
>> 21,3182568,13.48,1138918,65.05,6441477
>> 22,3130543,9.30,1128185,60.60,6748214
>> 23,3087426,5.15,1127912,55.36,7054951
>> 24,3099457,1.04,1176100,54.96,7361688
>>
>> [1] https://lkml.org/lkml/2019/3/6/413

-- 
Regards
Nitesh