The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables guests to rapidly free memory to the host and reclaim it from the host.

Changelog in V7:
* The patch-series is moved back to RFC for the following reasons:
  * A guest with page hinting enabled has occasionally been observed to crash, followed by a segmentation fault in QEMU.
  * HYPERLIST_THRESHOLD is changed to 1 to cover scenarios where hinting is required for just one hyperlist entry. This will be replaced by a better approach in the upcoming patch-series.
* Virtio interface changes are picked up from Wei's patch-set for Virtio-balloon enhancement[2]. (Wei, how would you like me to credit you in the final patch?)

Test results on a single core:

1. Swap test case results:
The intent of this test case is to show that with this patch series, as the host runs out of memory, it can dynamically reclaim the memory freed by the guest for its own use. I have been going through Wei's patch-series and it may not solve such use cases.
Following are two results showing that, without page hinting, swap space is used as the host runs out of memory:

i) Host memory before running the guest
              total        used        free      shared  buff/cache   available
Mem:            11G        2.3G        8.0G        841M        1.1G        8.1G
Swap:          3.0G          0B        3.0G
Host memory after running the guest and exhaustion of memory
Mem:            11G         10G        132M        274M        537M         82M
Swap:          3.0G        1.0G        2.0G

ii) Host memory before running the guest
              total        used        free      shared  buff/cache   available
Mem:            11G        2.2G        8.0G        862M        1.2G        8.1G
Swap:          3.0G          0B        3.0G
Host memory after running the guest and exhaustion of memory
Mem:            11G         10G        126M        719M        1.0G         99M
Swap:          3.0G        939M        2.1G

Following are two results showing that, with page hinting, memory freed by the guest is used instead of swap space as the host runs out of memory:

i) Host memory before running the guest
              total        used        free      shared  buff/cache   available
Mem:            11G        2.2G        8.1G        827M        1.1G        8.2G
Swap:          3.0G          0B        3.0G
Host memory after running the guest and exhaustion of memory
Mem:            11G         10G        191M        851M        1.2G        2.6G
Swap:          3.0G          0B        3.0G

ii) Host memory before running the guest
              total        used        free      shared  buff/cache   available
Mem:            11G        2.2G        8.0G        836M        1.2G        8.1G
Swap:          3.0G          0B        3.0G
Host memory after running the guest and exhaustion of memory
Mem:            11G        9.8G        167M        853M        1.5G        2.5G
Swap:          3.0G          0B        3.0G

2. Netperf:
Netperf and hackbench are used to analyze the impact of this series on guest throughput under these loads.

Recv     Send     Send
Socket   Socket   Message   Elapsed
Size     Size     Size      Time      Throughput
bytes    bytes    bytes     secs.     10^6 bits/sec

Without Hinting
i)   87380    16384    16384     100       23130.92
ii)  87380    16384    16384     100       26114.51
iii) 87380    16384    16384     100       22495.60

With Hinting
i)   87380    16384    16384     100       20228.11
ii)  87380    16384    16384     100       25689.46
iii) 87380    16384    16384     100       19967.03

3. Hackbench:
Number of processes = 150

Without Hinting
time:
i)   10.208
ii)  9.879
iii) 9.404

With Hinting
time:
i)   11.292
ii)  11.057
iii) 10.688

Explanation:
* To observe the swap space usage with and without guest page hinting, a guest with 6GB of memory is booted.
After that, 4GB of memory is allocated with malloc and freed in the guest. Without guest page hinting, this memory is never returned to the host, so as the host runs more processes or allocates more memory, it ends up using swap space. With guest page hinting enabled, however, the memory freed by the guest is reclaimed by the host, so when the host runs out of memory it can use that memory instead of swap space.

* This patch series enables the guest to prepare a list of free pages, which is sent to the host via a hypercall. The patch-set leverages the existing arch_free_page() and arch_alloc_page() to add this functionality. It uses two lists, one cpu-local and the other cpu-global. Whenever a page is freed, it is added to the respective cpu-local list until that list is full. Once the list is full, a seqlock is taken to prevent any further page allocations, and the cpu-local list is traversed to check for any fragmentation due to reallocations. If any is present, those entries are defragmented and added to the cpu-global list until it is full. Once the cpu-global list is full, it is parsed and compressed. A hypercall is made only if the total number of entries is above the specified threshold value. Frequent hypercalls may hurt performance, so they need to be minimized. This is the primary reason for compression: it replaces multiple consecutive entries with a single one and removes all duplicate entries, which otherwise cause frequent exhaustion of the cpu-global list. After compressing the hyperlist, there are three possibilities:
  * If the number of entries in the cpu-global list is greater than the threshold required for a hypercall, a hypercall is issued.
  * If the parsing of the cpu-local list is complete but the number of cpu-global list entries is less than the threshold, they are copied to a cpu-local list.
  * If the parsing of the cpu-local list is not yet complete and the number of entries in the cpu-global list is less than the threshold, parsing of the cpu-local list continues, and new entries are added to the cpu-global list starting from the index made available by compression.

[1] https://www.spinics.net/lists/kvm/msg159790.html
[2] https://www.spinics.net/lists/kvm/msg152734.html
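
As an illustration of the compression and threshold steps described in the explanation above, here is a minimal user-space sketch. It is not the code from this series: struct hint_entry, compress_list(), next_action() and the enum values are hypothetical names chosen for the example, and each entry is assumed to be a (start pfn, page count) pair. The text does not say what happens when the entry count equals the threshold; the sketch assumes that reaching the threshold is enough, which matches a threshold of 1 covering a single entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical entry: `pages` contiguous free pages starting at `pfn`. */
struct hint_entry {
	unsigned long pfn;
	unsigned long pages;
};

/* The three outcomes after compression (names are illustrative only). */
enum hint_action {
	HINT_HYPERCALL,		/* enough entries: report them to the host */
	HINT_COPY_TO_LOCAL,	/* local list fully parsed: keep leftovers */
	HINT_CONTINUE_PARSE	/* keep filling the list from the new index */
};

static int cmp_pfn(const void *a, const void *b)
{
	const struct hint_entry *x = a, *y = b;

	if (x->pfn < y->pfn)
		return -1;
	if (x->pfn > y->pfn)
		return 1;
	return 0;
}

/*
 * Sort entries by pfn, then fold duplicate, overlapping and directly
 * adjacent entries into a single entry. Returns the new entry count,
 * which is also the next free index in the list.
 */
size_t compress_list(struct hint_entry *list, size_t n)
{
	size_t out = 0;

	assert(list != NULL || n == 0);
	if (n == 0)
		return 0;

	qsort(list, n, sizeof(*list), cmp_pfn);

	for (size_t i = 1; i < n; i++) {
		unsigned long end = list[out].pfn + list[out].pages;

		if (list[i].pfn <= end) {
			/* Same or neighbouring range: extend the current run. */
			unsigned long new_end = list[i].pfn + list[i].pages;

			if (new_end > end)
				list[out].pages = new_end - list[out].pfn;
		} else {
			list[++out] = list[i];
		}
	}
	return out + 1;
}

/* Assumption: reaching the threshold (>=) is enough to issue a hypercall. */
enum hint_action next_action(size_t entries, size_t threshold,
			     int local_parsed)
{
	if (entries >= threshold)
		return HINT_HYPERCALL;
	return local_parsed ? HINT_COPY_TO_LOCAL : HINT_CONTINUE_PARSE;
}
```

With the entries (10,1), (11,1), (10,1), (20,2), (12,1) as input, compress_list() leaves two entries, (10,3) and (20,2). The returned count is the index from which new entries would be added when parsing of the cpu-local list continues.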