[RFC][Patch V7 0/7] KVM: Guest Page Hinting

nilal@xxxxxxxxxx · Mon, 11 Jun 2018 11:18:55 -0400

The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables guests to rapidly free memory to the host and to reclaim memory from it.

Changelog in V7:

    * The patch-series is moved back to RFC for the following reasons:
        * An issue has occasionally been observed in which a guest with page hinting enabled crashes, followed by a segmentation fault in QEMU.
    * The HYPERLIST_THRESHOLD is changed to 1 to incorporate scenarios where hinting is required for just one hyperlist entry. This will be replaced by a better approach in the upcoming patch-series.

Virtio interface changes are picked up from Wei's patch-set for Virtio-balloon enhancement[2]. (Wei, how would you like me to credit you in the final patch?)

Test results on a single core:

    1. Swap test case results:

        The intent of this test case is to show that with this patch series, as the host runs out of memory it can dynamically reclaim guest-freed memory for its own use. I have been going through
        Wei's patch-series and it may not solve such use cases.
        Following are two results which show that without page hinting, swap space is used as the host runs out of memory:

        i)
        Host memory before running the guest
                      total        used        free      shared  buff/cache   available
        Mem:            11G        2.3G        8.0G        841M        1.1G        8.1G
        Swap:          3.0G          0B        3.0G
        Host memory after running the guest and exhaustion of memory
        Mem:            11G         10G        132M        274M        537M         82M
        Swap:          3.0G        1.0G        2.0G

        ii)
        Host memory before running the guest
                      total        used        free      shared  buff/cache   available
        Mem:            11G        2.2G        8.0G        862M        1.2G        8.1G
        Swap:          3.0G          0B        3.0G
        Host memory after running the guest and exhaustion of memory
        Mem:            11G         10G        126M        719M        1.0G         99M
        Swap:          3.0G        939M        2.1G

        Following are two results which show that with page hinting, guest-freed memory is used instead of swap space as the host runs out of memory:

        i)
        Host memory before running the guest
                      total        used        free      shared  buff/cache   available
        Mem:            11G        2.2G        8.1G        827M        1.1G        8.2G
        Swap:          3.0G          0B        3.0G
        Host memory after running the guest and exhaustion of memory
        Mem:            11G         10G        191M        851M        1.2G        2.6G
        Swap:          3.0G          0B        3.0G

        ii)
        Host memory before running the guest
                      total        used        free      shared  buff/cache   available
        Mem:            11G        2.2G        8.0G        836M        1.2G        8.1G
        Swap:          3.0G          0B        3.0G
        Host memory after running the guest and exhaustion of memory
        Mem:            11G        9.8G        167M        853M        1.5G        2.5G
        Swap:          3.0G          0B        3.0G

    2. Netperf:
        Netperf and hackbench are used to analyze the impact of this series on guest throughput under these loads.

                          Recv Socket   Send Socket   Send Message   Elapsed   Throughput
                          Size          Size          Size           Time
                          bytes         bytes         bytes          secs.     10^6 bits/sec
        Without Hinting
                 i)       87380         16384         16384          100       23130.92
                 ii)      87380         16384         16384          100       26114.51
                 iii)     87380         16384         16384          100       22495.60

        With Hinting
                 i)       87380         16384         16384          100       20228.11
                 ii)      87380         16384         16384          100       25689.46
                 iii)     87380         16384         16384          100       19967.03

    3. Hackbench:
        Number of processes = 150
        Without Hinting time:
            i)   10.208
            ii)   9.879
            iii)  9.404

        With Hinting time:
            i)   11.292
            ii)  11.057
            iii) 10.688

Explanation:

    *To observe swap space usage with and without guest page hinting, a guest with 6 GB of memory is booted, after which 4 GB of memory is malloc'ed and freed in the guest. Without guest page hinting,
     this memory is never returned to the host, so as the host runs more processes or allocates more memory it exhausts its own memory and starts using swap space. However, with guest page hinting
     enabled, the memory freed by the guest is reclaimed by the host, which can then use it instead of swap space when it runs out of memory.
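
The allocate-and-free step inside the guest can be approximated with a small user-space helper. This is only a sketch of such a test, not the program used for the numbers above; the function name alloc_touch_free() is hypothetical. It writes one byte per page so that the guest kernel actually backs the allocation before it is freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical test helper: allocate `bytes`, write one byte per page so
 * the memory is really backed by guest pages, then free it again.
 * Returns the number of pages touched, or -1 if the allocation failed.
 */
static long alloc_touch_free(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    long touched = 0;
    volatile char *buf;     /* volatile keeps the stores from being optimized away */
    size_t off;

    assert(page > 0);

    buf = malloc(bytes);
    if (!buf)
        return -1;

    for (off = 0; off < bytes; off += (size_t)page) {
        buf[off] = 1;
        touched++;
    }

    free((void *)buf);
    return touched;
}
```

For the scenario described above, this would be called in the 6 GB guest with a size of 4 GB, for example alloc_touch_free((size_t)4 << 30), while watching the output of free on the host.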

    *This patch series enables the guest to prepare a list of free pages which is sent to the host via a hypercall. The patch-set leverages the existing arch_free_page() and arch_alloc_page() to add this
     functionality. It uses two lists, one cpu-local and the other cpu-global. Whenever a page is freed it is added to the respective cpu-local list until that list is full. Once the list is full, a seqlock is taken to
     prevent any further page allocations, and the cpu-local list is traversed to check for any fragmentation due to reallocations. If present, those entries are defragmented and added to the
     cpu-global list until it is full. Once the cpu-global list is full it is parsed and compressed.
     A hypercall is made only if the total number of entries is above the specified threshold value. Frequent hypercalls may hurt performance, so they need to be minimized. This is the
     primary reason for compression: it replaces multiple consecutive entries with a single one and removes all duplicate entries, which would otherwise cause frequent exhaustion of the cpu-global list. After
     compressing the hyperlist there are three possibilities:
          *If the number of entries in the cpu-global list is greater than the threshold required for a hypercall, then a hypercall is issued.
          *If the parsing of the cpu-local list is complete but the number of cpu-global list entries is less than the threshold, then they are copied to a cpu-local list.
          *If the parsing of the cpu-local list is not yet complete and the number of entries in the cpu-global list is less than the threshold, then the parsing of the cpu-local list continues and
           entries are added to the cpu-global list starting from the newly available index obtained after compression.
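
The compression step and the three-way decision that follows it can be sketched in plain user-space C. This is an illustration under stated assumptions, not the code from the patches: the entry layout (struct hint_entry), the function names, and the sort-then-merge approach are all hypothetical, and a count equal to the threshold is assumed to trigger the hypercall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical entry: a run of `pages` contiguous free pages starting at `pfn`. */
struct hint_entry {
    unsigned long pfn;
    unsigned long pages;
};

static int cmp_pfn(const void *a, const void *b)
{
    const struct hint_entry *x = a, *y = b;

    if (x->pfn < y->pfn)
        return -1;
    return x->pfn > y->pfn;
}

/*
 * Sort by pfn, then fold duplicate, overlapping and directly consecutive
 * runs into one entry, in place. Returns the new number of entries.
 */
static size_t compress_list(struct hint_entry *list, size_t n)
{
    size_t out = 0, i;

    assert(list != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(list, n, sizeof(*list), cmp_pfn);
    for (i = 1; i < n; i++) {
        unsigned long end = list[out].pfn + list[out].pages;

        if (list[i].pfn <= end) {
            /* Duplicate, overlapping or adjacent: extend the current run. */
            unsigned long new_end = list[i].pfn + list[i].pages;

            if (new_end > end)
                list[out].pages = new_end - list[out].pfn;
        } else {
            list[++out] = list[i];
        }
    }
    return out + 1;
}

enum hint_action { HINT_HYPERCALL, HINT_COPY_BACK, HINT_CONTINUE };

/* The three possibilities after compression (">=" is an assumption). */
static enum hint_action next_action(size_t entries, size_t threshold,
                                    int local_scan_done)
{
    if (entries >= threshold)
        return HINT_HYPERCALL;
    return local_scan_done ? HINT_COPY_BACK : HINT_CONTINUE;
}
```

For example, the entries {10,2}, {12,3}, {10,2}, {40,1}, {20,4} compress to three entries: {10,5}, {20,4} and {40,1}. The freed slots at the tail of the array are the "newly available index" from which parsing of the cpu-local list would continue.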

[1] https://www.spinics.net/lists/kvm/msg159790.html
[2] https://www.spinics.net/lists/kvm/msg152734.html