[RFC PATCH] balloon: add automatic ballooning support

Luiz Capitulino <lcapitulino@xxxxxxxxxx> · Thu, 16 Jan 2014 11:25:37 -0500

The current balloon device has an important drawback: it's
entirely manual. This largely limits the feature when used
to manage memory on a memory over-committed host.

In order to make the balloon device really useful for
memory-overcommit setups, we have to make it automatic.
This is what this patch is about.

The general idea is simple. When the host is under pressure
it notifies QEMU. QEMU then asks the guest to inflate its
balloon by some amount of pages. When the guest is under
pressure, it also notifies QEMU. QEMU then asks the guest
to deflate its balloon by some amount of pages.

The rest of this commit message is divided into the following
sections: Usage, Algorithm, FIXMEs/TODOs, Testing and some
code at the end.

Usage
=====

This series depends on a kernel series for the virtio-balloon
driver called "add pressure notification via a new virtqueue".
You have to apply that series in your guest kernel to play
with automatic ballooning.

Then, on the QEMU side you can enable automatic ballooning with
the following command-line:

 $ qemu [...] -device virtio-balloon,automatic=true

Algorithm
=========

On host pressure:

 1. On boot QEMU registers for Linux kernel's vmpressure
    event "low". This event is sent by the kernel when it
    has started reclaiming memory. For more details, please
    read Documentation/cgroups/memory.txt in the kernel's
    source

 2. When QEMU is notified of host pressure, it first checks
    whether the guest is currently under pressure. If it is,
    the event is skipped. Otherwise, QEMU asks the guest to
    inflate its balloon (by 32MB by default)

    NOTE: QEMU will increase num_pages whenever an event
          is received and the guest is not under pressure.
          This means that if QEMU receives 10 events in
          a row, num_pages will have grown by 320MB.
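
Step 1 uses the memcg notification interface: an eventfd is
armed by writing "<event fd> <fd of memory.pressure_level> <level>"
to cgroup.event_control. Below is a minimal sketch of how that
registration line is composed. The helper name
format_vmpressure_registration() is hypothetical and not part of
this patch; it only builds the string and does not touch any
cgroup file.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patch): build the line that is
 * written to cgroup.event_control to arm a vmpressure notification.
 *
 * efd   - eventfd that the kernel will signal
 * lfd   - open fd of memory.pressure_level
 * level - pressure level name, e.g. "low"
 *
 * Returns 0 on success, -1 if the buffer is too small or snprintf() fails.
 */
int format_vmpressure_registration(char *buf, size_t len,
                                   int efd, int lfd, const char *level)
{
    int n = snprintf(buf, len, "%d %d %s", efd, lfd, level);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With made-up descriptors efd=5 and lfd=7, the buffer would hold
"5 7 low", which is the form register_vmpressure() in the patch
writes to the control fd.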

On guest pressure:

 1. QEMU is notified by the virtio-balloon driver in the
    guest (via message virtqueue) that the guest is under
    pressure

 2. QEMU checks if there's an inflate going on. If true,
    QEMU resets num_pages to the current balloon value so
    that the guest stops inflating (IOW, QEMU cancels the
    current inflation) and returns

 3. If there's no on-going inflate, QEMU asks the guest
    to deflate (32MB by default)

 4. Every time a guest pressure notification is received,
    QEMU sets a hysteresis period of 60 seconds. During
    this period the guest is defined to be under pressure
    (and inflates will be ignored)
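
The two event paths above can be sketched as a small standalone
model. This is not the patch code: the struct and function names
are hypothetical, time is passed in explicitly so the logic can be
exercised without a guest, and the deflate step is clamped at zero,
which is an assumption the patch itself does not make.

```c
#include <stdint.h>
#include <stdbool.h>

#define PFN_SHIFT       12                                  /* 4KB balloon pages */
#define STEP_PAGES      ((32u * 1024 * 1024) >> PFN_SHIFT)  /* 32MB = 8192 pages */
#define PRESSURE_PERIOD 60                                  /* hysteresis, seconds */

/* Hypothetical model of the state the algorithm needs. */
struct autob_model {
    uint32_t target;             /* pages the host wants ballooned (num_pages) */
    uint32_t current;            /* pages the guest has ballooned so far */
    int64_t last_guest_pressure; /* time of the last guest notification */
};

/* The guest counts as under pressure for PRESSURE_PERIOD seconds. */
bool model_guest_in_pressure(const struct autob_model *m, int64_t now)
{
    return now - m->last_guest_pressure <= PRESSURE_PERIOD;
}

/* Host pressure: inflate by one step unless the guest is under pressure. */
void model_host_pressure(struct autob_model *m, int64_t now)
{
    if (model_guest_in_pressure(m, now)) {
        return; /* event skipped */
    }
    m->target += STEP_PAGES;
}

/* Guest pressure: cancel an on-going inflate, otherwise deflate one step. */
void model_guest_pressure(struct autob_model *m, int64_t now)
{
    m->last_guest_pressure = now; /* (re)start the hysteresis period */

    if (m->target > m->current) {
        m->target = m->current;   /* cancel on-going inflation */
    } else if (m->current != 0) {
        /* Clamping at zero is an assumption of this sketch. */
        m->target -= (m->target < STEP_PAGES) ? m->target : STEP_PAGES;
    }
}
```

Starting from an empty balloon, ten calls to model_host_pressure()
leave target at 81920 pages, which is the 320MB from the NOTE in
the host pressure section.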

FIXMEs/TODOs
============

 - The number of pages to inflate/deflate and the memcg path
   are hardcoded. Will add command-line options for them

 - The default value of 32MB for inflates/deflates is what
   worked for me in my very specific test-case. This is
   probably not a good default, but I don't know how to
   define a good one

 - QEMU registers for vmpressure's level "low" notification.
   The guest too will notify QEMU on "low" pressure in the
   guest. The "low" notification is sent whenever the kernel
   has started reclaiming memory. On the guest side this means
   that it will only give free memory to the host. On the host
   side this means that a host with lots of large freeable
   caches will be considered as being in pressure.

   There are two ways to solve this:

    1. Register for "medium" pressure instead of low. This
       solves the problem above but it adds a different
       one: medium is sent when the kernel has started to
       swap, so it's a bit too late

    2. Add a new event to vmpressure which is between
       low and medium. The perfect event would be triggered
       before waking up kswapd

 - It would be nice (required?) to be able to dynamically
   enable/disable automatic ballooning. With this patch you
   enable it for the lifetime of the VM

 - I think manual ballooning should be disabled when
   automatic ballooning is enabled, but this is not done
   yet

 - This patch probably doesn't build on Windows

Testing
=======

Testing is by far the most difficult aspect of this project for
me. It's been hard to find a good way to measure this work. So
take this with a grain of salt.

This is my test-case: a 2G host runs two VMs (guest A and
guest B), each with 1.3G of memory. When the VMs are fully
booted (but idle) the host has around 1.2G of free memory.
Then the VMs do the following:

 1. Guest A runs ebizzy five times in a row, with a chunk
    size of 1MB and the following number of chunks:
    1024, 824, 624, 424, 224. IOW, the memory usage of
    this VM is going down. Let's call it "vm-down"

 2. Guest B runs ebizzy in similar manner, but it runs
    ebizzy with the following number of chunks:
    224, 424, 624, 824, 1024. IOW, the memory usage of
    this VM is going up. Let's call it "vm-up"

Also, each ebizzy run takes 60 seconds. And the vm-up one
waits 60 seconds before running ebizzy for the first time.
This gives vm-down time to consume most of the host's memory
and release it.

Here are the results. This is an average of three runs. We
measure host swap I/O, QEMU as a host process and perf.
info from the guest. Units:

 - swap in/out: number of pages swapped
 - Elapsed, user, sys: seconds
 - total recs: total number of ebizzy records/s. This is
               a sum of all ebizzy runs for a VM

vanilla
=======

Host
----

swap in:  36478.66
swap out: 372551.0

QEMU (as a process in the host)
-------------------------------

         Elapsed  user    sys   CPU%  major f.  minor f.   total recs  swap in  swap out
vm-down: 395.42   309.60  3.72  79    2772.66   120046.66  4692.33     0        0
vm-up:   396.40   310     4.04  79    2053.66   208394.33  4684        0        0

Guest (ebizzy run in the guest)
-------------------------------

         total recs  swap in  swap out
vm-down: 4692.33     0        0
vm-up:   4684        0        0

automatic balloon
=================

Host
----

swap in: 2.66
swap out: 8225.33

QEMU (as a process in the host)
-------------------------------

         Elapsed  user    sys   CPU%  major f.  minor f.   total recs  swap in  swap out
vm-down: 387.95   309.66  3.43  80    106.66    29497.33   4710.66     0        0
vm-up:   388.79   310.98  4.35  81    63.66     110307     4704.33     2.67     822.66

Guest (ebizzy run in the guest)
-------------------------------

         total recs  swap in  swap out
vm-down: 4710.66     0        0
vm-up:   4704.33     2.67     822.66

Some conclusions:

 - The number of pages swapped in the host and the number of
   QEMU's major faults is hugely reduced by automatic balloon

 - Elapsed time is also better for the automatic balloon VMs:
   vm-down's run time is 1.89% lower and vm-up's is 1.92% lower

 - The records/s is about the same for both, which I guess means
   automatic balloon is not regressing this

 - vm-up did swap a bit, not sure if this is a problem
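
The percentages above follow directly from the tables. As a
worked check, a tiny helper (the name percent_lower() is made up
for this illustration) computes the relative reduction between a
vanilla value and an automatic balloon value:

```c
/*
 * Relative reduction, in percent, when going from 'before' (vanilla)
 * to 'after' (automatic balloon). Hypothetical helper, for illustration.
 */
double percent_lower(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

percent_lower(395.42, 387.95) is roughly 1.89 (vm-down elapsed
time) and percent_lower(396.40, 388.79) is roughly 1.92 (vm-up
elapsed time). Applied to host swap out, percent_lower(372551.0,
8225.33) is roughly 97.8.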

Now the code, and I think I deserve a coffee after having written
all this stuff...

Signed-off-by: Luiz Capitulino <lcapitulino@xxxxxxxxxx>
---
 hw/virtio/virtio-balloon.c         | 180 +++++++++++++++++++++++++++++++++++++
 hw/virtio/virtio-pci.c             |   5 ++
 hw/virtio/virtio-pci.h             |   2 +
 include/hw/virtio/virtio-balloon.h |  21 ++++-
 4 files changed, 207 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index d9754db..3b3b6d2 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -31,6 +31,139 @@
 
 #include "hw/virtio/virtio-bus.h"
 
+#define LINUX_MEMCG_DEF_PATH "/sys/fs/cgroup/memory"
+#define AUTO_BALLOON_NR_PAGES ((32 * 1024 * 1024) >> VIRTIO_BALLOON_PFN_SHIFT)
+#define AUTO_BALLOON_PRESSURE_PERIOD 60
+
+void virtio_balloon_set_conf(DeviceState *dev, const VirtIOBalloonConf *bconf)
+{
+    VirtIOBalloon *s = VIRTIO_BALLOON(dev);
+    memcpy(&(s->bconf), bconf, sizeof(struct VirtIOBalloonConf));
+}
+
+static bool auto_balloon_enabled_cmdline(const VirtIOBalloon *s)
+{
+    return s->bconf.auto_balloon_enabled;
+}
+
+static bool guest_in_pressure(const VirtIOBalloon *s)
+{
+    time_t t = s->autob_last_guest_pressure;
+    return difftime(time(NULL), t) <= AUTO_BALLOON_PRESSURE_PERIOD;
+}
+
+static void inflate_guest(VirtIOBalloon *s)
+{
+    if (guest_in_pressure(s)) {
+        return;
+    }
+
+    s->num_pages += AUTO_BALLOON_NR_PAGES;
+    virtio_notify_config(VIRTIO_DEVICE(s));
+}
+
+static void deflate_guest(VirtIOBalloon *s)
+{
+    if (!s->autob_cur_size) {
+        return;
+    }
+
+    s->num_pages -= AUTO_BALLOON_NR_PAGES;
+    virtio_notify_config(VIRTIO_DEVICE(s));
+}
+
+static void virtio_balloon_handle_host_pressure(EventNotifier *ev)
+{
+    VirtIOBalloon *s = container_of(ev, VirtIOBalloon, event);
+
+    if (!event_notifier_test_and_clear(ev)) {
+        fprintf(stderr, "virtio-balloon: failed to drain the notify pipe\n");
+        return;
+    }
+
+    inflate_guest(s);
+}
+
+static void register_vmpressure(int cfd, int efd, int lfd, Error **errp)
+{
+    char *p;
+    ssize_t ret;
+
+    p = g_strdup_printf("%d %d low",  efd, lfd);
+    ret = write(cfd, p, strlen(p));
+    if (ret < 0) {
+        error_setg_errno(errp, errno, "failed to write to control fd: %d", cfd);
+    } else {
+        g_assert(ret == strlen(p)); /* XXX: this should be always true, right? */
+    }
+
+    g_free(p);
+}
+
+static int open_file_in_dir(const char *dir_path, const char *file, mode_t mode,
+                            Error **errp)
+{
+    char *p;
+    int fd;
+
+    p = g_strjoin("/", dir_path, file, NULL);
+    fd = qemu_open(p, mode);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "can't open '%s'", p);
+    }
+
+    g_free(p);
+    return fd;
+}
+
+static void automatic_balloon_init(VirtIOBalloon *s, const char *memcg_path,
+                                   Error **errp)
+{
+    Error *local_err = NULL;
+    int ret;
+
+    if (!memcg_path) {
+        memcg_path = LINUX_MEMCG_DEF_PATH;
+    }
+
+    s->lfd = open_file_in_dir(memcg_path, "memory.pressure_level", O_RDONLY,
+                              &local_err);
+    if (local_err) {
+        goto out;
+    }
+
+    s->cfd = open_file_in_dir(memcg_path, "cgroup.event_control", O_WRONLY,
+                              &local_err);
+    if (local_err) {
+        close(s->lfd);
+        goto out;
+    }
+
+    ret = event_notifier_init(&s->event, false);
+    if (ret < 0) {
+        error_setg_errno(&local_err, -ret, "failed to create event notifier");
+        goto out_err;
+    }
+
+    s->autob_last_guest_pressure = time(NULL) - (AUTO_BALLOON_PRESSURE_PERIOD+1);
+    event_notifier_set_handler(&s->event, virtio_balloon_handle_host_pressure);
+
+    register_vmpressure(s->cfd, event_notifier_get_fd(&s->event), s->lfd,
+                            &local_err);
+    if (local_err) {
+        event_notifier_cleanup(&s->event);
+        goto out_err;
+    }
+
+    return;
+
+out_err:
+    close(s->lfd);
+    close(s->cfd);
+out:
+    error_propagate(errp, local_err);
+}
+
 static void balloon_page(void *addr, int deflate)
 {
 #if defined(__linux__)
@@ -178,6 +311,34 @@ static void balloon_stats_set_poll_interval(Object *obj, struct Visitor *v,
     balloon_stats_change_timer(s, 0);
 }
 
+static void virtio_balloon_handle_msg(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
+    VirtQueueElement elem;
+
+    while (virtqueue_pop(vq, &elem)) {
+        size_t offset = 0;
+        uint32_t msg;
+
+        while (iov_to_buf(elem.out_sg, elem.out_num, offset, &msg, 4) == 4) {
+            offset += 4;
+            msg = ldl_p(&msg);
+
+            if (msg == VIRTIO_BALLOON_MSG_PRESSURE) {
+                dev->autob_last_guest_pressure = time(NULL);
+                if (dev->num_pages > dev->autob_cur_size) {
+                    /* cancel on-going inflation */
+                    dev->num_pages = dev->autob_cur_size;
+                } else {
+                    deflate_guest(dev);
+                }
+            }
+        }
+        virtqueue_push(vq, &elem, offset);
+        virtio_notify(vdev, vq);
+    }
+}
+
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -206,6 +367,12 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
                          !!(vq == s->dvq));
             memory_region_unref(section.mr);
+
+            if (vq == s->ivq) {
+                s->autob_cur_size++;
+            } else {
+                s->autob_cur_size--;
+            }
         }
 
         virtqueue_push(vq, &elem, offset);
@@ -283,6 +450,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
 static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f)
 {
     f |= (1 << VIRTIO_BALLOON_F_STATS_VQ);
+    f |= (1 << VIRTIO_BALLOON_F_MESSAGE_VQ);
+
     return f;
 }
 
@@ -341,10 +510,20 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBalloon *s = VIRTIO_BALLOON(dev);
+    Error *local_err = NULL;
     int ret;
 
     virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON, 8);
 
+    if (auto_balloon_enabled_cmdline(s)) {
+        automatic_balloon_init(s, NULL /* default root memcg path */, &local_err);
+        if (local_err) {
+            virtio_cleanup(VIRTIO_DEVICE(s));
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+
     ret = qemu_add_balloon_handler(virtio_balloon_to_target,
                                    virtio_balloon_stat, s);
 
@@ -357,6 +536,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+    s->mvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_msg);
 
     register_savevm(dev, "virtio-balloon", -1, 1,
                     virtio_balloon_save, virtio_balloon_load, s);
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 30c9f2b..2ff4c09 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -1276,6 +1276,9 @@ static void balloon_pci_stats_set_poll_interval(Object *obj, struct Visitor *v,
 static Property virtio_balloon_pci_properties[] = {
     DEFINE_VIRTIO_COMMON_FEATURES(VirtIOPCIProxy, host_features),
     DEFINE_PROP_HEX32("class", VirtIOPCIProxy, class_code, 0),
+#ifdef __linux__
+    DEFINE_PROP_BIT("automatic", VirtIOBalloonPCI, bconf.auto_balloon_enabled, 0, false),
+#endif
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1289,6 +1292,8 @@ static int virtio_balloon_pci_init(VirtIOPCIProxy *vpci_dev)
         vpci_dev->class_code = PCI_CLASS_OTHERS;
     }
 
+    virtio_balloon_set_conf(vdev, &(dev->bconf));
+
     qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
     if (qdev_init(vdev) < 0) {
         return -1;
diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
index dc332ae..b430f68 100644
--- a/hw/virtio/virtio-pci.h
+++ b/hw/virtio/virtio-pci.h
@@ -144,6 +144,7 @@ struct VirtIOBlkPCI {
 struct VirtIOBalloonPCI {
     VirtIOPCIProxy parent_obj;
     VirtIOBalloon vdev;
+    VirtIOBalloonConf bconf;
 };
 
 /*
@@ -156,6 +157,7 @@ struct VirtIOBalloonPCI {
 struct VirtIOSerialPCI {
     VirtIOPCIProxy parent_obj;
     VirtIOSerial vdev;
+    VirtIOBalloonConf bconf;
 };
 
 /*
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index f863bfe..37511ad 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -30,10 +30,19 @@
 /* The feature bitmap for virtio balloon */
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ 1       /* Memory stats virtqueue */
+#define VIRTIO_BALLOON_F_MESSAGE_VQ 2     /* Message virtqueue */
+
+/* Messages supported by the message virtqueue */
+#define VIRTIO_BALLOON_MSG_PRESSURE 1
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+typedef struct VirtIOBalloonConf
+{
+    uint32_t auto_balloon_enabled;
+} VirtIOBalloonConf;
+
 struct virtio_balloon_config
 {
     /* Number of pages host wants Guest to give up. */
@@ -58,7 +67,7 @@ typedef struct VirtIOBalloonStat {
 
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
-    VirtQueue *ivq, *dvq, *svq;
+    VirtQueue *ivq, *dvq, *svq, *mvq;
     uint32_t num_pages;
     uint32_t actual;
     uint64_t stats[VIRTIO_BALLOON_S_NR];
@@ -67,6 +76,16 @@ typedef struct VirtIOBalloon {
     QEMUTimer *stats_timer;
     int64_t stats_last_update;
     int64_t stats_poll_interval;
+
+    /* automatic ballooning */
+    int cfd;
+    int lfd;
+    EventNotifier event;
+    uint32_t autob_cur_size;
+    time_t autob_last_guest_pressure;
+    VirtIOBalloonConf bconf;
 } VirtIOBalloon;
 
+void virtio_balloon_set_conf(DeviceState *dev, const VirtIOBalloonConf *bconf);
+
 #endif
-- 
1.8.1.4
