The current balloon device has an important drawback: it's entirely manual.
This largely limits the feature when used to manage memory on a memory
over-committed host.

In order to make the balloon device really useful for memory-overcommit
setups, we have to make it automatic. This is what this patch is about.

The general idea is simple. When the host is under pressure it notifies QEMU.
QEMU then asks the guest to inflate its balloon by some number of pages. When
the guest is under pressure, it also notifies QEMU. QEMU then asks the guest
to deflate its balloon by some number of pages.

The rest of this commit message is divided into the following sections: Usage,
Algorithm, FIXMEs/TODOs, Testing and the code at the end.

Usage
=====

This series depends on a kernel series for the virtio-balloon driver called
"add pressure notification via a new virtqueue". You have to apply that series
in your guest kernel to play with automatic ballooning.

Then, on the QEMU side you can enable automatic ballooning with the following
command-line:

  $ qemu [...] -device virtio-balloon,automatic=true

Algorithm
=========

On host pressure:

 1. On boot QEMU registers for the Linux kernel's vmpressure event "low". This
    event is sent by the kernel when it has started reclaiming memory. For
    more details, please read Documentation/cgroups/memory.txt in the
    kernel's source

 2. When QEMU is notified of host pressure, it first checks if the guest is
    currently under pressure. If it is, the event is skipped. If the guest is
    not under pressure, QEMU asks the guest to inflate its balloon (32MB by
    default)

 NOTE: QEMU will update num_pages whenever an event is received and the guest
       is not under pressure. This means that if QEMU receives 10 events in a
       row, num_pages will have grown by 320MB (10 x 32MB).

On guest pressure:

 1. QEMU is notified by the virtio-balloon driver in the guest (via the
    message virtqueue) that the guest is under pressure

 2. QEMU checks if there's an inflate going on.
    If true, QEMU resets num_pages to the current balloon value so that the
    guest stops inflating (IOW, QEMU cancels the current inflation). QEMU
    then returns

 3. If there's no on-going inflate, QEMU asks the guest to deflate (32MB by
    default)

 4. Every time a guest pressure notification is received, QEMU sets a
    hysteresis period of 60 seconds. During this period the guest is defined
    to be under pressure (and inflates will be ignored)

FIXMEs/TODOs
============

 - The number of pages to inflate/deflate and the memcg path are hardcoded.
   Will add command-line options for them

 - The default value of 32MB for inflates/deflates is what worked for me in
   my very specific test-case. This is probably not a good default, but I
   don't know how to define a good one

 - QEMU registers for vmpressure's level "low" notification. The guest too
   will notify QEMU on "low" pressure in the guest. The "low" notification is
   sent whenever the kernel has started reclaiming memory. On the guest side
   this means that it will only give free memory to the host. On the host
   side this means that a host with lots of large freeable caches will be
   considered as being under pressure. There are two ways to solve this:

    1. Register for "medium" pressure instead of "low". This solves the
       problem above but it adds a different one: "medium" is sent when the
       kernel has started to swap, so it's a bit too late

    2. Add a new event to vmpressure which is between "low" and "medium". The
       perfect event would be triggered before waking up kswapd

 - It would be nice (required?) to be able to dynamically enable/disable
   automatic ballooning. With this patch you enable it for the lifetime of
   the VM

 - I think manual ballooning should be disabled when automatic ballooning is
   enabled, but this is not done yet

 - This patch probably doesn't build on Windows

Testing
=======

Testing is by far the most difficult aspect of this project to me. It's been
hard to find a good way to measure this work. So take this with a grain of
salt.
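
Before the numbers, here is a small standalone model of the decision logic
from the Algorithm section, which may help when reading the results. It is
only a sketch and not the patch's code: struct balloon_model and the
model_init(), on_host_pressure() and on_guest_pressure() helpers are
hypothetical names, time is passed in explicitly (in seconds, assumed to
start at 0) instead of calling time(), and, unlike the patch, the deflate
path clamps the target at zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical standalone model of the algorithm; this is not QEMU code. */
#define STEP_PAGES      8192u /* 32MB expressed in 4KB pages (32MB >> 12) */
#define PRESSURE_PERIOD 60    /* guest-pressure hysteresis, in seconds */

struct balloon_model {
    uint32_t target;              /* pages the host asks for (num_pages) */
    uint32_t current;             /* pages the guest has really ballooned */
    int64_t  last_guest_pressure; /* time of the last guest notification */
};

static void model_init(struct balloon_model *m)
{
    m->target = 0;
    m->current = 0;
    /* Start outside the hysteresis window (assumes time starts at 0). */
    m->last_guest_pressure = -(PRESSURE_PERIOD + 1);
}

static bool guest_under_pressure(const struct balloon_model *m, int64_t now)
{
    return now - m->last_guest_pressure <= PRESSURE_PERIOD;
}

/* Host pressure: grow the target unless the guest recently complained. */
static void on_host_pressure(struct balloon_model *m, int64_t now)
{
    if (guest_under_pressure(m, now)) {
        return;
    }
    m->target += STEP_PAGES;
}

/* Guest pressure: cancel an on-going inflate, otherwise ask for a deflate. */
static void on_guest_pressure(struct balloon_model *m, int64_t now)
{
    m->last_guest_pressure = now;
    if (m->target > m->current) {
        m->target = m->current;             /* cancel on-going inflation */
    } else if (m->current != 0) {
        /* Assumption: clamp at zero; the patch subtracts unconditionally. */
        m->target = m->target > STEP_PAGES ? m->target - STEP_PAGES : 0;
    }
}
```

Feeding ten host events into a fresh model raises the target by 10 x 8192
pages (320MB), which matches the NOTE in the Algorithm section. A guest event
then cancels the inflate, and host events are ignored for the next 60 seconds.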

This is my test-case: a 2G host runs two VMs (guest A and guest B), each with
1.3G of memory. When the VMs are fully booted (but idle) the host has around
1.2G of free memory. Then the VMs do the following:

 1. Guest A runs ebizzy five times in a row, with a chunk size of 1MB and
    the following number of chunks: 1024, 824, 624, 424, 224. IOW, the memory
    usage of this VM is going down. Let's call it "vm-down"

 2. Guest B runs ebizzy in a similar manner, but with the following number of
    chunks: 224, 424, 624, 824, 1024. IOW, the memory usage of this VM is
    going up. Let's call it "vm-up"

Also, each ebizzy run takes 60 seconds. And the vm-up one waits 60 seconds
before running ebizzy for the first time. This gives vm-down time to consume
most of the host's pressure and release it.

Here are the results. This is an average of three runs. We measure host swap
I/O, QEMU as a host process and performance information from the guest.

Units:

 - swap in/out: number of pages swapped
 - Elapsed, user, sys: seconds
 - total recs: total number of ebizzy records/s. This is a sum of all ebizzy
   runs for a VM

vanilla
=======

Host
----

swap in:  36478.66
swap out: 372551.0

QEMU (as a process in the host)
-------------------------------

         Elapsed  user    sys   CPU%  major f.  minor f.   total recs  swap in  swap out
vm-down: 395.42   309.60  3.72  79    2772.66   120046.66  4692.33     0        0
vm-up:   396.40   310     4.04  79    2053.66   208394.33  4684        0        0

Guest (ebizzy run in the guest)
-------------------------------

         total recs  swap in  swap out
vm-down: 4692.33     0        0
vm-up:   4684        0        0

automatic balloon
=================

Host
----

swap in:  2.66
swap out: 8225.33

QEMU (as a process in the host)
-------------------------------

         Elapsed  user    sys   CPU%  major f.  minor f.   total recs  swap in  swap out
vm-down: 387.95   309.66  3.43  80    106.66    29497.33   4710.66     0        0
vm-up:   388.79   310.98  4.35  81    63.66     110307     4704.33     2.67     822.66

Guest (ebizzy run in the guest)
-------------------------------

         total recs  swap in  swap out
vm-down: 4710.66     0        0
vm-up:   4704.33     2.67     822.66

Some conclusions:

 - The number of pages swapped in the host and the number of QEMU's major
   faults are hugely reduced by automatic balloon

 - Elapsed time is also better for the automatic balloon VMs: vm-down run
   time was 1.89% lower and vm-up 1.92% lower

 - The records/s is about the same for both, which I guess means automatic
   balloon is not regressing this

 - vm-up did swap a bit, not sure if this is a problem

Now the code, and I think I deserve a coffee after having written all this
stuff...

Signed-off-by: Luiz Capitulino <lcapitulino@xxxxxxxxxx>
---
 hw/virtio/virtio-balloon.c         | 180 +++++++++++++++++++++++++++++++++++++
 hw/virtio/virtio-pci.c             |   5 ++
 hw/virtio/virtio-pci.h             |   2 +
 include/hw/virtio/virtio-balloon.h |  21 ++++-
 4 files changed, 207 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index d9754db..3b3b6d2 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -31,6 +31,139 @@
 
 #include "hw/virtio/virtio-bus.h"
 
+#define LINUX_MEMCG_DEF_PATH "/sys/fs/cgroup/memory"
+#define AUTO_BALLOON_NR_PAGES ((32 * 1024 * 1024) >> VIRTIO_BALLOON_PFN_SHIFT)
+#define AUTO_BALLOON_PRESSURE_PERIOD 60
+
+void virtio_balloon_set_conf(DeviceState *dev, const VirtIOBalloonConf *bconf)
+{
+    VirtIOBalloon *s = VIRTIO_BALLOON(dev);
+    memcpy(&(s->bconf), bconf, sizeof(struct VirtIOBalloonConf));
+}
+
+static bool auto_balloon_enabled_cmdline(const VirtIOBalloon *s)
+{
+    return s->bconf.auto_balloon_enabled;
+}
+
+static bool guest_in_pressure(const VirtIOBalloon *s)
+{
+    time_t t = s->autob_last_guest_pressure;
+    return difftime(time(NULL), t) <= AUTO_BALLOON_PRESSURE_PERIOD;
+}
+
+static void inflate_guest(VirtIOBalloon *s)
+{
+    if (guest_in_pressure(s)) {
+        return;
+    }
+
+    s->num_pages += AUTO_BALLOON_NR_PAGES;
+    virtio_notify_config(VIRTIO_DEVICE(s));
+}
+
+static void deflate_guest(VirtIOBalloon *s)
+{
+    if (!s->autob_cur_size) {
+        return;
+    }
+
+    s->num_pages -= AUTO_BALLOON_NR_PAGES;
+    virtio_notify_config(VIRTIO_DEVICE(s));
+}
+
+static void virtio_balloon_handle_host_pressure(EventNotifier *ev)
+{
+    VirtIOBalloon *s = container_of(ev, VirtIOBalloon, event);
+
+    if (!event_notifier_test_and_clear(ev)) {
+        fprintf(stderr, "virtio-balloon: failed to drain the notify pipe\n");
+        return;
+    }
+
+    inflate_guest(s);
+}
+
+static void register_vmpressure(int cfd, int efd, int lfd, Error **errp)
+{
+    char *p;
+    ssize_t ret;
+
+    p = g_strdup_printf("%d %d low", efd, lfd);
+    ret = write(cfd, p, strlen(p));
+    if (ret < 0) {
+        error_setg_errno(errp, errno, "failed to write to control fd: %d", cfd);
+    } else {
+        g_assert(ret == strlen(p)); /* XXX: this should be always true, right? */
+    }
+
+    g_free(p);
+}
+
+static int open_file_in_dir(const char *dir_path, const char *file, mode_t mode,
+                            Error **errp)
+{
+    char *p;
+    int fd;
+
+    p = g_strjoin("/", dir_path, file, NULL);
+    fd = qemu_open(p, mode);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "can't open '%s'", p);
+    }
+
+    g_free(p);
+    return fd;
+}
+
+static void automatic_balloon_init(VirtIOBalloon *s, const char *memcg_path,
+                                   Error **errp)
+{
+    Error *local_err = NULL;
+    int ret;
+
+    if (!memcg_path) {
+        memcg_path = LINUX_MEMCG_DEF_PATH;
+    }
+
+    s->lfd = open_file_in_dir(memcg_path, "memory.pressure_level", O_RDONLY,
+                              &local_err);
+    if (local_err) {
+        goto out;
+    }
+
+    s->cfd = open_file_in_dir(memcg_path, "cgroup.event_control", O_WRONLY,
+                              &local_err);
+    if (local_err) {
+        close(s->lfd);
+        goto out;
+    }
+
+    ret = event_notifier_init(&s->event, false);
+    if (ret < 0) {
+        error_setg_errno(&local_err, -ret, "failed to create event notifier");
+        goto out_err;
+    }
+
+    s->autob_last_guest_pressure = time(NULL) - (AUTO_BALLOON_PRESSURE_PERIOD+1);
+    event_notifier_set_handler(&s->event, virtio_balloon_handle_host_pressure);
+
+    register_vmpressure(s->cfd, event_notifier_get_fd(&s->event), s->lfd,
+                        &local_err);
+    if (local_err) {
+        event_notifier_cleanup(&s->event);
+        goto out_err;
+    }
+
+    return;
+
+out_err:
+    close(s->lfd);
+    close(s->cfd);
+out:
+    error_propagate(errp, local_err);
+}
+
 static void balloon_page(void *addr, int deflate)
 {
 #if defined(__linux__)
@@ -178,6 +311,34 @@ static void balloon_stats_set_poll_interval(Object *obj, struct Visitor *v,
     balloon_stats_change_timer(s, 0);
 }
 
+static void virtio_balloon_handle_msg(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
+    VirtQueueElement elem;
+
+    while (virtqueue_pop(vq, &elem)) {
+        size_t offset = 0;
+        uint32_t msg;
+
+        while (iov_to_buf(elem.out_sg, elem.out_num, offset, &msg, 4) == 4) {
+            offset += 4;
+            msg = ldl_p(&msg);
+
+            if (msg == VIRTIO_BALLOON_MSG_PRESSURE) {
+                dev->autob_last_guest_pressure = time(NULL);
+                if (dev->num_pages > dev->autob_cur_size) {
+                    /* cancel on-going inflation */
+                    dev->num_pages = dev->autob_cur_size;
+                } else {
+                    deflate_guest(dev);
+                }
+            }
+        }
+        virtqueue_push(vq, &elem, offset);
+        virtio_notify(vdev, vq);
+    }
+}
+
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -206,6 +367,12 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
                          !!(vq == s->dvq));
             memory_region_unref(section.mr);
+
+            if (vq == s->ivq) {
+                s->autob_cur_size++;
+            } else {
+                s->autob_cur_size--;
+            }
         }
 
         virtqueue_push(vq, &elem, offset);
@@ -283,6 +450,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
 static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f)
 {
     f |= (1 << VIRTIO_BALLOON_F_STATS_VQ);
+    f |= (1 << VIRTIO_BALLOON_F_MESSAGE_VQ);
+
     return f;
 }
 
@@ -341,10 +510,20 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBalloon *s = VIRTIO_BALLOON(dev);
+    Error *local_err = NULL;
     int ret;
 
     virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON, 8);
 
+    if (auto_balloon_enabled_cmdline(s)) {
+        automatic_balloon_init(s, NULL /* default root memcg path */, &local_err);
+        if (local_err) {
+            virtio_cleanup(VIRTIO_DEVICE(s));
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+
     ret = qemu_add_balloon_handler(virtio_balloon_to_target,
                                    virtio_balloon_stat, s);
 
@@ -357,6 +536,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+    s->mvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_msg);
 
     register_savevm(dev, "virtio-balloon", -1, 1,
                     virtio_balloon_save, virtio_balloon_load, s);
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 30c9f2b..2ff4c09 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -1276,6 +1276,9 @@ static void balloon_pci_stats_set_poll_interval(Object *obj, struct Visitor *v,
 static Property virtio_balloon_pci_properties[] = {
     DEFINE_VIRTIO_COMMON_FEATURES(VirtIOPCIProxy, host_features),
     DEFINE_PROP_HEX32("class", VirtIOPCIProxy, class_code, 0),
+#ifdef __linux__
+    DEFINE_PROP_BIT("automatic", VirtIOBalloonPCI, bconf.auto_balloon_enabled, 0, false),
+#endif
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1289,6 +1292,8 @@ static int virtio_balloon_pci_init(VirtIOPCIProxy *vpci_dev)
         vpci_dev->class_code = PCI_CLASS_OTHERS;
     }
 
+    virtio_balloon_set_conf(vdev, &(dev->bconf));
+
     qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
     if (qdev_init(vdev) < 0) {
         return -1;
diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
index dc332ae..b430f68 100644
--- a/hw/virtio/virtio-pci.h
+++ b/hw/virtio/virtio-pci.h
@@ -144,6 +144,7 @@ struct VirtIOBlkPCI {
 struct VirtIOBalloonPCI {
     VirtIOPCIProxy parent_obj;
     VirtIOBalloon vdev;
+    VirtIOBalloonConf bconf;
 };
 
 /*
@@ -156,6 +157,7 @@ struct VirtIOBalloonPCI {
 struct VirtIOSerialPCI {
     VirtIOPCIProxy parent_obj;
     VirtIOSerial vdev;
+    VirtIOBalloonConf bconf;
 };
 
 /*
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index f863bfe..37511ad 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -30,10 +30,19 @@
 /* The feature bitmap for virtio balloon */
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ       1 /* Memory stats virtqueue */
+#define VIRTIO_BALLOON_F_MESSAGE_VQ     2 /* Message virtqueue */
+
+/* Messages supported by the message virtqueue */
+#define VIRTIO_BALLOON_MSG_PRESSURE 1
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+typedef struct VirtIOBalloonConf
+{
+    uint32_t auto_balloon_enabled;
+} VirtIOBalloonConf;
+
 struct virtio_balloon_config
 {
     /* Number of pages host wants Guest to give up. */
@@ -58,7 +67,7 @@ typedef struct VirtIOBalloonStat {
 
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
-    VirtQueue *ivq, *dvq, *svq;
+    VirtQueue *ivq, *dvq, *svq, *mvq;
     uint32_t num_pages;
     uint32_t actual;
     uint64_t stats[VIRTIO_BALLOON_S_NR];
@@ -67,6 +76,16 @@ typedef struct VirtIOBalloon {
     QEMUTimer *stats_timer;
     int64_t stats_last_update;
     int64_t stats_poll_interval;
+
+    /* automatic ballooning */
+    int cfd;
+    int lfd;
+    EventNotifier event;
+    uint32_t autob_cur_size;
+    time_t autob_last_guest_pressure;
+    VirtIOBalloonConf bconf;
 } VirtIOBalloon;
 
+void virtio_balloon_set_conf(DeviceState *dev, const VirtIOBalloonConf *bconf);
+
 #endif
--
1.8.1.4