This is the Linux driver side of virtio-mem. Compared to the QEMU side, it
is in a pretty complete and clean state.

virtio-mem is a paravirtualized mechanism of adding/removing memory to/from
a VM. We can do this on a 4MB granularity right now. In Linux, all memory
is added to ZONE_NORMAL, so unplugging cannot be guaranteed - but it is
more likely to succeed than unplugging 128MB+ chunks. We might implement
some optimizations in that area in the future that will make memory unplug
more reliable. For now, this is an easy way to give a VM access to more
memory and eventually to remove some memory again. I am testing it on x86
and s390x (under QEMU TCG so far only).

This is the follow up on [1], but the concept, user interface and virtio
protocol have been heavily changed. I am only including the important parts
in this cover letter (because otherwise nobody will read it). Please feel
free to ask in case there are any questions.

This series is based on [4] and shows how it is being used. It contains
further information. Also have a look at the description of patch nr 4 in
this series.

This work is the result of the initial idea of Andrea Arcangeli to
host-enforce guest access to memory inflated in virtio-balloon using
userfaultfd, which turned out to be problematic to implement. That's how I
came up with virtio-mem.

--------------------------------------------------------------------------
1. High level concept
--------------------------------------------------------------------------

Each virtio-mem device owns a memory region in the physical address space.
The guest is allowed to plug and online up to 'requested_size' of memory.
It will not be allowed to plug more than that size. Unplugged memory will
be protected by configurable mechanisms (e.g. random discard, userfaultfd
protection, etc.). virtio-mem is designed in a way that a guest must never
assume it can even read unplugged memory. This is a big difference to
classical balloon drivers.
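To make the 'requested_size' limit concrete, here is a minimal sketch in
Python (chosen for brevity, it is not driver code). The class name, the
method names and the per-block accounting are assumptions made purely for
illustration; they are not the virtio-mem protocol or the driver interface.

```python
BLOCK_SIZE = 4 * 1024 * 1024  # 4MB granularity mentioned in this cover letter


class ToyVirtioMemDevice:
    """Hypothetical model of one device; not the real virtio-mem protocol."""

    def __init__(self, region_size, requested_size):
        self.region_size = region_size        # size of the owned memory region
        self.requested_size = requested_size  # set by the hypervisor
        self.plugged = set()                  # indices of plugged blocks

    @property
    def size(self):
        """Amount of memory currently plugged, in bytes."""
        return len(self.plugged) * BLOCK_SIZE

    def plug(self, block):
        """Guest asks to plug one block; refused beyond 'requested_size'."""
        if block < 0 or (block + 1) * BLOCK_SIZE > self.region_size:
            return False  # outside the device memory region
        if block in self.plugged:
            return False  # already plugged
        if self.size + BLOCK_SIZE > self.requested_size:
            return False  # would exceed what the hypervisor allows
        self.plugged.add(block)
        return True

    def unplug(self, block):
        """Guest gives one block back; it must not touch it afterwards."""
        if block not in self.plugged:
            return False
        self.plugged.remove(block)
        return True


# The hypervisor allows two blocks, so the third plug request is refused.
dev = ToyVirtioMemDevice(region_size=64 * BLOCK_SIZE,
                         requested_size=2 * BLOCK_SIZE)
results = [dev.plug(0), dev.plug(1), dev.plug(2)]
print(results, dev.size)
```

In this toy model the limit is enforced on the device side, so a guest
that ignores 'requested_size' simply gets its plug requests refused.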
The usable memory region might grow over time, so not all parts of the
device memory region might be usable from the start. This is an
optimization to allow a smarter implementation in the hypervisor (reduce
size of dirty bitmaps, size of memory regions ...).

When the device driver starts up, it will query 'requested_size' and start
to add memory to the system. This memory is not indicated e.g. via ACPI,
so unmodified systems will not silently try to use unplugged memory that
they are not supposed to touch. Updates on the 'requested_size' indicate
hypervisor requests to plug or unplug memory.

As each virtio-mem device can belong to a NUMA node, we can easily
plug/unplug memory on a NUMA basis. And of course, we can have several
independent virtio-mem devices for a VM.

The idea is *not* to add new virtio-mem devices when hotplugging memory,
the idea is to resize (grow/shrink) virtio-mem devices.

--------------------------------------------------------------------------
2. Benefits
--------------------------------------------------------------------------

Guest side:
- Increase memory usable by Linux in 4MB steps (vs. section size like 128MB
  on x86 or 2GB on e.g. some arm if I'm not mistaken)
- Remove struct pages once all 4MB chunks of a section are offline (in
  contrast to all balloon drivers where this never happens)
- Don't fragment memory, while still being able to unplug smaller chunks
  than ordinary DIMM sizes.
- Memory hotplug support for architectures that have no proper interface
  (e.g. s390x misses the external notification part) or where e.g.
  QEMU/Linux support is complicated to implement.
- Automatic management of onlining/offlining in the device driver - no
  manual interaction from an admin/tool necessary.

QEMU side:
- Resizing (plug/unplug) has a single interface - in contrast to a mixture
  of ACPI and virtio-balloon. See the example below.
- Migration works out of the box - no need to specify new DIMMs or new
  sizes on the migration target. It simply works.
- We can resize in arbitrary steps and sizes (in contrast to e.g. ACPI,
  where we have to know upfront in which granularity we later on want to
  remove memory or even how much memory we eventually want to add to our
  guest)
- One interface to rule them (architectures) all :)

--------------------------------------------------------------------------
3. Reboot handling
--------------------------------------------------------------------------

After a reboot, all memory is unplugged. This allows the hypervisor to see
if support for virtio-mem is available in the freshly booted system. This
way we could charge only for the actually "plugged" memory size. And it
avoids having to sense for plugged memory in the guest.

E.g. on every size change of a virtio-mem device, we can notify management
layers. So we can track how much memory a VM has plugged.

--------------------------------------------------------------------------
4. Example
--------------------------------------------------------------------------

(not including resizable memory regions on the QEMU side yet, so don't
focus on that part - it will consume a lot of memory right now for e.g.
dirty bitmaps and memory slot tracking data)

Start QEMU with two virtio-mem devices that provide little memory
initially.

$ qemu-system-x86_64 -m 4G,maxmem=504G \
  -smp sockets=2,cores=2 \
  [...]
  -object memory-backend-ram,id=mem0,size=256G \
  -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,size=4160M \
  -object memory-backend-ram,id=mem1,size=256G \
  -device virtio-mem-pci,id=vm1,memdev=mem1,node=1,size=3G

Query the configuration ('size' tells us the guest driver is active):

(qemu) info memory-devices
info memory-devices
Memory device [virtio-mem]: "vm0"
  phys-addr: 0x140000000
  node: 0
  requested-size: 4362076160
  size: 4362076160
  max-size: 274877906944
  block-size: 4194304
  memdev: /objects/mem0
Memory device [virtio-mem]: "vm1"
  phys-addr: 0x4140000000
  node: 1
  requested-size: 3221225472
  size: 3221225472
  max-size: 274877906944
  block-size: 4194304
  memdev: /objects/mem1

Change the size of a virtio-mem device:

(qemu) memory-device-resize vm0 40960
memory-device-resize vm0 40960
...
(qemu) info memory-devices
info memory-devices
Memory device [virtio-mem]: "vm0"
  phys-addr: 0x140000000
  node: 0
  requested-size: 42949672960
  size: 42949672960
  max-size: 274877906944
  block-size: 4194304
  memdev: /objects/mem0
...

Try to unplug memory (KASAN active in the guest - a lot of memory wasted):

(qemu) memory-device-resize vm0 1024
memory-device-resize vm0 1024
...
(qemu) info memory-devices
info memory-devices
Memory device [virtio-mem]: "vm0"
  phys-addr: 0x140000000
  node: 0
  requested-size: 1073741824
  size: 6169821184
  max-size: 274877906944
  block-size: 4194304
  memdev: /objects/mem0
...

I am sharing for now only the linux driver side. The current code can be
found at [2]. The QEMU side is still heavily WIP, the current QEMU
prototype can be found at [3].
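As a sanity check on the numbers in the example in section 4, the small
Python sketch below redoes the size arithmetic. It assumes that the
argument of 'memory-device-resize' is given in MiB, which is what the
reported 'requested-size' values suggest; the helper names are made up for
illustration.

```python
MIB = 1024 * 1024
BLOCK_SIZE = 4194304  # 'block-size' reported by 'info memory-devices'


def mib_to_bytes(mib):
    """Convert MiB to bytes (assumed unit of the resize argument)."""
    return mib * MIB


def blocks(nbytes):
    """Number of whole blocks in nbytes; expects a block-aligned size."""
    assert nbytes % BLOCK_SIZE == 0
    return nbytes // BLOCK_SIZE


# Values copied from the command line and monitor output in the example.
initial = mib_to_bytes(4160)        # size=4160M for vm0
grown = mib_to_bytes(40960)         # memory-device-resize vm0 40960
unplug_target = mib_to_bytes(1024)  # memory-device-resize vm0 1024
still_plugged = 6169821184          # 'size' after the unplug request

# Memory the guest has not (yet) been able to unplug.
pending = still_plugged - unplug_target

print(initial, grown, unplug_target)
print(blocks(still_plugged), blocks(unplug_target), blocks(pending))
print(pending // MIB)
```

Under that assumption, the last query shows 1471 blocks still plugged
against a target of 256 blocks, i.e. 1215 blocks (4860 MiB) that could not
be unplugged at that point.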
[1] https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html
[2] https://github.com/davidhildenbrand/linux/tree/virtio-mem
[3] https://github.com/davidhildenbrand/qemu/tree/virtio-mem
[4] https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1698014.html

David Hildenbrand (4):
  ACPI: NUMA: export pxm_to_node
  s390: mm: support removal of memory
  s390: numa: implement memory_add_physaddr_to_nid()
  virtio-mem: paravirtualized memory

 arch/s390/mm/init.c             |   18 +-
 arch/s390/numa/numa.c           |   12 +
 drivers/acpi/numa.c             |    1 +
 drivers/virtio/Kconfig          |   15 +
 drivers/virtio/Makefile         |    1 +
 drivers/virtio/virtio_mem.c     | 1040 +++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_ids.h |    1 +
 include/uapi/linux/virtio_mem.h |  134 ++++
 8 files changed, 1216 insertions(+), 6 deletions(-)
 create mode 100644 drivers/virtio/virtio_mem.c
 create mode 100644 include/uapi/linux/virtio_mem.h

-- 
2.17.0