On 13.10.21 21:03, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@xxxxxxxxxx) wrote:
>> Based-on: 20211011175346.15499-1-david@xxxxxxxxxx
>>
>> A virtio-mem device is represented by a single large RAM memory region
>> backed by a single large mmap.
>>
>> Right now, we map that complete memory region into guest physical
>> address space, resulting in a very large memory mapping, KVM memory
>> slot, ... although only a small amount of memory might actually be
>> exposed to the VM.
>>
>> For example, when starting a VM with a 1 TiB virtio-mem device that
>> only exposes little device memory (e.g., 1 GiB) towards the VM
>> initially, in order to hotplug more memory later, we waste a lot of
>> memory on metadata for KVM memory slots (> 2 GiB!) and accompanying
>> bitmaps. Although some optimizations in KVM are being worked on to
>> reduce this metadata overhead on x86-64 in some cases, it remains a
>> problem with nested VMs and there are other reasons why we would want
>> to reduce the total memory slot size to a reasonable minimum.
>>
>> We want to:
>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but
>>    also inside QEMU KVM code where possible.
>> b) Not always expose all device memory to the VM, to reduce the attack
>>    surface of malicious VMs without using userfaultfd.
>>
>> So instead, expose the RAM memory region not via a single large mapping
>> (consuming one memslot) but via multiple mappings, each consuming one
>> memslot. To do that, we divide the RAM memory region via aliases into
>> separate parts and only map the aliases we actually need into a device
>> container. We have to make sure that QEMU won't silently merge the
>> memory sections corresponding to the aliases (and thereby also the
>> memslots), otherwise we lose atomic updates with KVM and vhost-user,
>> which we deeply care about when adding/removing memory. Further, to get
>> memslot accounting right, such merging is better avoided.
>>
>> Within the memslots, virtio-mem can (un)plug memory at a smaller
>> granularity dynamically. So memslots are a pure optimization to tackle
>> a) and b) above.
>>
>> Memslots are right now mapped once they fall into the usable device
>> region (which grows/shrinks on demand, right now either when requesting
>> to hotplug more memory or during/after reboots). In the future, with
>> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases
>> even more dynamically when (un)plugging device blocks.
>>
>>
>> Adding a 500 GiB virtio-mem device and not hotplugging any memory
>> results in:
>>   0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>>     0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>>
>> Requesting the VM to consume 2 GiB results in (note: the usable region
>> size is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
>>   0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>>     0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>>       0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>>       0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>>       00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>
> I've got a vague memory that there were some devices that didn't like
> doing split IO across a memory region (or something) - some virtio
> devices? Do you know if that's still true and if that causes a problem?
Interesting point! I am not aware of any such issues, and I'd be
surprised if we still had such buggy devices, because the layout
virtio-mem now creates is very similar to the layout we'd automatically
create with ordinary DIMMs: if we hotplug DIMMs, they end up consecutive
in guest physical address space, but each has a separate memory region
and requires a separate memory slot. So, very similar to a virtio-mem
device now.

Maybe the catch is that it's hard to cross memory regions that are,
e.g., >= 128 MiB aligned, because ordinary allocations (e.g., via the
buddy in Linux, which supports <= 4 MiB pages) won't cross these blocks.

--
Thanks,

David / dhildenb
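
P.S.: In case it helps to visualize the alias approach from the cover
letter, here is a rough sketch of how carving the big backend RAM region
("mem0") into per-memslot aliases inside a device container could be
expressed with QEMU's memory API. The function name, MEMSLOT_SIZE and
nb_memslots are made up for this example and do not match the actual
patches:

#include "qemu/osdep.h"
#include "exec/memory.h"

#define MEMSLOT_SIZE (1ULL * 1024 * 1024 * 1024) /* 1 GiB, illustrative */

/*
 * Rough sketch only, not the actual virtio-mem code: carve the backend
 * RAM region ("mem0") into 1 GiB aliases and map each alias into a
 * device container. With section merging prevented (not shown here),
 * each mapped alias then consumes one KVM memory slot.
 */
static void map_memslots(Object *owner, MemoryRegion *container,
                         MemoryRegion *mem0, uint64_t nb_memslots)
{
    uint64_t i;

    for (i = 0; i < nb_memslots; i++) {
        MemoryRegion *alias = g_new0(MemoryRegion, 1);
        g_autofree char *name =
            g_strdup_printf("virtio-mem-memslot-%" PRIu64, i);

        /* Each alias covers one MEMSLOT_SIZE-sized slice of mem0 ... */
        memory_region_init_alias(alias, owner, name, mem0,
                                 i * MEMSLOT_SIZE, MEMSLOT_SIZE);
        /* ... and is mapped at the matching offset in the container. */
        memory_region_add_subregion(container, i * MEMSLOT_SIZE, alias);
    }
}

As the cover letter notes, QEMU would otherwise be free to merge
adjacent sections that resolve to the same backend region, so the real
code additionally has to prevent that merging to keep one section (and
thereby one memslot) per alias.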