On 17.09.23 13:47, Maciej S. Szmigiero wrote:
On 8.09.2023 16:21, David Hildenbrand wrote:
Having large virtio-mem devices that only expose little memory to a VM
is currently a problem: we map the whole sparse memory region into the
guest using a single memslot, resulting in one gigantic memslot in KVM.
KVM allocates metadata for the whole memslot, which can result in quite
some memory waste.
Assuming we have a 1 TiB virtio-mem device and only expose little (e.g.,
1 GiB) memory, we would create a single 1 TiB memslot and KVM has to
allocate metadata for that 1 TiB memslot. On x86, this implies allocating
a significant amount of memory for metadata:
(1) RMAP: 8 bytes per 4 KiB, 8 bytes per 2 MiB, 8 bytes per 1 GiB
-> For 1 TiB: 2147483648 + 4194304 + 8192 = ~ 2 GiB (0.2 %)
With the TDP MMU (cat /sys/module/kvm/parameters/tdp_mmu) this gets
allocated lazily when required for nested VMs
(2) gfn_track: 2 bytes per 4 KiB
-> For 1 TiB: 536870912 = ~512 MiB (0.05 %)
(3) lpage_info: 4 bytes per 2 MiB, 4 bytes per 1 GiB
-> For 1 TiB: 2097152 + 4096 = ~2 MiB (0.0002 %)
(4) 2x dirty bitmaps for tracking: 2x 1 bit per 4 KiB page
-> For 1 TiB: 536870912 bits = 64 MiB (0.006 %)
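
As a rough cross-check of the numbers in (1)-(4), the per-memslot metadata
can be estimated with a few lines of Python. This is only a
back-of-the-envelope sketch: the function name is made up for illustration,
the per-page byte counts are the ones listed above, and it assumes the
memslot size is a multiple of 1 GiB.

```python
KiB = 1 << 10
MiB = 1 << 20
GiB = 1 << 30
TiB = 1 << 40


def memslot_metadata_bytes(memslot_size):
    """Estimate x86 KVM metadata (in bytes) for one memslot.

    Hypothetical helper: uses the per-page costs quoted in the text and
    assumes memslot_size is a multiple of 1 GiB.
    """
    pages_4k = memslot_size // (4 * KiB)
    pages_2m = memslot_size // (2 * MiB)
    pages_1g = memslot_size // GiB
    return {
        # (1) 8 bytes per 4 KiB, 2 MiB and 1 GiB page
        "rmap": 8 * (pages_4k + pages_2m + pages_1g),
        # (2) 2 bytes per 4 KiB page
        "gfn_track": 2 * pages_4k,
        # (3) 4 bytes per 2 MiB and 1 GiB page
        "lpage_info": 4 * (pages_2m + pages_1g),
        # (4) two bitmaps with 1 bit per 4 KiB page, converted to bytes
        "dirty_bitmaps": 2 * pages_4k // 8,
    }


if __name__ == "__main__":
    for name, size in memslot_metadata_bytes(TiB).items():
        print(f"{name}: {size} bytes ({size / MiB:.1f} MiB)")
```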
So we primarily care about (1) and (2). The bad thing is that the
memory consumption *doubles* once SMM is enabled, because we create the
memslot once for !SMM and once for SMM.
Having a 1 TiB memslot without the TDP MMU consumes around:
* With SMM: 5 GiB
* Without SMM: 2.5 GiB
Having a 1 TiB memslot with the TDP MMU consumes around:
* With SMM: 1 GiB
* Without SMM: 512 MiB
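
The totals above follow from (1) and (2) alone. A minimal sketch, assuming
that the RMAP is never allocated when the TDP MMU is in use (i.e., no
nested VMs) and ignoring the much smaller contributions of (3) and (4);
the function name is hypothetical:

```python
MiB = 1 << 20
GiB = 1 << 30

# Per-memslot figures for a 1 TiB memslot, rounded as in the text.
RMAP_1TIB = 2 * GiB
GFN_TRACK_1TIB = 512 * MiB


def total_metadata_bytes(rmap, gfn_track, *, smm, tdp_mmu):
    """Approximate metadata for one memslot, counting only (1) and (2)."""
    # Assumption: with the TDP MMU and no nested VMs, the RMAP stays unallocated.
    per_memslot = gfn_track + (0 if tdp_mmu else rmap)
    # With SMM, the memslot exists twice: once for !SMM and once for SMM.
    return per_memslot * (2 if smm else 1)


if __name__ == "__main__":
    for tdp_mmu in (False, True):
        for smm in (True, False):
            total = total_metadata_bytes(RMAP_1TIB, GFN_TRACK_1TIB,
                                         smm=smm, tdp_mmu=tdp_mmu)
            print(f"tdp_mmu={tdp_mmu} smm={smm}: {total / GiB} GiB")
```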
... and that's really something we want to optimize, to be able to just
start a VM with small boot memory (e.g., 4 GiB) and a virtio-mem device
that can grow very large (e.g., 1 TiB).
Consequently, using multiple memslots and only mapping the memslots we
really need can significantly reduce memory waste and speed up
memslot-related operations. Let's expose the sparse RAM memory region using
multiple memslots, mapping only the memslots we currently need into our
device memory region container.
* With VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we only map the memslots that
actually have memory plugged, and dynamically (un)map when
(un)plugging memory blocks.
* Without VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we always map the memslots
covered by the usable region, and dynamically (un)map when resizing the
usable region.
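
The two cases above boil down to different rules for which memslots are
mapped at any given time. A minimal sketch with made-up names and a
simplified model: plugged memory is given as a list of block indices, and
every memory block is assumed to lie entirely within one memslot (memslot
size is a multiple of the block size).

```python
MiB = 1 << 20
GiB = 1 << 30


def memslots_to_map(*, memslot_size, block_size, plugged_blocks,
                    usable_region_size, unplugged_inaccessible):
    """Return the sorted indices of the memslots that should be mapped.

    Hypothetical helper, not taken from the patch.
    """
    if unplugged_inaccessible:
        # Only memslots that contain at least one plugged memory block.
        return sorted({(block * block_size) // memslot_size
                       for block in plugged_blocks})
    # Otherwise: every memslot that overlaps the usable region
    # [0, usable_region_size); ceiling division gives the count.
    count = -(-usable_region_size // memslot_size)
    return list(range(count))


if __name__ == "__main__":
    common = dict(memslot_size=GiB, block_size=2 * MiB,
                  plugged_blocks=[0, 1, 1024], usable_region_size=3 * GiB)
    print(memslots_to_map(unplugged_inaccessible=True, **common))
    print(memslots_to_map(unplugged_inaccessible=False, **common))
```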
We'll auto-detect the number of memslots to use based on the memslot limit
provided by the core. We'll use at most 1 memslot per gigabyte. Note that
our global limit of memslots across all memory devices is currently set to
256: even with multiple large virtio-mem devices, we'd still have a sane
limit on the number of memslots used.
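
To illustrate the auto-detection, here is a small sketch of one way to pick
the memslot size and count from the region size and the memslot limit. It
is not the code from this patch: the helper name, rounding the memslot size
up to a power of two, and treating "gigabyte" as 1 GiB are assumptions made
for the example. It also shows how the last memslot can end up smaller than
the others.

```python
GiB = 1 << 30


def decide_memslots(region_size, limit):
    """Return (memslot_size, list of the sizes of all memslots).

    Hypothetical helper; region_size is in bytes, limit is the maximum
    number of memslots the device may use.
    """
    if region_size <= 0 or limit <= 0:
        raise ValueError("region_size and limit must be positive")
    # Smallest memslot size that fits the region into 'limit' memslots.
    memslot_size = -(-region_size // limit)
    # At most 1 memslot per GiB: never go below 1 GiB per memslot.
    memslot_size = max(memslot_size, GiB)
    # Assumption for this sketch: round up to the next power of two.
    memslot_size = 1 << (memslot_size - 1).bit_length()
    count = -(-region_size // memslot_size)
    # All memslots are full-sized, except possibly the last one.
    sizes = [memslot_size] * (count - 1)
    sizes.append(region_size - memslot_size * (count - 1))
    return memslot_size, sizes


if __name__ == "__main__":
    size, sizes = decide_memslots(1024 * GiB, 256)
    print(size // GiB, len(sizes))
    size, sizes = decide_memslots(11 * GiB // 2, 4)
    print(size / GiB, [s / GiB for s in sizes])
```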
The default is a single memslot for now ("multiple-memslots=off"). The
optimization must be enabled manually using "multiple-memslots=on", because
some vhost setups (e.g., hotplug of vhost-user devices) might be
problematic until we support more memslots, especially in vhost-user
backends.
Note that "multiple-memslots=on" is just a hint that multiple memslots
*may* be used for internal optimizations, not that multiple memslots
*must* be used. The actual number of memslots that are used is an
internal detail: for example, once memslot metadata is no longer an
issue, we could simply stop optimizing for that. Migration source and
destination can differ on the setting of "multiple-memslots".
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@xxxxxxxxxx>
Hope this patch was well-tested, especially on corner cases, since
it's very easy to make an off-by-one somewhere (like v1 had) and
much harder to spot it when doing a static code review.
I did indeed test this series reasonably well. In particular, I also
exercised the corner case of the last memslot having a different size.
Thanks for all the review!
--
Cheers,
David / dhildenb