On Tue, Nov 02, 2021 at 06:10:13PM +0100, David Hildenbrand wrote:
> On 02.11.21 18:06, Michael S. Tsirkin wrote:
> > On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
> >> On 02.11.21 12:35, Michael S. Tsirkin wrote:
> >>> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
> >>>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
> >>>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> >>>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
> >>>>>> changes from the initial RFC.
> >>>>>>
> >>>>>> Based-on: 20211011175346.15499-1-david@xxxxxxxxxx
> >>>>>>
> >>>>>> A virtio-mem device is represented by a single large RAM memory region
> >>>>>> backed by a single large mmap.
> >>>>>>
> >>>>>> Right now, we map that complete memory region into guest physical address
> >>>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
> >>>>>> although only a small amount of memory might actually be exposed to the VM.
> >>>>>>
> >>>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
> >>>>>> exposes little device memory (e.g., 1 GiB) towards the VM initially,
> >>>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
> >>>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> >>>>>> optimizations in KVM are being worked on to reduce this metadata overhead
> >>>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
> >>>>>> other reasons why we would want to reduce the total memory slot size to a
> >>>>>> reasonable minimum.
> >>>>>>
> >>>>>> We want to:
> >>>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
> >>>>>>    inside QEMU KVM code where possible.
> >>>>>> b) Not always expose all device-memory to the VM, to reduce the attack
> >>>>>>    surface of malicious VMs without using userfaultfd.
> >>>>>
> >>>>> I'm confused by the mention of these security considerations,
> >>>>> and I expect users will be just as confused.
> >>>>
> >>>> Malicious VMs wanting to consume more memory than desired is only
> >>>> relevant when running untrusted VMs in some environments, and it can be
> >>>> caught differently, for example, by carefully monitoring and limiting
> >>>> the maximum memory consumption of a VM. We have the same issue already
> >>>> when using virtio-balloon to logically unplug memory. For me, it's a
> >>>> secondary concern (optimizing a) is much more important).
> >>>>
> >>>> Some users showed interest in having QEMU disallow access to unplugged
> >>>> memory, because coming up with a maximum memory consumption for a VM is
> >>>> hard. This is one step in that direction without having to run with
> >>>> uffd enabled all of the time.
> >>>
> >>> Sorry about missing the memo - is there a lot of overhead associated
> >>> with uffd then?
> >>
> >> When used with huge/gigantic pages, we don't particularly care.
> >>
> >> For other memory backends, we'll have to route any population via the
> >> uffd handler: guest accesses a 4k page -> place a 4k page from user
> >> space. Instead of the kernel automatically placing a THP, we'd be
> >> placing single 4k pages and have to hope the kernel will collapse them
> >> into a THP later.
> >
> > How much value is there in a THP given it's not present?
> 
> If you don't place a THP right during the first page fault inside the
> THP region, you'll have to rely on khugepaged to eventually place a huge
> page later -- and manually fault in each and every 4k page. I haven't
> done any performance measurements so far. Going via userspace on every
> 4k fault will most certainly hurt performance when first touching memory.

So, if the focus is performance improvement, maybe show the speedup?

> --
> Thanks,
>
> David / dhildenb
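
For reference, here is a minimal standalone sketch (plain C, not QEMU code, all names illustrative) of the population path discussed above: a region registered with userfaultfd in MISSING mode, a handler thread that resolves every missing-page fault with a single 4 KiB UFFDIO_COPY, and a toucher loop standing in for the guest, so each first access takes one round trip through user space. It assumes a Linux host where the caller may use userfaultfd (recent kernels restrict unprivileged use; see the vm.unprivileged_userfaultfd sysctl); compile with -pthread.

#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SIZE   4096UL
#define REGION_SIZE (512 * PAGE_SIZE)

static char *src_page;   /* content handed out on every missing fault */

/* Handler thread: resolve each missing 4k page with one UFFDIO_COPY. */
static void *uffd_handler(void *arg)
{
    int uffd = *(int *)arg;

    for (;;) {
        struct pollfd pfd = { .fd = uffd, .events = POLLIN };
        if (poll(&pfd, 1, -1) < 0)
            break;

        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            continue;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;

        struct uffdio_copy copy = {
            .dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1),
            .src = (unsigned long)src_page,
            .len = PAGE_SIZE,   /* one 4k page at a time, never a THP */
            .mode = 0,
        };
        if (ioctl(uffd, UFFDIO_COPY, &copy) && errno != EEXIST) {
            perror("UFFDIO_COPY");
            break;
        }
    }
    return NULL;
}

int main(void)
{
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) { perror("userfaultfd"); return 1; }

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api)) { perror("UFFDIO_API"); return 1; }

    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    src_page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED || src_page == MAP_FAILED) { perror("mmap"); return 1; }
    memset(src_page, 0x5a, PAGE_SIZE);

    /* Route all population of 'region' through the handler thread. */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)region, .len = REGION_SIZE },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg)) { perror("UFFDIO_REGISTER"); return 1; }

    pthread_t thread;
    pthread_create(&thread, NULL, uffd_handler, &uffd);

    /* "Guest" side: every first touch of a 4k page traps to the handler. */
    unsigned long sum = 0;
    for (unsigned long off = 0; off < REGION_SIZE; off += PAGE_SIZE)
        sum += (unsigned char)region[off];

    printf("populated %lu pages (checksum %lu), one userspace round trip each\n",
           REGION_SIZE / PAGE_SIZE, sum);
    return 0;
}

Timing the touch loop with and without the UFFDIO_REGISTER (and with THP enabled on the backing memory) would be one way to put a number on the first-touch overhead being asked about here.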