On 14.10.21 15:17, David Hildenbrand wrote:
> On 14.10.21 13:45, Dr. David Alan Gilbert wrote:
>> * David Hildenbrand (david@xxxxxxxxxx) wrote:
>>> KVM nowadays supports a lot of memslots. We want to exploit that in
>>> virtio-mem, exposing device memory via separate memslots to the guest
>>> on demand, essentially reducing the total size of KVM slots
>>> significantly (and thereby metadata in KVM and in QEMU for KVM memory
>>> slots), especially when exposing initially only a small amount of memory
>>> via a virtio-mem device to the guest, to hotplug more later. Further,
>>> not always exposing the full device memory region to the guest reduces
>>> the attack surface in many setups without requiring other mechanisms
>>> like uffd for protection of unplugged memory.
>>>
>>> So split the original RAM region via memory region aliases into separate
>>> chunks (ending up as individual memslots), and dynamically map the
>>> required chunks (falling into the usable region) into the container.
>>>
>>> For now, we always map the memslots covered by the usable region. In the
>>> future, with VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to map
>>> memslots on actual demand and optimize further.
>>>
>>> Users can specify via the "max-memslots" property how many memslots the
>>> virtio-mem device is allowed to use at most. "0" translates to "auto, no
>>> limit" and is determined automatically using a heuristic. When a maximum
>>> (> 1) is specified, that auto-determined value is capped. The parameter
>>> doesn't have to be migrated and can differ between source and destination.
>>> The only reason the parameter exists is to make some corner case setups
>>> (multiple large virtio-mem devices assigned to a single virtual NUMA node
>>> with only very limited available memslots, hotplug of vhost devices) work.
>>> The parameter will default to "0" soon, while it will remain "1" for
>>> compat machines.
>>>
>>> The properties "memslots" and "used-memslots" are read-only.
>>>
>>> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
>>
>> I think you need to move this patch after the vhost-user patches so that
>> you don't break a bisect including vhost-user.
>
> As the default is set to 1 and is only set to 0 ("auto") in the last patch
> in this series, there should be (almost) no difference regarding vhost-user.
>
>>
>> But I do worry about the effect on vhost-user:
>
> The 4096 limit was certainly more "let's make it extreme so we raise
> some eyebrows and we can talk about the implications". I'd be perfectly
> happy with 256, or better 512. Anything that's bigger than 32 in the
> case of virtiofsd :)
>
>> a) What about external programs like dpdk?
>
> At least initially, virtio-mem won't apply to dpdk and similar workloads
> (RT). For example, virtio-mem is incompatible with mlock. So I think the
> most important use case to optimize for is virtio-mem + virtiofsd
> (especially kata).
>
>> b) I worry if you end up with a LOT of slots you end up with a lot of
>> mmap's and fd's in vhost-user; I'm not quite sure what all the effects
>> of that will be.
>
> At least for virtio-mem, there will be a small number of fds, as many
> memslots share the same fd, so with virtio-mem it's not an issue.
>
> The number of VMAs is indeed worth discussing. Usually we can have up to
> 64k VMAs in a process. The downside of having many is somewhat reduced
> page fault performance. It really also depends on the target application.
> Maybe there should be some libvhost-user toggle, where the application
> can opt in to allow more?

I just did a simple test with memfds. The 1024 open fds limit does not
apply to fds we have already closed again, so the limit does not apply
when the mapping is established via

    fd = open()
    addr = mmap(fd)
    close(fd)

For example, I created 4096 memfds, mapped each of them, and then closed
the file descriptor again.
The end result is

$ ls -la /proc/38113/map_files/ | wc -l
4115
$ ls -la /proc/38113/fd/ | wc -l
6

meaning there are many individual mappings, but only a very limited number
of open files. Which should be precisely what we are doing in the
libvhost-user code (and should be doing in any other vhost-user
implementation -- once we did the mmap(), we should let go of the fd).

--
Thanks,

David / dhildenb