On 7/7/21 12:30 PM, David Hildenbrand wrote:
> On 23.06.21 12:12, Michal Privoznik wrote:
>> v4 of:
>>
>> https://listman.redhat.com/archives/libvir-list/2021-April/msg01138.html
>>
>> diff to v3:
>> - Rebased code on the top of master
>> - Tried to work in all Peter's review suggestions
>> - Fixed a bug where adjusting <requested/> was viewed as hotplug of
>>   new <memory/> by the XML validator and thus if <maxMemory/> was
>>   close enough to <currentMemory/> the validator reported an error
>>   (this was reported by David).
>>
>
> Hi Michal,

Hi, sorry for replying this late.

>
> I just retested with this version and it mostly works as expected. I
> tested quite some memory configurations and have some comments /
> reports :)
>
> I tested successfully:
> - 1 node with one device
> - 2 nodes with one device on each node
> - 2 nodes with two devices on one node
> - "virsh update-memory-device" on live domains -- works great
> - huge pages and anonymous memory with access=private and
>   access=shared. There is only one issue with hugepages and memfd
>   (prealloc=on gets set).
> - shared memory on memfd and anonymous memory (-> shared file) with
>   access=shared
>
> I only tested on a single host NUMA node so far, but don't expect
> surprises with host numa policies.
>
>
> 1. "virsh update-memory-device" and stopped domains
>
> Once I have more than one virtio-mem device defined for a VM, "virsh
> update-memory-device" cannot be used anymore as aliases don't seem to
> be available on stopped VMs. If I manually define an alias on a
> stopped VM, the alias silently gets dropped. Is there any way to
> identify a virtio-mem device on a stopped domain?

Yes. You want what we call user aliases. They have to have the "ua-"
prefix so that they don't clash with whatever libvirt generates.
Something like this:

  <memory model="virtio-mem">
    <source>
      <pagesize unit="KiB">2048</pagesize>
    </source>
    <target>
      <size unit="KiB">2097152</size>
      <node>0</node>
      <block unit="KiB">2048</block>
      <requested unit="KiB">1048576</requested>
    </target>
    <alias name="ua-virtiomem0"/>
    <address type="pci" domain="0x0000" bus="0x00" slot="0x06"
             function="0x0"/>
  </memory>

But then you get to the fact that I haven't implemented update for
inactive domains. Will do in v5.
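Once that lands, the device should be addressable on a stopped domain
via its user alias just like on a live one; something like this
(reusing the alias from the XML above - take the exact invocation with
a grain of salt until v5 is out):

  # virsh update-memory-device Fedora34 --config \
        --alias ua-virtiomem0 --requested-size 2G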
"current" > > "<actual unit='KiB'>16777216</actual>" I wonder if "current" instead of > "actual" would be more in line with "currentMemory". But no strong opinion. Yeah, I don't have any opinion either. I can change it. > > > 5. Slot handling. > > As already discussed, virtio-mem and virtio-pmem don't need slots. Yet, > the "slots" definition is required and libvirt reserves once slot for > each such device ("error: unsupported configuration: memory device count > '2' exceeds slots count '1'"). This is certainly future work, if we ever > want to change that. I can look into this. > > > 6. 4k source results in an error > > <source> > <pagesize unit='KiB'>4096</pagesize> > <nodemask>0-1</nodemask> > </source> > > "error: internal error: Unable to find any usable hugetlbfs mount for > 4096 KiB" > > This example is taken from https://libvirt.org/formatdomain.html for > DIMMs. Not sure what the expected behavior is. Ouch, this is a clear bug. Let me investigate and fix in next version. > > > 7. File source gets silently dropped > > <source> > <path>/dev/shmem/vm0</path> > </source> > > The statement gets silently dropped, which is somewhat surprising. > However, I did not test what happens with DIMMs, maybe it's the same. Yeah, this is somewhat expected. I mean, the part that's expected is that libvirt drops parts it doesn't parse. Sometimes it is pretty obvious (<source someRandomAttribute='value'/>) and sometimes it's not so (like in your example when <path/> makes sense for other memory models like virtio-pmem). But just to be sure - since virtio-mem can be backed by any memory-backing-* backend, does it make sense to have <path/> there? So far my code would use memory-backend-file for hugepages only. > > > 8. Global preallocation of memory > > With > > <memoryBacking> > <allocation mode="immediate"\> > </memoryBacking> > > we also get "prealloc=on" set for the memory backends of the virito-mem > devices, which is sub-optimal, because we end up preallocating all > memory of the memory backend (which is unexpected for a virtio-mem > device) and virtio-mem will then discard all memory immediately again. > So it's essentially a dangerous NOP -- dangerous because we temporarily > consume a lot of memory. > > In an ideal world, we would not set this for the memory backend used for > the virtio-mem devices, but for the virtio-mem devices themselves, such > that preallocation happens when new memory blocks are actually exposed > to the VM. > > As virtio-mem does not support "prealloc=on" for virtio-mem devices yet, > this is future work. We might want to error out, though, if <allocation > mode="immediate"\> is used along with virtio-mem devices for now. I'm > planning on implementing this in QEMU soon. Until then, it might also be > good enough to simply document that this setup should be avoided. Right. Meanwhile this was implemented in QEMU and thus I can drop prealloc=on. But then my question is what happens when user wants to expose additional memory to the guest but doesn't have enough free hugepages in the pool? Libvirt's using prealloc=on so that QEMU doesn't get killed later, after the guest booted up. > > > 9. Memfd and huge pages > > <memoryBacking> > <source type="memfd"/> > </memoryBacking> > > and > > <memory model='virtio-mem' access='shared'> > <source> > <pagesize unit='KiB'>2048</pagesize> > </source> > ... 
>
> 9. Memfd and huge pages
>
> <memoryBacking>
>   <source type="memfd"/>
> </memoryBacking>
>
> and
>
> <memory model='virtio-mem' access='shared'>
>   <source>
>     <pagesize unit='KiB'>2048</pagesize>
>   </source>
>   ...
> </memory>
>
> I get on the QEMU cmdline
>
> "-object
> {"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"prealloc":true,"size":17179869184}"
>
> Dropping the "memfd" source I get on the QEMU cmdline:
>
> -object
> {"qom-type":"memory-backend-file","id":"memvirtiomem0","mem-path":"/dev/hugepages/libvirt/qemu/2-Fedora34-2","share":true,"size":17179869184}
>
> "prealloc":true should not have been added for virtio-mem in case of
> memfd. !memfd does what's expected.
>

Okay, I will fix this. But can you shed more light here? I mean, why
the difference?

>
> 10. Memory locking
>
> With
>
> <memoryBacking>
>   <locked/>
> </memoryBacking>
>
> virtio-mem fails with
>
> "qemu-system-x86_64: -device
> virtio-mem-pci,node=0,block-size=2097152,requested-size=0,memdev=memvirtiomem0,id=virtiomem0,bus=pci.0,addr=0x2:
> Incompatible with mlock"
>
> Unfortunately, for example, on shmem like:
>
> <memoryBacking>
>   <locked/>
>   <access mode="shared"/>
>   <source type="memfd"/>
> </memoryBacking>
>
> it seems to fail after essentially (temporarily) preallocating all
> memory for the memory backend of the virtio-mem device. In the
> future, virtio-mem might be able to support mlock; until then, this
> is suboptimal but at least fails at some point.
>
>
> 11. Reservation of memory
>
> With new QEMU versions we'll want to pass "reserve=off" for the
> memory backend used, especially with hugepages and private mappings.
> While this change was merged into QEMU, it's not part of an official
> release yet. Future work.
>
> https://lore.kernel.org/qemu-devel/20210510114328.21835-1-david@xxxxxxxxxx/
>
> Otherwise, when we don't have the "size" currently in free and
> "unreserved" hugepages, we'll fail with "qemu-system-x86_64: unable
> to map backing store for guest RAM: Cannot allocate memory". The same
> thing can easily happen on anonymous memory when memory overcommit
> isn't disabled.
>
> So this is future work, but at least the QEMU part is already
> upstream.

So what's the difference between reserve and prealloc?

Michal