On Fri, Apr 12, 2019 at 01:15:05PM +0200, Michal Privoznik wrote:
> On 4/12/19 12:11 PM, Daniel Henrique Barboza wrote:
> > On 4/12/19 6:10 AM, Michal Privoznik wrote:
> > > On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
> > > > On 4/11/19 11:56 AM, Michal Privoznik wrote:
> > > > > On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I've tested these patches again, twice, in setups similar to the
> > > > > > ones I used for the first version (first on a Power8, then on a
> > > > > > Power9 server).
> > > > > >
> > > > > > Same results, though. Libvirt will not avoid launching a pseries
> > > > > > guest with numanode=strict, even if the NUMA node does not have
> > > > > > enough available RAM. If I stress-test the memory of the guest to
> > > > > > force the allocation, QEMU exits with an error as soon as the
> > > > > > memory of the host NUMA node is exhausted.
> > > > >
> > > > > Yes, this is expected. I mean, by default qemu doesn't allocate
> > > > > memory for the guest fully. You'd have to force it:
> > > > >
> > > > >   <memoryBacking>
> > > > >     <allocation mode='immediate'/>
> > > > >   </memoryBacking>
> > > >
> > > > Tried with this extra setting, still no good. The domain still boots,
> > > > even if there is not enough memory to load all of its RAM into the
> > > > NUMA node I am setting. For reference, this is the top of the guest
> > > > XML:
> > > >
> > > >   <name>vm1</name>
> > > >   <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
> > > >   <memory unit='KiB'>314572800</memory>
> > > >   <currentMemory unit='KiB'>314572800</currentMemory>
> > > >   <memoryBacking>
> > > >     <allocation mode='immediate'/>
> > > >   </memoryBacking>
> > > >   <vcpu placement='static'>16</vcpu>
> > > >   <numatune>
> > > >     <memory mode='strict' nodeset='0'/>
> > > >   </numatune>
> > > >   <os>
> > > >     <type arch='ppc64' machine='pseries'>hvm</type>
> > > >     <boot dev='hd'/>
> > > >   </os>
> > > >   <clock offset='utc'/>
> > > >
> > > > While doing this test, I recalled that some of my IBM peers recently
> > > > mentioned that they were unable to do a pre-allocation of the RAM of
> > > > a pseries guest using Libvirt, but they were able to do it using QEMU
> > > > directly (using -realtime mlock=on). In fact, I just tried it out
> > > > with command-line QEMU and the guest allocated all the memory at
> > > > boot.
> > >
> > > Ah, so it looks like -mem-prealloc doesn't work on Power? Can you
> > > please check:
> > >
> > > 1) that -mem-prealloc is on the qemu command line
> >
> > Yes. This is the cmd line generated:
> >
> > /usr/bin/qemu-system-ppc64 \
> > -name guest=vm1,debug-threads=on \
> > -S \
> > -object secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes \
> > -machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
> > -bios /home/user/boot_rom.bin \
> > -m 307200 \
> > -mem-prealloc \
> > -realtime mlock=off \

This looks correct.

> > -smp 16,sockets=16,cores=1,threads=1 \
> > -uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
> > -display none \
> > -no-user-config \
> > -nodefaults \
> > -chardev socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait \
> > -mon chardev=charmonitor,id=monitor,mode=control \
> > -rtc base=utc \
> > -no-shutdown \
> > -boot strict=on \
> > -device spapr-pci-host-bridge,index=1,id=pci.1 \
> > -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
> > -drive file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
> > -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> > -chardev pty,id=charserial0 \
> > -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
> > -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
> > -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> > -msg timestamp=on
> >
> > > 2) how much memory qemu allocates right after it has started the
> > > guest? I mean, before you start some mem stress test which causes it
> > > to allocate the memory fully.
> >
> > It starts with 300GB. It depletes its assigned NUMA node (which has
> > 256GB), then it takes ~70GB from another NUMA node to complete the
> > 300GB.
>
> Huh, then -mem-prealloc is working but something else is not. What
> strikes me is that once the guest starts using the memory, the host
> kernel kills the guest. So the host kernel knows about the limits we've
> set but doesn't enforce them when allocating the memory.

The way QEMU implements -mem-prealloc is a bit of a hack. Essentially it
tries to write a single byte in each page of memory, on the belief that
this will cause the kernel to allocate that page. See do_touch_pages()
in qemu's util/oslib-posix.c:

    for (i = 0; i < numpages; i++) {
        /*
         * Read & write back the same value, so we don't
         * corrupt existing user/app data that might be
         * stored.
         *
         * 'volatile' to stop compiler optimizing this away
         * to a no-op
         *
         * TODO: get a better solution from kernel so we
         * don't need to write at all so we don't cause
         * wear on the storage backing the region...
         */
        *(volatile char *)addr = *addr;
        addr += hpagesize;
    }

I wonder if the compiler on PPC is optimizing this in some way that
turns it into a no-op unexpectedly.
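One way to test that hypothesis outside of QEMU would be a standalone
reproducer using the same read-then-write-back idiom. This is just a
minimal sketch, not QEMU code: the page count, the rss_pages() helper
and the /proc/self/statm check are illustrative choices. It mmaps an
anonymous region, touches every page the way do_touch_pages() does, and
compares the resident set size before and after. If RSS grows by
roughly the page count on x86 but not on ppc64, the volatile store is
being elided there:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Resident set size of this process, in pages, per /proc/self/statm. */
    static long rss_pages(void)
    {
        long size, resident = -1;
        FILE *f = fopen("/proc/self/statm", "r");
        if (f) {
            if (fscanf(f, "%ld %ld", &size, &resident) != 2)
                resident = -1;
            fclose(f);
        }
        return resident;
    }

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        long numpages = 1024;
        char *addr = mmap(NULL, numpages * pagesize,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        long before = rss_pages();

        /* Same idiom as do_touch_pages(): read each page and write the
         * value back through a volatile pointer, so the fault actually
         * allocates the page without corrupting its contents. */
        for (long i = 0; i < numpages; i++) {
            *(volatile char *)addr = *addr;
            addr += pagesize;
        }

        long after = rss_pages();
        printf("RSS grew by %ld pages (expected ~%ld)\n",
               after - before, numpages);
        return 0;
    }

Building it with the same compiler and optimization flags as the qemu
package and inspecting the loop in the objdump -d output would also show
directly whether the store survives on ppc64.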
Regards,
Daniel

-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|