On Fri, Apr 12, 2019 at 01:15:05PM +0200, Michal Privoznik wrote:
> On 4/12/19 12:11 PM, Daniel Henrique Barboza wrote:
> > On 4/12/19 6:10 AM, Michal Privoznik wrote:
> > > On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
> > > > On 4/11/19 11:56 AM, Michal Privoznik wrote:
> > > > > On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I've tested these patches again, twice, in setups similar to the
> > > > > > ones I used for the first version (first on a Power8, then on a
> > > > > > Power9 server).
> > > > > >
> > > > > > Same results, though. Libvirt will not avoid launching a pseries
> > > > > > guest with numanode=strict, even if the NUMA node does not have
> > > > > > enough available RAM. If I stress-test the memory of the guest to
> > > > > > force the allocation, QEMU exits with an error as soon as the
> > > > > > memory of the host NUMA node is exhausted.
> > > > >
> > > > > Yes, this is expected. I mean, by default qemu doesn't allocate
> > > > > memory for the guest fully. You'd have to force it:
> > > > >
> > > > >   <memoryBacking>
> > > > >     <allocation mode='immediate'/>
> > > > >   </memoryBacking>
> > > >
> > > > Tried with this extra setting, still no good. The domain still boots,
> > > > even if there is not enough memory to load all of its RAM into the
> > > > NUMA node I am setting. For reference, this is the top of the guest
> > > > XML:
> > > >
> > > >   <name>vm1</name>
> > > >   <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
> > > >   <memory unit='KiB'>314572800</memory>
> > > >   <currentMemory unit='KiB'>314572800</currentMemory>
> > > >   <memoryBacking>
> > > >     <allocation mode='immediate'/>
> > > >   </memoryBacking>
> > > >   <vcpu placement='static'>16</vcpu>
> > > >   <numatune>
> > > >     <memory mode='strict' nodeset='0'/>
> > > >   </numatune>
> > > >   <os>
> > > >     <type arch='ppc64' machine='pseries'>hvm</type>
> > > >     <boot dev='hd'/>
> > > >   </os>
> > > >   <clock offset='utc'/>
> > > >
> > > > While doing this test, I recalled that some of my IBM peers recently
> > > > mentioned that they were unable to do a pre-allocation of the RAM of
> > > > a pseries guest using Libvirt, but they were able to do it using QEMU
> > > > directly (using -realtime mlock=on). In fact, I just tried it out
> > > > with command-line QEMU and the guest allocated all the memory at
> > > > boot.
> > >
> > > Ah, so it looks like -mem-prealloc doesn't work on Power? Can you
> > > please check:
> > >
> > > 1) that -mem-prealloc is on the qemu command line
> >
> > Yes. This is the cmd line generated:
> >
> > /usr/bin/qemu-system-ppc64 \
> > -name guest=vm1,debug-threads=on \
> > -S \
> > -object secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes \
> > -machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
> > -bios /home/user/boot_rom.bin \
> > -m 307200 \
> > -mem-prealloc \
> > -realtime mlock=off \

This looks correct.

> > -smp 16,sockets=16,cores=1,threads=1 \
> > -uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
> > -display none \
> > -no-user-config \
> > -nodefaults \
> > -chardev socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait \
> > -mon chardev=charmonitor,id=monitor,mode=control \
> > -rtc base=utc \
> > -no-shutdown \
> > -boot strict=on \
> > -device spapr-pci-host-bridge,index=1,id=pci.1 \
> > -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
> > -drive file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
> > -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> > -chardev pty,id=charserial0 \
> > -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
> > -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
> > -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> > -msg timestamp=on
> >
> > > 2) how much memory qemu allocates right after it has started the
> > > guest? I mean, before you start some mem stress test which causes it
> > > to allocate the memory fully.
> >
> > It starts with 300GB. It depletes its assigned NUMA node (which has
> > 256GB), then it takes ~70GB from another NUMA node to complete the
> > 300GB.
>
> Huh, then -mem-prealloc is working but something else is not. What
> strikes me is that once the guest starts using the memory, the host
> kernel kills the guest. So the host kernel knows about the limits we've
> set but doesn't enforce them when allocating the memory.

The way QEMU implements -mem-prealloc is a bit of a hack. Essentially it
tries to write a single byte in each page of memory, on the belief that
this will cause the kernel to allocate that page. See do_touch_pages()
in qemu's util/oslib-posix.c:

    for (i = 0; i < numpages; i++) {
        /*
         * Read & write back the same value, so we don't
         * corrupt existing user/app data that might be
         * stored.
         *
         * 'volatile' to stop compiler optimizing this away
         * to a no-op
         *
         * TODO: get a better solution from kernel so we
         * don't need to write at all so we don't cause
         * wear on the storage backing the region...
         */
        *(volatile char *)addr = *addr;
        addr += hpagesize;
    }

I wonder if the compiler on PPC is optimizing this in some way that
turns it into a no-op unexpectedly.
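One way to test that hypothesis outside of QEMU would be a standalone
reproducer using the same read-then-write-back idiom. This is just a
minimal sketch, not QEMU code: the page count, the rss_pages() helper
and the /proc/self/statm check are illustrative choices. It mmaps an
anonymous region, touches every page the way do_touch_pages() does, and
compares the resident set size before and after. If RSS grows by
roughly the page count on x86 but not on ppc64, the volatile store is
being elided there:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Resident set size of this process, in pages, per /proc/self/statm. */
    static long rss_pages(void)
    {
        long size, resident = -1;
        FILE *f = fopen("/proc/self/statm", "r");
        if (f) {
            if (fscanf(f, "%ld %ld", &size, &resident) != 2)
                resident = -1;
            fclose(f);
        }
        return resident;
    }

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        long numpages = 1024;
        char *addr = mmap(NULL, numpages * pagesize,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        long before = rss_pages();

        /* Same idiom as do_touch_pages(): read each page and write the
         * value back through a volatile pointer, so the fault actually
         * allocates the page without corrupting its contents. */
        for (long i = 0; i < numpages; i++) {
            *(volatile char *)addr = *addr;
            addr += pagesize;
        }

        long after = rss_pages();
        printf("RSS grew by %ld pages (expected ~%ld)\n",
               after - before, numpages);
        return 0;
    }

Building it with the same compiler and optimization flags as the qemu
package and inspecting the loop in the objdump -d output would also show
directly whether the store survives on ppc64.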
Regards,
Daniel

-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|