On Mon, Jul 02, 2012 at 03:06:32PM -0300, Eduardo Habkost wrote:
> Resending series, after fixing some coding style issues. Does anybody have
> any feedback about this proposal?
>
> Changes v1 -> v2:
>  - Coding style fixes
>
> Original cover letter:
>
> I was investigating whether there are any mechanisms that allow manual
> pinning of guest RAM to specific host NUMA nodes, in the case of
> multi-node KVM guests, and noticed that -mem-path could be used for that,
> except that it currently removes any files it creates (using mkstemp())
> immediately, and as a result does not allow numactl to be used on the
> backing files. These patches add a -keep-mem-path-files option to make
> QEMU create the files inside -mem-path with more predictable names, and
> not remove them after creation.
>
> Some previous discussions about the subject, for reference:
>  - Message-ID: <1281534738-8310-1-git-send-email-andre.przywara@xxxxxxx>
>    http://article.gmane.org/gmane.comp.emulators.kvm.devel/57684
>  - Message-ID: <4C7D7C2A.7000205@xxxxxxxxxxxxx>
>    http://article.gmane.org/gmane.comp.emulators.kvm.devel/58835
>
> A more recent thread can be found at:
>  - Message-ID: <20111029184502.GH11038@xxxxxxxxxx>
>    http://article.gmane.org/gmane.comp.emulators.qemu/123001
>
> Note that this is just a mechanism to facilitate manual static binding
> using numactl on hugetlbfs later, for optimization. This may be
> especially useful for single large multi-node guest use cases (and, of
> course, has to be used with care).
>
> I don't know if it is a good idea to use the memory range names as a
> publicly-visible interface. Another option may be to use a single file
> instead, and mmap different regions inside the same file for each memory
> region. I am open to comments and suggestions.
>
> Example (untested) usage to bind each half of the RAM of a guest manually
> to a different NUMA node:
>
> $ qemu-system-x86_64 [...] -m 2048 -smp 4 \
>     -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
>     -mem-prealloc -keep-mem-path-files -mem-path /mnt/hugetlbfs/FOO
> $ numactl --offset=1G --length=1G --membind=1 --file /mnt/hugetlbfs/FOO/pc.ram
> $ numactl --offset=0 --length=1G --membind=2 --file /mnt/hugetlbfs/FOO/pc.ram

I'd suggest that instead of making the memory file name into a public ABI
that QEMU needs to maintain, QEMU could expose the info via a monitor
command, e.g.:

  $ qemu-system-x86_64 [...] -m 2048 -smp 4 \
      -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
      -mem-prealloc -mem-path /mnt/hugetlbfs/FOO \
      -monitor stdio
  (qemu) info mem-nodes
  node0: file=/proc/self/fd/3, offset=0G, length=1G
  node1: file=/proc/self/fd/3, offset=1G, length=1G

This example takes advantage of the fact that on Linux you can still access
a deleted file via /proc/self/fd/NNN, which, AFAICT, would avoid the need
for --keep-mem-path-files.

By returning the info via a monitor command you also avoid hardcoding the
use of a single file for all of memory, and you avoid hardcoding the fact
that QEMU stores the nodes in contiguous order inside the file. For
example, QEMU could easily return data like this:

  $ qemu-system-x86_64 [...] -m 2048 -smp 4 \
      -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
      -mem-prealloc -mem-path /mnt/hugetlbfs/FOO \
      -monitor stdio
  (qemu) info mem-nodes
  node0: file=/proc/self/fd/3, offset=0G, length=1G
  node1: file=/proc/self/fd/4, offset=0G, length=1G

or more ingenious options.

Regards,
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
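[Editorial note: the /proc/self/fd trick Daniel relies on can be sketched with a
quick shell demo. This is a minimal, QEMU-independent illustration; the file
name and contents are arbitrary, and it assumes a Linux /proc filesystem.]

```shell
# Demonstrate that an unlinked file stays readable through /proc/self/fd
# for as long as some file descriptor to it remains open (Linux-only).
tmp=$(mktemp)
echo "guest ram" > "$tmp"
exec 3< "$tmp"         # keep fd 3 open on the file
rm "$tmp"              # unlink it, like QEMU's mkstemp()-then-unlink pattern
cat /proc/self/fd/3    # prints "guest ram": data still reachable via the fd
exec 3<&-              # close the descriptor; the storage is now freed
```

This is exactly why a monitor command reporting fd-based paths would let a tool
like numactl operate on the backing file even though QEMU deleted it at startup.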