On Mon, Jul 02, 2012 at 04:54:03PM -0300, Eduardo Habkost wrote:
> On Mon, Jul 02, 2012 at 07:56:58PM +0100, Daniel P. Berrange wrote:
> > On Mon, Jul 02, 2012 at 03:06:32PM -0300, Eduardo Habkost wrote:
> > > Resending series, after fixing some coding style issues. Does anybody
> > > have any feedback about this proposal?
> > >
> > > Changes v1 -> v2:
> > > - Coding style fixes
> > >
> > > Original cover letter:
> > >
> > > I was investigating whether there are any mechanisms that allow manual
> > > pinning of guest RAM to specific host NUMA nodes in the case of
> > > multi-node KVM guests, and noticed that -mem-path could be used for
> > > that, except that it currently removes any files it creates (using
> > > mkstemp()) immediately, which prevents numactl from being used on the
> > > backing files. This patch series adds a -keep-mem-path-files option to
> > > make QEMU create the files inside -mem-path with more predictable
> > > names, and not remove them after creation.
> > >
> > > Some previous discussions about the subject, for reference:
> > > - Message-ID: <1281534738-8310-1-git-send-email-andre.przywara@xxxxxxx>
> > >   http://article.gmane.org/gmane.comp.emulators.kvm.devel/57684
> > > - Message-ID: <4C7D7C2A.7000205@xxxxxxxxxxxxx>
> > >   http://article.gmane.org/gmane.comp.emulators.kvm.devel/58835
> > >
> > > A more recent thread can be found at:
> > > - Message-ID: <20111029184502.GH11038@xxxxxxxxxx>
> > >   http://article.gmane.org/gmane.comp.emulators.qemu/123001
> > >
> > > Note that this is just a mechanism to facilitate manual static binding
> > > using numactl on hugetlbfs later, for optimization. This may be
> > > especially useful for single large multi-node guest use cases (and, of
> > > course, has to be used with care).
> > >
> > > I don't know if it is a good idea to use the memory range names as a
> > > publicly-visible interface.
> > > Another option may be to use a single file instead, and mmap
> > > different regions inside the same file for each memory region. I am
> > > open to comments and suggestions.
> > >
> > > Example (untested) usage to manually bind each half of the RAM of a
> > > guest to a different NUMA node:
> > >
> > > $ qemu-system-x86_64 [...] -m 2048 -smp 4 \
> > >     -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
> > >     -mem-prealloc -keep-mem-path-files -mem-path /mnt/hugetlbfs/FOO
> > > $ numactl --offset=1G --length=1G --membind=1 --file /mnt/hugetlbfs/FOO/pc.ram
> > > $ numactl --offset=0 --length=1G --membind=2 --file /mnt/hugetlbfs/FOO/pc.ram
> >
> > I'd suggest that instead of making the memory file name into a
> > public ABI QEMU needs to maintain, QEMU could expose the info
> > via a monitor command. eg
> >
> >   $ qemu-system-x86_64 [...] -m 2048 -smp 4 \
> >       -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
> >       -mem-prealloc -mem-path /mnt/hugetlbfs/FOO \
> >       -monitor stdio
> >   (qemu) info mem-nodes
> >   node0: file=/proc/self/fd/3, offset=0G, length=1G
> >   node1: file=/proc/self/fd/3, offset=1G, length=1G
> >
> > This example takes advantage of the fact that with Linux, you can
> > still access a deleted file via /proc/self/fd/NNN, which AFAICT
> > would avoid the need for --keep-mem-path-files.
>
> I like the suggestion.
>
> But other processes still need to be able to open those files if we want
> to do anything useful with them. In this case, I guess it's better to
> let QEMU itself build a "/proc/<getpid()>/fd/<fd>" string instead of
> using "/proc/self" and forcing the client to figure out the right PID?
>
> Anyway, even if we want to avoid file-descriptor and /proc tricks, we
> can still use the interface you suggest. Then we wouldn't need to have
> any filename assumptions: the filenames could be completely random, as
> they would be reported using the new monitor command.

Oops, yes of course.
I did intend that client apps could use the files, so I should have used
/proc/$PID and not /proc/self.

> > By returning info via a monitor command you also avoid hardcoding
> > the use of a single file for all of memory. You also avoid hardcoding
> > the fact that QEMU stores the nodes in contiguous order inside the
> > file. eg QEMU could easily return data like this:
> >
> >   $ qemu-system-x86_64 [...] -m 2048 -smp 4 \
> >       -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
> >       -mem-prealloc -mem-path /mnt/hugetlbfs/FOO \
> >       -monitor stdio
> >   (qemu) info mem-nodes
> >   node0: file=/proc/self/fd/3, offset=0G, length=1G
> >   node1: file=/proc/self/fd/4, offset=0G, length=1G
> >
> > or more ingenious options
>
> Sounds good.
>
> --
> Eduardo

--
|: http://berrange.com       -o-   http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org        -o-              http://virt-manager.org :|
|: http://autobuild.org      -o-          http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o-        http://live.gnome.org/gtk-vnc :|
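[Editor's note: the deleted-file-via-/proc trick discussed above can be demonstrated outside QEMU. The sketch below (Python, Linux-only; the "qemu-ram-" prefix and "guest-ram" payload are made up for illustration) creates and immediately unlinks a file, as QEMU's mkstemp()-based -mem-path code does, then reopens it through the /proc/<pid>/fd/<fd> path that another process could use as long as the descriptor stays open.]

```python
import os
import tempfile

# Create and immediately delete a backing file, mimicking QEMU's
# mkstemp()-then-unlink behaviour for -mem-path.
fd, path = tempfile.mkstemp(prefix="qemu-ram-")
os.unlink(path)
os.write(fd, b"guest-ram")

# Build the /proc/<pid>/fd/<fd> path that an external tool (numactl,
# a management app, ...) could open, instead of /proc/self, which
# would resolve to the *client's* own fd table.
proc_path = "/proc/%d/fd/%d" % (os.getpid(), fd)

# The deleted file is still fully accessible through /proc on Linux:
# opening the magic symlink yields a fresh descriptor to the same inode.
with open(proc_path, "rb") as f:
    data = f.read()
print(data.decode())  # prints: guest-ram

os.close(fd)
```

This is exactly why the thread notes that QEMU itself should format the PID into the path: a separate process opening "/proc/self/fd/3" would look at its own descriptors, not QEMU's.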
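[Editor's note: if an "info mem-nodes" command like the one sketched in the thread were added, a management tool would only need to parse its output before invoking numactl. Below is a hedged Python sketch; the line format is the hypothetical one from Daniel's example, not an existing QEMU interface, and parse_mem_nodes is an illustrative helper name.]

```python
import re

def parse_mem_nodes(text):
    """Parse lines in the hypothetical 'info mem-nodes' format, e.g.
    'node0: file=/proc/self/fd/3, offset=0G, length=1G'."""
    pattern = re.compile(
        r"(?P<node>node\d+): file=(?P<file>\S+), "
        r"offset=(?P<offset>\S+), length=(?P<length>\S+)")
    nodes = []
    for line in text.splitlines():
        m = pattern.match(line.strip())
        if m:
            nodes.append(m.groupdict())
    return nodes

sample = """\
node0: file=/proc/self/fd/3, offset=0G, length=1G
node1: file=/proc/self/fd/4, offset=0G, length=1G
"""

for n in parse_mem_nodes(sample):
    # A management tool could now run, per node:
    #   numactl --offset=<offset> --length=<length> --membind=<N> --file <file>
    print(n["node"], n["file"], n["offset"], n["length"])
```

With this, none of the backing file names need to be a stable ABI, which is the point of the monitor-command approach over -keep-mem-path-files.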