On Mon, Jul 02, 2012 at 07:56:58PM +0100, Daniel P. Berrange wrote:
> On Mon, Jul 02, 2012 at 03:06:32PM -0300, Eduardo Habkost wrote:
> > Resending series, after fixing some coding style issues. Does anybody have any
> > feedback about this proposal?
> >
> > Changes v1 -> v2:
> >  - Coding style fixes
> >
> > Original cover letter:
> >
> > I was investigating whether there are any mechanisms that allow manual pinning of
> > guest RAM to specific host NUMA nodes, in the case of multi-node KVM guests, and
> > noticed that -mem-path could be used for that, except that it currently removes
> > any files it creates (using mkstemp()) immediately, so numactl cannot be
> > used on the backing files. These patches add a -keep-mem-path-files
> > option to make QEMU create the files inside -mem-path with more predictable
> > names, and not remove them after creation.
> >
> > Some previous discussions about the subject, for reference:
> >  - Message-ID: <1281534738-8310-1-git-send-email-andre.przywara@xxxxxxx>
> >    http://article.gmane.org/gmane.comp.emulators.kvm.devel/57684
> >  - Message-ID: <4C7D7C2A.7000205@xxxxxxxxxxxxx>
> >    http://article.gmane.org/gmane.comp.emulators.kvm.devel/58835
> >
> > A more recent thread can be found at:
> >  - Message-ID: <20111029184502.GH11038@xxxxxxxxxx>
> >    http://article.gmane.org/gmane.comp.emulators.qemu/123001
> >
> > Note that this is just a mechanism to facilitate manual static binding using
> > numactl on hugetlbfs later, for optimization. This may be especially useful for
> > single large multi-node guest use cases (and, of course, has to be used with
> > care).
> >
> > I don't know if it is a good idea to use the memory range names as a publicly
> > visible interface. Another option may be to use a single file instead, and mmap
> > different regions inside the same file for each memory region. I am open to
> > comments and suggestions.
> >
> > Example (untested) usage to manually bind each half of a guest's RAM to a
> > different NUMA node:
> >
> >  $ qemu-system-x86_64 [...] -m 2048 -smp 4 \
> >    -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
> >    -mem-prealloc -keep-mem-path-files -mem-path /mnt/hugetlbfs/FOO
> >  $ numactl --offset=1G --length=1G --membind=1 --file /mnt/hugetlbfs/FOO/pc.ram
> >  $ numactl --offset=0 --length=1G --membind=2 --file /mnt/hugetlbfs/FOO/pc.ram
>
> I'd suggest that instead of making the memory file name into a
> public ABI QEMU needs to maintain, QEMU could expose the info
> via a monitor command. eg
>
>  $ qemu-system-x86_64 [...] -m 2048 -smp 4 \
>    -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
>    -mem-prealloc -mem-path /mnt/hugetlbfs/FOO \
>    -monitor stdio
>  (qemu) info mem-nodes
>  node0: file=/proc/self/fd/3, offset=0G, length=1G
>  node1: file=/proc/self/fd/3, offset=1G, length=1G
>
> This example takes advantage of the fact that with Linux, you can
> still access a deleted file via /proc/self/fd/NNN, which AFAICT,
> would avoid the need for a --keep-mem-path-files.

I like the suggestion.

But other processes still need to be able to open those files if we want
to do anything useful with them. In this case, I guess it's better to
let QEMU itself build a "/proc/<getpid()>/fd/<fd>" string instead of
using "/proc/self" and forcing the client to find out what the right
PID is?

Anyway, even if we want to avoid file-descriptor and /proc tricks, we
can still use the interface you suggest. Then we wouldn't need to have
any filename assumptions: the filenames could be completely random, as
they would be reported using the new monitor command.

>
> By returning info via a monitor command you also avoid hardcoding
> the use of 1 single file for all of memory. You also avoid hardcoding
> the fact that QEMU stores the nodes in contiguous order inside the
> file. eg QEMU could easily return data like this
>
>
>  $ qemu-system-x86_64 [...] \
>    -m 2048 -smp 4 \
>    -numa node,cpus=0-1,mem=1024 -numa node,cpus=2-3,mem=1024 \
>    -mem-prealloc -mem-path /mnt/hugetlbfs/FOO \
>    -monitor stdio
>  (qemu) info mem-nodes
>  node0: file=/proc/self/fd/3, offset=0G, length=1G
>  node1: file=/proc/self/fd/4, offset=0G, length=1G
>
> or more ingenious options

Sounds good.

-- 
Eduardo
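
For reference, the /proc trick discussed above can be tried from a plain
shell. This is only a minimal sketch, assuming Linux with /proc mounted:
proc_fd_path is a hypothetical helper, the temporary file is a stand-in
for a -mem-path backing file, and none of this is QEMU code. It builds the
PID-qualified path rather than /proc/self, so that a different process can
open the deleted file.

```shell
#!/bin/sh
# Sketch only: shows that a deleted file stays reachable through
# /proc/<pid>/fd/<fd> on Linux. Names here are illustrative assumptions.

# proc_fd_path PID FD: print the /proc path for an open descriptor.
proc_fd_path() {
    printf '/proc/%s/fd/%s\n' "$1" "$2"
}

backing=$(mktemp)          # stand-in for a file created under -mem-path
exec 3<>"$backing"         # hold it open on fd 3 for the rest of the script
printf 'guest-ram\n' >&3   # write some recognizable content
rm -f "$backing"           # unlink it, leaving only the open descriptor

# Use this shell's PID, not "self", so the path is valid in other processes.
path=$(proc_fd_path "$$" 3)

# cat runs as a separate process and reopens the deleted file via /proc.
content=$(cat "$path")
echo "$path -> $content"

exec 3>&-                  # closing the last descriptor frees the file
```

Whether numactl accepts such a path for --file is not something this
sketch checks; it only illustrates why reporting a PID-qualified path
would let an external tool find the backing file.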