On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
> On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:
> >Hi!
> >
> >In my view the trouble of the numa hard bindings is not the fact
> >they're hard and qemu has to also decide the location (in fact it
> >doesn't need to decide the location if you use cpusets and relative
> >mbinds). The bigger problem is the fact either the admin or the app
> >developer has to explicitly scan the numa physical topology (both cpus
> >and memory) and tell the kernel how much memory to bind to each
> >thread. ms_mbind/ms_tbind only partially solve that problem. They're
> >similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> >don't need an admin or a cpuset-job-scheduler (or a perl script) to
> >redistribute the hardware resources.
>
> Well yeah, of course the guest needs to see some topology. I don't
> see why we'd have to actually scan the host for this though. All we
> need to tell the kernel is "this memory region is close to that
> thread".
>
> So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able
> to tell the kernel that this GB of RAM actually is close to that
> vCPU thread.
>
> Of course the admin still needs to decide how to split up memory.
> That's the deal with emulating real hardware. You get the interfaces
> hardware gets :). However, if you follow a reasonable default
> strategy such as numa splitting your RAM into equal chunks between
> guest vCPUs you're probably close enough to optimal usage models. Or
> at least you could have a close enough approximation of how this
> mapping could work for the _guest_ regardless of the host and when
> you migrate it somewhere else it should also work reasonably well.

Allowing specification of the numa nodes to qemu, allowing qemu to
create cpu+mem groupings (without binding), and letting the kernel
decide how to manage them seems like a reasonable incremental step
between no guest/host NUMA awareness and automatic NUMA configuration
in the host kernel. It would suffice for the current needs we see.

Besides migration, we also have use cases where we may want large
multi-node VMs that are static (like LPARs); having the guest aware of
the topology is helpful there. Also, if the topology changes at all
due to migration or host kernel decisions, we can use something like
the VPHN (virtual processor home node) capability on Power systems to
have the guest kernel update its topology knowledge. You can refer to
that in arch/powerpc/mm/numa.c. Otherwise, as long as the host kernel
maintains the mappings requested by ms_tbind()/ms_mbind(), we can
create the guest topology correctly and optimize for NUMA. This would
work for us.

Thanks
Dipankar
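
To make the "-numa node,mem=1G,cpus=0" grouping concrete, below is a
minimal C sketch of how a QEMU-like process might describe one guest
node to the kernel through the proposed ms_tbind()/ms_mbind() interface.
This is a sketch under assumptions, not the interface as actually
posted: the syscall numbers, the argument lists, and the "group id"
semantics here are all placeholders invented for illustration.

/*
 * Hypothetical sketch only: the ms_tbind()/ms_mbind() syscall numbers
 * and signatures below are assumptions for illustration, not the
 * interface that was actually proposed or merged.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/mman.h>

#define __NR_ms_tbind  400   /* assumed syscall number, illustrative only */
#define __NR_ms_mbind  401   /* assumed syscall number, illustrative only */

/* Assumed form: associate a thread with a memory-scheduling group. */
static long ms_tbind(int group, pid_t tid)
{
        return syscall(__NR_ms_tbind, group, tid);
}

/* Assumed form: associate a memory range with the same group. */
static long ms_mbind(int group, void *addr, size_t len)
{
        return syscall(__NR_ms_mbind, group, addr, len);
}

int main(void)
{
        /* Emulate "-numa node,mem=1G,cpus=0": one group per guest node. */
        const size_t node_mem = 1UL << 30;
        void *ram = mmap(NULL, node_mem, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED)
                return 1;

        int group = 0;                /* guest node 0 */
        pid_t vcpu0_tid = getpid();   /* stand-in for the vCPU 0 thread id */

        /*
         * Tell the kernel "this thread and this RAM belong together";
         * the kernel, not QEMU, decides where to place them on the host.
         */
        if (ms_tbind(group, vcpu0_tid) < 0 ||
            ms_mbind(group, ram, node_mem) < 0)
                perror("ms_tbind/ms_mbind (not available on this kernel)");

        return 0;
}

The point of the sketch is the division of labour discussed above: QEMU
only states which vCPU threads and which guest-RAM ranges belong to the
same guest node, and placement on host nodes stays with the kernel, so
no host topology scan or hard binding is needed in QEMU.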