On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> >
> >> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> >>
> >>> Anthony Liguori wrote:
> >>>
> >>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>>>
> >>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >>>>>
> >>>>>> According to the user-provided assignment, bind the respective
> >>>>>> part of the guest's memory to the given host node. This uses
> >>>>>> Linux' mbind syscall (which is wrapped only in libnuma) to
> >>>>>> realize the pinning right after the allocation.
> >>>>>> Failures are not fatal, but produce a warning.
> >>>>>>
> >>>>>> Signed-off-by: Andre Przywara <andre.przywara@xxxxxxx>
> >>>>>> ...
> >>>>>>
> >>>>> Why is it not possible (or perhaps not desired) to change the
> >>>>> binding after the guest is started?
> >>>>>
> >>>>> Sounds inflexible.
> >>>>>
> >>> The solution is to introduce a monitor interface to adjust the
> >>> pinning later, allowing both changing the affinity only (valid only
> >>> for future fault-ins) and actually copying the memory (more costly).
> >>>
> >> This is just duplicating numactl.
> >>
> >>> Actually this is the next item on my list, but I wanted to bring up
> >>> the basics first to avoid recoding parts afterwards. Also, I am not
> >>> (yet) familiar with the QMP protocol.
> >>>
> >>>> We really need a solution that lets a user use a tool like numactl
> >>>> outside of the QEMU instance.
> >>>>
> >>> I fear that is not how it's meant to work with the Linux NUMA API.
> >>> In contrast to the VCPU threads, which are externally visible
> >>> entities (PIDs), the memory should be private to the QEMU process.
> >>> While you can change the NUMA allocation policy of the _whole_
> >>> process, there is no way to externally distinguish parts of the
> >>> process' memory. Although you could later (and externally) migrate
> >>> already-faulted pages (via move_pages(2) and by looking at
> >>> /proc/$$/numa_maps), you would let an external tool interfere with
> >>> QEMU's internal memory management. Take, for instance, the change
> >>> of the allocation policy regarding the 1MB and 3.5-4GB holes. An
> >>> external tool would have to either track such changes, or you
> >>> simply could not change such things in QEMU.
> >>>
> >> It's extremely likely that if you're doing NUMA pinning, you're also
> >> doing large pages via hugetlbfs. numactl can already set policies
> >> for files in hugetlbfs, so all you need to do is have a separate
> >> hugetlbfs file for each NUMA node.
> >>
> > Why would we resort to hugetlbfs when we have transparent hugepages?
> >
> If you care about NUMA pinning, I can't believe you don't want
> guaranteed large page allocation, which THP does not provide.

I personally want a more automatic approach to placing VMs in NUMA nodes
(not directed by the qemu process itself), but I'd also like to support
a user's desire to pin and place cpus and memory, especially for large
VMs that need to be defined as multi-node. For user-defined pinning,
libhugetlbfs will probably be fine, but for most VMs I'd like to ensure
we can do things like ballooning well, and I am not so sure that will be
easy with libhugetlbfs.

> The general point, though, is that we should find a way to partition
> memory in qemu such that an external process can control the actual
> NUMA placement. This gives us maximum flexibility.
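For context, the in-QEMU pinning the patch describes comes down to an
mbind(2) call on the freshly allocated guest memory, something like the
sketch below. This is not the actual patch code; the function name and
sizes are illustrative, and you need to link with -lnuma since glibc
does not wrap mbind:

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>          /* mbind(), MPOL_BIND -- link with -lnuma */

/* Allocate anonymous memory and bind it to one host node.  As in the
 * patch description, a failed bind only produces a warning. */
static void *alloc_on_node(size_t len, int node)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    unsigned long nodemask = 1UL << node;   /* assumes node < 64 */
    if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        fprintf(stderr, "warning: mbind to node %d failed: %s\n",
                node, strerror(errno));
    return p;
}

The policy only takes effect as pages are faulted in, which is why
changing it later is cheap for not-yet-touched memory but needs an
explicit copy/migration for pages already faulted in, i.e. the two
monitor operations Andre describes above.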
> Otherwise, what do we implement in QEMU? Direct pinning of memory to
> nodes? Can we migrate memory between nodes? Should we support
> interleaving memory between two virtual nodes? Why pick and choose
> when we can have it all.

If there were a better way to do this than hugetlbfs, then I don't think
I would shy away from it. Is there another way to change NUMA policies
on mappings from a user tool? We can already inspect them with
/proc/<pid>/numa_maps. Is this something that could be added to numactl?

> > FWIW, large apps like databases have set a precedent for managing
> > their own NUMA policies.
>
> Of course, because they know what their NUMA policy should be. They
> live in a simple world where they assume they're the only application
> in the system, they read the distance tables, figure they'll use XX%
> of all physical memory, and then pin how they see fit.
>
> But an individual QEMU process lives in a complex world. It's almost
> never the only thing on the system, and it's only allowed to use a
> subset of resources. It's not sure what set of resources it can and
> can't use, and that's often changing. The topology chosen for a guest
> is static, but its host topology may be dynamic due to things like
> live migration.

True, that's why this would require support to change it in the monitor.

> In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.
> Instead, it needs to let something with a larger view of the system
> determine a NUMA policy that makes sense overall.

I agree.

> There are two ways we can do this. We can implement monitor commands
> that attempt to expose every single NUMA tunable possible. Or, we can
> tie into the existing commands, which guarantee that we support every
> possible tunable and that, as NUMA support in Linux evolves, we get
> all the new features for free.

Assuming there's no new thing one needs to expose in qemu to work with
whatever new feature numactl/libnuma gets. But perhaps that's a lot
less likely.

> And, since numactl already supports setting policies on files in
> hugetlbfs, all we need is a simple change to qemu to allow -mem-path
> to work per-node instead of globally. And it's useful to implement
> other types of things, like having one node be guaranteed large pages
> and another node THP or some other fanciness.

If it were not dependent on hugetlbfs, then I don't think I would have
an issue.

> Sounds awfully appealing to me.
>
> > I don't see why qemu should be any different.
> > Numactl is great for small apps that need to be pinned in one node,
> > or spread evenly on all nodes. Having to get hugetlbfs involved just
> > to work around a shortcoming of numactl just seems like a bad idea.
> >
> You seem to be asserting that we should implement a full NUMA policy
> in QEMU. What should it be when we don't (in QEMU) know what else is
> running on the system?

I don't think qemu itself should decide where to "be" on the system. I
would like to have -something- else make those decisions, either a user
or some mgmt daemon that looks at the whole picture. Or <gulp> get the
scheduler involved (with new algorithms). I am still quite curious
whether numactl/libnuma could be extended to set some policies on
individual mappings. Then we would not even need multiple -mem-path's.

-Andrew

> Regards,
>
> Anthony Liguori
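P.S. For reference, the external migration path that already exists
today, move_pages(2) driven by /proc/<pid>/numa_maps as Andre mentioned,
looks roughly like the sketch below from an outside tool's point of
view. Again, only a sketch with illustrative names: the address range
and page size would come from parsing numa_maps, and acting on another
process's pages needs appropriate privileges (CAP_SYS_NICE or a matching
uid):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <numaif.h>      /* move_pages(), MPOL_MF_MOVE -- link with -lnuma */

/* Move npages pages of process 'pid', starting at 'start' (an address
 * in that process), to 'target_node'.  Only pages that are actually
 * present get moved; 'status' holds the per-page result or error. */
static int migrate_range(pid_t pid, unsigned long start,
                         unsigned long npages, long page_size,
                         int target_node)
{
    void **pages  = calloc(npages, sizeof(*pages));
    int   *nodes  = calloc(npages, sizeof(*nodes));
    int   *status = calloc(npages, sizeof(*status));
    long ret = -1;

    if (pages && nodes && status) {
        for (unsigned long i = 0; i < npages; i++) {
            pages[i] = (void *)(start + i * page_size);  /* target's address space */
            nodes[i] = target_node;
        }
        ret = move_pages(pid, npages, pages, nodes, status, MPOL_MF_MOVE);
        if (ret < 0)
            perror("move_pages");
    }
    free(pages);
    free(nodes);
    free(status);
    return ret < 0 ? -1 : 0;
}

It only works for pages that have already been faulted in, and, as Andre
points out above, it operates behind QEMU's back, which is the crux of
the disagreement about where the policy should live.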