Hi!

On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> Fundamentally, the entity that should be deciding what memory should be present
> and where it should located is the kernel. I'm fundamentally opposed to trying
> to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
>
> From what I can tell about ms_mbind(), it just uses process knowledge to bind
> specific areas of memory to a memsched group and let's the kernel decide what to
> do with that knowledge. This is exactly the type of interface that QEMU should
> be using.
>
> QEMU should tell the kernel enough information such that the kernel can make
> good decisions. QEMU should not be the one making the decisions.

True, QEMU won't have to decide where the memory and vcpus should be located (though it wouldn't need to decide that even with cpusets: you can use relative mbind with cpusets, and the admin or a cpuset job scheduler could decide), but it is still QEMU making the decision of which memory and which vcpu threads to ms_mbind/ms_tbind. Think about how you're going to create the input of those syscalls... If it weren't QEMU deciding that, QEMU wouldn't be required to scan the whole host physical NUMA topology (cpu/memory) in order to create the "input" arguments of ms_mbind/ms_tbind.

And when you migrate the VM to another host, the whole vtopology may become counter-productive, because the kernel isn't automatically detecting the NUMA affinity between threads, and the guest vtopology will stick to whatever _physical_ NUMA topology was seen on the first node where the VM was created. I doubt the assumption that all cloud nodes will have the same physical NUMA topology is a reasonable one.

Furthermore, for the guest to get the same benefits that QEMU gets on the host from ms_mbind/ms_tbind, every single guest application would have to be modified to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the hard bindings, which is what we are trying to avoid). I think it's unreasonable to expect all applications to use ms_mbind/ms_tbind in the guest; at best guest apps will use cpusets or wrappers, and few apps will be modified for sys_ms_tbind/mbind. You can always have the supercomputer case of a single optimized app in a single VM spanning the whole host, but in that scenario hard bindings would work perfectly too.

In my view the trouble with the NUMA hard bindings is not that they're hard and QEMU also has to decide the location (in fact it doesn't need to decide the location if you use cpusets and relative mbinds). The bigger problem is that either the admin or the app developer has to explicitly scan the physical NUMA topology (both cpus and memory) and tell the kernel how much memory to bind to each thread. ms_mbind/ms_tbind only partially solve that problem. They're similar to mbind MPOL_F_RELATIVE_NODES with cpusets, except you don't need an admin or a cpuset job scheduler (or a perl script) to redistribute the hardware resources. Now, dealing with bindings isn't a big deal for QEMU, and in fact this API is pretty much ideal for QEMU, but it won't make life substantially easier compared to hard bindings: the management work that is now done with a perl script simply moves into the kernel. It looks like an incremental improvement over relative mbind+cpusets, but I'm unsure it's the best we could aim for and what we really need in virt, considering we also deal with VM migration.
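Just to make the comparison concrete, this is roughly what the existing "relative mbind with cpusets" alternative looks like from the process side. Illustrative sketch only, not QEMU code: the 256M size and the single relative node are arbitrary example values, it needs a kernel >= 2.6.26 plus a numaif.h that defines MPOL_F_RELATIVE_NODES, and it links with -lnuma.

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256UL << 20;		/* 256M, arbitrary example size */
	unsigned long nodemask = 1UL << 0;	/* "my 0th allowed node" */
	void *ram;

	ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * MPOL_F_RELATIVE_NODES makes the nodemask relative to the
	 * cpuset this process runs in, so the code never has to scan
	 * the host physical NUMA topology: the admin (or a cpuset job
	 * scheduler) decides which physical node "relative node 0"
	 * really is.
	 */
	if (mbind(ram, len, MPOL_BIND | MPOL_F_RELATIVE_NODES,
		  &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	memset(ram, 0, len);	/* touch it so pages are allocated under the policy */
	return 0;
}

The point is only that the placement knowledge stays with whoever set up the cpuset, not with the process calling mbind.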
The real long term design to me is not to add more syscalls, but to initially handle the case of a process/VM spanning not more than one node in both thread count and amount of memory. That's not too hard, and in fact I already have scheduler benchmarks showing it to work pretty well (it creates a too-strict affinity, but it can be relaxed to be more useful). Then later add some mechanism (the simplest is page faults at low frequency) to create a guest_vcpu_thread<->host_memory affinity, plus a paravirtualized interface that tells the guest scheduler to group CPUs. If the guest scheduler runs free and is allowed to move threads randomly, without any paravirtualized interface that controls the CPU thread migration in the guest scheduler, the thread<->memory affinity on the host will be hopeless. But a paravirtualized interface that makes a guest thread stick to vcpu0/1/2/3 and not wander into vcpu4/5/6/7 will allow creating a more meaningful guest_thread<->physical_ram affinity on the host through KVM page faults. And then this will also work with VM migration, and without having to create a vtopology in the guest. And for apps running in the guest no paravirt will be needed, of course.

The reason paravirt would be needed for qemu-kvm with a fully automatic thread<->memory affinity is that the vcpu threads are magic. What runs in a vcpu thread are guest threads, and those can be moved by the guest CPU scheduler from vcpu0 to vcpu7. If that happens and we have 4 physical cpus per physical node, any affinity we measure on the host will be meaningless. Normal threads using NPTL won't behave like that. Maybe some other thread library could have a "scheduler" inside that would make it behave like a vcpu thread (one thread really with several threads inside), but those existed mostly to simulate multiple threads in a single thread, so they don't matter. And in this respect sys_tbind also requires the tid to have meaningful memory affinity. sys_tbind/mbind gets away with it by creating a vtopology in the guest, so the guest scheduler would then follow the vtopology (but the vtopology breaks across VM migration, and to really be followed well with sys_mbind/tbind it'd require all apps to be modified). Grouping guest threads to stick to some vcpus sounds immensely simpler than changing the whole guest vtopology at runtime, which would involve changing the memory layout too.

NOTE: the paravirt cpu grouping interface would also handle the case of 3 guests of 2.5G on an 8G host (4G per node). One of the three guests will have memory spanning two nodes (each 4G node can hold at most one whole 2.5G guest), and the guest vtopology created by sys_mbind/tbind can't handle that, while paravirt cpu grouping and automatic thread<->memory affinity on the host will handle it, just as it will handle VM migration across nodes with different physical topologies.

The problem is that to create a thread<->memory affinity we'll have to issue some page faults in KVM in the background. How harmful that is I don't know at this point. So the fully automatic thread<->memory affinity is a bit of a vapourware concept at this point (process<->memory affinity seems to work already, though). But Peter's migration code was driven by page faults already (not included in the patch he posted), and the other existing patch, called migrate-on-fault, also depended on page faults. So I am optimistic we could have thread<->memory affinity working too in the longer term. The plan would be to run the page faults at low frequency and only if we can't fit a process into one node (in terms of both number of threads and memory).
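To show what the grouping above amounts to mechanically (illustration only: the paravirt interface doesn't exist, and today this could only be done by hand inside the guest), sticking a guest thread to vcpu0/1/2/3 is, from the guest's point of view, just an affinity mask over the virtual cpus; a paravirt hook could set such a mask automatically instead of an admin hardcoding it:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

static int stick_to_vcpus_0_to_3(pid_t tid)
{
	cpu_set_t set;
	int vcpu;

	CPU_ZERO(&set);
	for (vcpu = 0; vcpu < 4; vcpu++)	/* vcpu0..vcpu3 only */
		CPU_SET(vcpu, &set);

	/* tid == 0 means "the calling thread" */
	return sched_setaffinity(tid, sizeof(set), &set);
}

int main(void)
{
	if (stick_to_vcpus_0_to_3(0)) {
		perror("sched_setaffinity");
		return 1;
	}
	/*
	 * From here on, the KVM page faults for this guest thread's
	 * memory only ever come from vcpu threads 0-3 on the host, so
	 * a host-side thread<->memory affinity measurement stays
	 * meaningful.
	 */
	return 0;
}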
If the process fits in one node, we wouldn't even need any page faults, and the information in the pagetables would be enough to make a good decision. The downside is that the thread<->memory affinity is significantly more difficult to implement, and that's why I'm focusing initially on the simpler case of considering only the process<->memory affinity. That's fairly easy.

So for the time being this incremental improvement may be justified; it moves the logic from a perl script into the kernel, but I'm just skeptical it provides a big advantage compared to the NUMA bindings we already have in the kernel, especially if in the long term we can get rid of a vtopology completely. The vtopology in the guest may seem appealing: it solves the problem when you use bindings everywhere (be they hard bindings, cpuset relative bindings, or the dynamic sys_mbind/tbind). But there is not much hope of altering the vtopology at runtime, so when a guest must be split across two nodes (3 VMs of 2.5G ram running on an 8G host with two 4G nodes) or migrated across different cloud nodes, I think the vtopology is trouble and is best avoided. The memory side of the vtopology is absolute trouble if it doesn't match the host physical topology exactly.
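As an aside on the "information in the pagetables" point: a rough userspace approximation of a process's per-node memory footprint is already visible through /proc/<pid>/numa_maps. Sketch only; the in-kernel process<->memory affinity logic described above would read the page tables directly rather than this file, and this just sums the N<node>=<pages> fields:

#include <stdio.h>
#include <string.h>

#define MAX_NODES 64

int main(int argc, char **argv)
{
	const char *pid = argc > 1 ? argv[1] : "self";
	char path[64], line[4096];
	long pages[MAX_NODES] = { 0 };
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/numa_maps", pid);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}

	/* each mapping line carries N<node>=<pages> fields; sum them up */
	while (fgets(line, sizeof(line), f)) {
		char *tok = strtok(line, " \n");
		while (tok) {
			int node;
			long n;

			if (sscanf(tok, "N%d=%ld", &node, &n) == 2 &&
			    node >= 0 && node < MAX_NODES)
				pages[node] += n;
			tok = strtok(NULL, " \n");
		}
	}
	fclose(f);

	for (int node = 0; node < MAX_NODES; node++)
		if (pages[node])
			printf("node %d: %ld pages\n", node, pages[node]);
	return 0;
}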