Re: [PATCH 4/4] NUMA: realize NUMA memory pinning

On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> >    
> >> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> >>      
> >>> Anthony Liguori wrote:
> >>>        
> >>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>>>          
> >>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >>>>>            
> >>>>>> According to the user-provided assignment bind the respective part
> >>>>>> of the guest's memory to the given host node. This uses Linux'
> >>>>>> mbind syscall (which is wrapped only in libnuma) to realize the
> >>>>>> pinning right after the allocation.
> >>>>>> Failures are not fatal, but produce a warning.
> >>>>>>
> >>>>>> Signed-off-by: Andre Przywara<andre.przywara@xxxxxxx>
> >>>>>> ...
> >>>>>>              
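For reference, a minimal sketch of the idea the patch describes: pin a
freshly allocated chunk of guest RAM to one host node with mbind(2),
warning rather than failing hard.  The node number and size here are
made up for illustration; this is not the patch itself.  Build with
-lnuma to get the mbind() wrapper from <numaif.h>.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>        /* mbind(), MPOL_BIND (wrapped by libnuma) */

static void pin_to_host_node(void *ram, size_t len, int node)
{
    unsigned long nodemask = 1UL << node;

    /* Future faults in [ram, ram+len) must be satisfied from 'node'. */
    if (mbind(ram, len, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        /* Not fatal, just a warning, as the patch description says. */
        fprintf(stderr, "warning: mbind to node %d failed: %s\n",
                node, strerror(errno));
    }
}

int main(void)
{
    size_t len = 256UL << 20;   /* pretend this is one guest node's RAM */
    void *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (ram == MAP_FAILED)
        return 1;
    pin_to_host_node(ram, len, 0);
    return 0;
}
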
> >>>>> Why is it not possible (or perhaps not desired) to change the binding
> >>>>> after the guest is started?
> >>>>>
> >>>>> Sounds unflexible.
> >>>>>            
> >>> The solution is to introduce a monitor interface to adjust the pinning
> >>> later, allowing both a change of affinity only (which affects only
> >>> future fault-ins) and an actual copy of the memory (more costly).
> >>>        
> >> This is just duplicating numactl.
> >>
> >>      
> >>> Actually this is the next item on my list, but I wanted to bring up
> >>> the basics first to avoid recoding parts afterwards. Also I am not
> >>> (yet) familiar with the QMP protocol.
> >>>        
> >>>> We really need a solution that lets a user use a tool like numactl
> >>>> outside of the QEMU instance.
> >>>>          
> >>> I fear that is not how the Linux NUMA API is meant to work. In
> >>> contrast to the VCPU threads, which are externally visible entities
> >>> (PIDs), the memory should be private to the QEMU process. While you
> >>> can change the NUMA allocation policy of the _whole_ process, there is
> >>> no way to externally distinguish parts of the process' memory.
> >>> Although you could later (and externally) migrate already-faulted
> >>> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you
> >>> would let an external tool interfere with QEMU's internal memory
> >>> management. Take for instance a change of the allocation policy
> >>> regarding the 1MB and 3.5-4GB holes: an external tool would have to
> >>> track such changes, or you simply could not change such things
> >>> in QEMU.
> >>>        
> >> It's extremely likely that if you're doing NUMA pinning, you're also
> >> doing large pages via hugetlbfs.  numactl can already set policies for
> >> files in hugetlbfs so all you need to do is have a separate hugetlbfs
> >> file for each numa node.
> >>      
> > Why would we resort to hugetlbfs when we have transparent hugepages?
> >    
> 
> If you care about NUMA pinning, I can't believe you don't want 
> guaranteed large page allocation which THP does not provide.

I personally want a more automatic approach to placing VMs on NUMA nodes
(not directed by the qemu process itself), but I'd also like to support
a user's desire to pin and place CPUs and memory, especially for large
VMs that need to be defined as multi-node.  For user-defined pinning,
libhugetlbfs will probably be fine, but for most VMs I'd like to ensure
we can do things like ballooning well, and I am not so sure that will be
easy with libhugetlbfs.

> The general point though is that we should find a way to partition 
> memory in qemu such that an external process can control the actual NUMA 
> placement.  This gives us maximum flexibility.
> 
> Otherwise, what do we implement in QEMU?  Direct pinning of memory to 
> nodes?  Can we migrate memory between nodes?  Should we support 
> interleaving memory between two virtual nodes?  Why pick and choose when 
> we can have it all.

If there were a way to do this that did not depend on hugetlbfs, I don't
think I would shy away from it.  Is there another way to change NUMA
policies on individual mappings from a user tool?  We can already inspect
them with /proc/<pid>/numa_maps.  Is this something that could be added
to numactl?
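
For what it's worth, the closest existing external primitive seems to be
the one Andre mentioned above: an outside tool can migrate pages another
process has already faulted in, using move_pages(2) on that pid.  A rough
sketch below; the pid, address range and destination node are made-up
parameters such a tool would normally dig out of /proc/<pid>/numa_maps,
and acting on another process needs appropriate privileges (e.g.
CAP_SYS_NICE).  Note that this only moves pages that already exist; it
does not change the policy governing future faults, which is exactly the
gap being discussed.  Link with -lnuma.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numaif.h>          /* move_pages(), MPOL_MF_MOVE */

static int migrate_range(pid_t pid, unsigned long start,
                         unsigned long len, int dest_node)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned long count = len / page, i;
    void **pages  = calloc(count, sizeof(*pages));
    int   *nodes  = calloc(count, sizeof(*nodes));
    int   *status = calloc(count, sizeof(*status));
    long ret;

    for (i = 0; i < count; i++) {
        pages[i] = (void *)(start + i * page);  /* a VA inside 'pid' */
        nodes[i] = dest_node;                   /* where it should live */
    }

    /* Ask the kernel to migrate the already-faulted pages of 'pid'. */
    ret = move_pages(pid, count, pages, nodes, status, MPOL_MF_MOVE);
    if (ret < 0)
        perror("move_pages");

    free(pages);
    free(nodes);
    free(status);
    return ret < 0 ? -1 : 0;
}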

> 
> > FWIW, large apps like databases have set a precedent for managing their
> > own NUMA policies.
> 
> Of course because they know what their NUMA policy should be.  They live 
> in a simple world where they assume they're the only application in the 
> system, they read the distance tables, figure they'll use XX% of all 
> physical memory, and then pin how they see fit.
> 
> But an individual QEMU process lives in a complex world.  It's almost 
> never the only thing on the system and it's only allowed to use a subset 
> of resources.  It doesn't know which resources it can and can't use, 
> and that often changes.  The topology chosen for a guest is static, 
> but its host topology may be dynamic due to things like live migration.

True, which is why this would require monitor support for changing it later.

> In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.  
> Instead, it needs to let something with a larger view of the system 
> determine a NUMA policy that makes sense overall.

I agree.

> There are two ways we can do this.  We can implement monitor commands 
> that attempt to expose every single NUMA tunable possible.  Or, we can 
> tie into the existing commands which guarantee that we support every 
> possible tunable and that as NUMA support in Linux evolves, we get all 
> the new features for free.

Assuming there is nothing new qemu would need to expose in order to work
with whatever new features numactl/libnuma gain.  But perhaps that is a
lot less likely.

> And, since numactl already supports setting policies on files in 
> hugetlbfs, all we need is a simple change to qemu to allow -mem-path to 
> work per-node instead of globally.  And it's useful to implement other 
> types of things like having one node be guaranteed large pages and 
> another node THP or some other fanciness.
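
(For concreteness, here is roughly what that per-node split boils down to
on the qemu side: one hugetlbfs-backed file per guest node, each mmap'd
separately, so that an external tool able to set a policy on a hugetlbfs
file can place each guest node independently.  The mount point, file
names and sizes below are made up for illustration; this is a sketch,
not actual qemu code.  As Anthony notes, numactl can already set a
policy on such files, so each one could be bound to a different host
node from outside.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

static void *alloc_guest_node(const char *path, size_t len)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);   /* file on hugetlbfs */
    void *ram;

    if (fd < 0) {
        perror(path);
        return NULL;
    }
    if (ftruncate(fd, len) < 0) {
        perror("ftruncate");
        close(fd);
        return NULL;
    }
    /* MAP_SHARED: the guest RAM is the hugetlbfs file's own pages. */
    ram = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return ram == MAP_FAILED ? NULL : ram;
}

int main(void)
{
    /* e.g. a 2-node guest, 1 GB per node, hugetlbfs mounted on /hugepages */
    void *node0 = alloc_guest_node("/hugepages/guest-node0", 1UL << 30);
    void *node1 = alloc_guest_node("/hugepages/guest-node1", 1UL << 30);

    return (node0 && node1) ? 0 : 1;
}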

If it were not dependent on hugetlbfs, then I don't think I would have
an issue.

> Sounds awfully appealing to me.
> 
> >    I don't see why qemu should be any different.
> > Numactl is great for small apps that need to be pinned to one node, or
> > spread evenly across all nodes.  Having to get hugetlbfs involved just
> > to work around a shortcoming of numactl seems like a bad idea.
> >    
> 
> You seem to be asserting that we should implement a full NUMA policy in 
> QEMU.  What should it be when we don't (in QEMU) know what else is 
> running on the system?

I don't think the qemu process itself should decide where to "be" on the
system.  I would like to have -something- else make those decisions,
either a user or some management daemon that looks at the whole picture.
Or <gulp> get the scheduler involved (with new algorithms).

I am still quite curious whether numactl/libnuma could be extended to set
policies on individual mappings.  Then we would not even need to have
multiple -mem-path's.

-Andrew

> 
> Regards,
> 
> Anthony Liguori
> 



