Andrew Theurer <habanero <at> linux.vnet.ibm.com> writes:

> On 05/09/2012 08:46 AM, Avi Kivity wrote:
> > On 05/09/2012 04:05 PM, Chegu Vinod wrote:
> >> Hello,
> >>
> >> On an 8 socket Westmere host I am attempting to run a single guest
> >> and characterize the virtualization overhead for a system-intensive
> >> workload (AIM7-high_systime) as the size of the guest scales
> >> (10way/64G, 20way/128G, ... 80way/512G).
> >>
> >> To do some comparisons between the native vs. guest runs, I have
> >> been using "numactl" to control the cpu node & memory node bindings
> >> for the qemu instance. For larger guest sizes I end up binding
> >> across multiple localities, e.g. for a 40-way guest:
> >>
> >> numactl --cpunodebind=0,1,2,3 --membind=0,1,2,3 \
> >>     qemu-system-x86_64 -smp 40 -m 262144 \
> >>     <....>
> >>
> >> I understand that the actual mappings from a guest virtual address
> >> to a host physical address could change.
> >>
> >> Is there a way to determine [at a given instant] which host NUMA
> >> node is providing the backing physical memory for the active
> >> guest's kernel and also for the apps actively running in the guest?
> >>
> >> Guessing that there is a better way (some tool available?) than
> >> just diff'ing the per-node memory usage from the before and after
> >> output of "numactl --hardware" on the host.
> >
> > Not sure if that's what you want, but there's
> > Documentation/vm/pagemap.txt.
>
> You can look at /proc/<pid>/numa_maps and see all the mappings for
> the qemu process. There should be one really large mapping for the
> guest memory, and in that line the number of dirty pages is listed,
> potentially for each NUMA node. This will tell you how much came from
> each node, but not specifically "which page is mapped where".

Thanks. I will look at this in more detail (a quick sketch of checking
numa_maps is further below).

> Keep in mind that with the current numactl you are using, you will
> likely not get the benefits of the NUMA enhancements found in the
> Linux kernel, in either the guest or the host. There are a couple of
> reasons: (1) your guest does not have a NUMA topology defined (based
> on what I see from the qemu command above), so it will not do
> anything special based on the host topology. Also, things that are
> broken down per NUMA node, like some spin-locks and sched-domains,
> are now system-wide/flat. This is a big deal for the scheduler and
> for other things like kmem allocation. With a single 80-way VM with
> no NUMA, you will likely have massive spin-lock contention on some
> workloads.

We had seen evidence of increased lock contention (via lockstat etc.)
as the guest size increased.

[On a related note: given the nature of the system-intensive workload,
the combination of the ticket-based locks in the guest OS and the PLE
handling code in the host kernel was not helping, so I temporarily
worked around this. Hope to try out the PV lock changes soon.]

Regarding the -numa option: I had tried it earlier (about a month
ago). The layout I specified didn't match the layout the guest saw.
Haven't yet looked into the exact reason, but came to know that there
was already an open issue:

https://bugzilla.redhat.com/show_bug.cgi?id=816804

I also remember noticing a warning message when the -numa option was
used for a guest with more than 64 VCPUs (in my case, 80 VCPUs). Will
be looking at the code soon to see if there is any limitation.

I have been using [more or less the upstream version of] qemu directly
to start the guest.
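For reference, the -numa layout I had attempted looked roughly like
the following. This is only a sketch from memory (one guest node per
host node, mem= values assumed to be in MB), not the exact command
line I ran:

    qemu-system-x86_64 -smp 80 -m 524288 \
        -numa node,cpus=0-9,mem=65536 \
        -numa node,cpus=10-19,mem=65536 \
        ...
        -numa node,cpus=70-79,mem=65536 \
        <....>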
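And to act on the numa_maps suggestion above, something like the
following should show the per-host-node breakdown of the large
guest-RAM mapping (illustrative; assumes a single qemu process on the
host):

    pid=$(pgrep -f qemu-system-x86_64)
    grep anon= /proc/$pid/numa_maps

The line with the largest anon= count should be the guest RAM; the
N<node>=<pages> fields on that line give the number of pages backed
by each host NUMA node.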
For guest sizes 10way, 20way, 40way and 60way I had been using numactl
just to control the NUMA nodes where the guest ends up running. After
the guest booted up, I used to set the affinity of the VCPUs (to
specific cores on the host) via "taskset". This was a touch painful
compared to virsh vcpupin (a sketch of the commands is in the P.S.
below). For the 80way guest (on an 80way host) I don't use numactl.

Noticed that doing a "taskset" to pin the VCPUs didn't always give
better performance. Perhaps this is due to the absence of a NUMA
layout in the guest.

> (2) Once the VM does have a NUMA topology (via qemu -numa), one
> still cannot manually set a mempolicy for the portion of the VM's
> memory that represents each NUMA node in the VM (or have this done
> automatically with something like autoNUMA). Therefore, it's
> difficult to forcefully map each of the VM's nodes' memory to the
> corresponding host node.
>
> There are some things you can do to mitigate some of this.
> Definitely define the VM to match the NUMA topology found on the
> host.

The native/host platform has multiple levels of NUMA:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  14  23  23  27  27  27  27
  1:  14  10  23  23  27  27  27  27
  2:  23  23  10  14  27  27  27  27
  3:  23  23  14  10  27  27  27  27
  4:  27  27  27  27  10  14  23  23
  5:  27  27  27  27  14  10  23  23
  6:  27  27  27  27  23  23  10  14
  7:  27  27  27  27  23  23  14  10

qemu's -numa option seems to allow only one flat level (i.e. you can
specify multiple nodes, but not the multi-level distances between them
as shown above). Am I missing something?

> [That will] at least allow good scaling wrt locks and scheduler in
> the guest. As for getting memory placement close (a page in VM node
> x actually resides in host node x), you have to rely on vcpu pinning
> + guest NUMA topology, combined with the default mempolicy in the
> guest and host.

I did recompile both the kernels with the SLUB allocator enabled...

> As pages are faulted in the guest, the hope is that the vcpu which
> did the faulting is running in the right node (guest and host), its
> guest OS mempolicy ensures the page is allocated in the guest-local
> node, and that allocation causes a fault in qemu, which is -also-
> running on the -host- node X. The vcpu pinning is critical to get
> qemu to fault that memory to the correct node.

In the absence of a NUMA layout in the guest it doesn't look like the
pinning helped... but I think I understand what you are saying.
Thanks!

> Make sure you do not use numactl for any of this. I would suggest
> using libvirt and defining the vcpu pinning and the NUMA topology in
> the XML.

I will try this in the coming days (waiting to get back on the
system :)). A sketch of what I understand the XML to look like is in
the P.S. below.

Thanks for the detailed response!

Vinod

> -Andrew Theurer
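P.S. For anyone curious, the manual pinning I mentioned above went
roughly like this. The vcpu thread IDs come from "info cpus" in the
qemu monitor; the values here are purely illustrative:

    (qemu) info cpus
    * CPU #0: pc=0x... thread_id=12345
      CPU #1: pc=0x... thread_id=12346
      ...

    # on the host: pin vcpu0 to host cpu 0, vcpu1 to cpu 1, and so on
    taskset -pc 0 12345
    taskset -pc 1 12346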
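And here is roughly what I understand the libvirt XML for the vcpu
pinning plus guest NUMA topology to look like. This is a sketch based
on my reading of the libvirt docs, not something I have tested yet;
the cpuset and memory values (in KiB) are illustrative:

    <vcpu>80</vcpu>
    <cputune>
      <vcpupin vcpu='0' cpuset='0'/>
      <vcpupin vcpu='1' cpuset='1'/>
      <!-- ... one vcpupin entry per vcpu ... -->
    </cputune>
    <cpu>
      <numa>
        <cell cpus='0-9' memory='67108864'/>
        <cell cpus='10-19' memory='67108864'/>
        <!-- ... one cell per guest NUMA node ... -->
      </numa>
    </cpu>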