Andrew Theurer <habanero <at> linux.vnet.ibm.com> writes:

> On 05/09/2012 08:46 AM, Avi Kivity wrote:
> > On 05/09/2012 04:05 PM, Chegu Vinod wrote:
> >> Hello,
> >>
> >> On an 8 socket Westmere host I am attempting to run a single guest
> >> and characterize the virtualization overhead for a system-intensive
> >> workload (AIM7-high_systime) as the size of the guest scales
> >> (10way/64G, 20way/128G, ... 80way/512G).
> >>
> >> To do some comparisons between the native vs. guest runs, I have
> >> been using "numactl" to control the cpu node & memory node bindings
> >> for the qemu instance. For larger guest sizes I end up binding
> >> across multiple localities, e.g. for a 40-way guest:
> >>
> >> numactl --cpunodebind=0,1,2,3 --membind=0,1,2,3 \
> >>     qemu-system-x86_64 -smp 40 -m 262144 \
> >>     <....>
> >>
> >> I understand that the actual mappings from a guest virtual address
> >> to a host physical address could change.
> >>
> >> Is there a way to determine [at a given instant] which host NUMA
> >> node is providing the backing physical memory for the active
> >> guest's kernel and also for the apps actively running in the guest?
> >>
> >> Guessing that there is a better way (some tool available?) than
> >> just diff'ing the per-node memory usage from the before and after
> >> output of "numactl --hardware" on the host.
> >
> > Not sure if that's what you want, but there's
> > Documentation/vm/pagemap.txt.
>
> You can look at /proc/<pid>/numa_maps and see all the mappings for
> the qemu process. There should be one really large mapping for the
> guest memory, and in that line the number of dirty pages is listed,
> potentially for each NUMA node. This will tell you how much came from
> each node, but not specifically "which page is mapped where".

Thanks. I will look at this in more detail (a quick sketch of checking
numa_maps is further below).

> Keep in mind that with the current numactl you are using, you will
> likely not get the benefits of the NUMA enhancements found in the
> Linux kernel, in either the guest or the host. There are a couple of
> reasons: (1) your guest does not have a NUMA topology defined (based
> on what I see from the qemu command above), so it will not do
> anything special based on the host topology. Also, things that are
> broken down per NUMA node, like some spin-locks and sched-domains,
> are now system-wide/flat. This is a big deal for the scheduler and
> for other things like kmem allocation. With a single 80-way VM with
> no NUMA, you will likely have massive spin-lock contention on some
> workloads.

We had seen evidence of increased lock contention (via lockstat etc.)
as the guest size increased.

[On a related note: given the nature of the system-intensive workload,
the combination of the ticket-based locks in the guest OS and the PLE
handling code in the host kernel was not helping, so I temporarily
worked around this. Hope to try out the PV lock changes soon.]

Regarding the -numa option: I had tried it earlier (about a month
ago). The layout I specified didn't match the layout the guest saw.
Haven't yet looked into the exact reason, but came to know that there
was already an open issue:

https://bugzilla.redhat.com/show_bug.cgi?id=816804

I also remember noticing a warning message when the -numa option was
used for a guest with more than 64 VCPUs (in my case, 80 VCPUs). Will
be looking at the code soon to see if there is any limitation.

I have been using [more or less the upstream version of] qemu directly
to start the guest.
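For reference, the -numa layout I had attempted looked roughly like
the following. This is only a sketch from memory (one guest node per
host node, mem= values assumed to be in MB), not the exact command
line I ran:

    qemu-system-x86_64 -smp 80 -m 524288 \
        -numa node,cpus=0-9,mem=65536 \
        -numa node,cpus=10-19,mem=65536 \
        ...
        -numa node,cpus=70-79,mem=65536 \
        <....>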
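And to act on the numa_maps suggestion above, something like the
following should show the per-host-node breakdown of the large
guest-RAM mapping (illustrative; assumes a single qemu process on the
host):

    pid=$(pgrep -f qemu-system-x86_64)
    grep anon= /proc/$pid/numa_maps

The line with the largest anon= count should be the guest RAM; the
N<node>=<pages> fields on that line give the number of pages backed
by each host NUMA node.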
For guest sizes 10way, 20way, 40way and 60way I had been using numactl
just to control the NUMA nodes where the guest ends up running. After
the guest booted up, I used to set the affinity of the VCPUs (to
specific cores on the host) via "taskset". This was a touch painful
compared to virsh vcpupin (a sketch of the commands is in the P.S.
below). For the 80way guest (on an 80way host) I don't use numactl.

Noticed that doing a "taskset" to pin the VCPUs didn't always give
better performance. Perhaps this is due to the absence of a NUMA
layout in the guest.

> (2) Once the VM does have a NUMA topology (via qemu -numa), one
> still cannot manually set a mempolicy for the portion of the VM's
> memory that represents each NUMA node in the VM (or have this done
> automatically with something like autoNUMA). Therefore, it's
> difficult to forcefully map each of the VM's nodes' memory to the
> corresponding host node.
>
> There are some things you can do to mitigate some of this.
> Definitely define the VM to match the NUMA topology found on the
> host.

The native/host platform has multiple levels of NUMA:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  14  23  23  27  27  27  27
  1:  14  10  23  23  27  27  27  27
  2:  23  23  10  14  27  27  27  27
  3:  23  23  14  10  27  27  27  27
  4:  27  27  27  27  10  14  23  23
  5:  27  27  27  27  14  10  23  23
  6:  27  27  27  27  23  23  10  14
  7:  27  27  27  27  23  23  14  10

qemu's -numa option seems to allow only one flat level (i.e. you can
specify multiple nodes, but not the multi-level distances between them
as shown above). Am I missing something?

> [That will] at least allow good scaling wrt locks and scheduler in
> the guest. As for getting memory placement close (a page in VM node
> x actually resides in host node x), you have to rely on vcpu pinning
> + guest NUMA topology, combined with the default mempolicy in the
> guest and host.

I did recompile both the kernels with the SLUB allocator enabled...

> As pages are faulted in the guest, the hope is that the vcpu which
> did the faulting is running in the right node (guest and host), its
> guest OS mempolicy ensures the page is allocated in the guest-local
> node, and that allocation causes a fault in qemu, which is -also-
> running on the -host- node X. The vcpu pinning is critical to get
> qemu to fault that memory to the correct node.

In the absence of a NUMA layout in the guest it doesn't look like the
pinning helped... but I think I understand what you are saying.
Thanks!

> Make sure you do not use numactl for any of this. I would suggest
> using libvirt and defining the vcpu pinning and the NUMA topology in
> the XML.

I will try this in the coming days (waiting to get back on the
system :)). A sketch of what I understand the XML to look like is in
the P.S. below.

Thanks for the detailed response!

Vinod

> -Andrew Theurer
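P.S. For anyone curious, the manual pinning I mentioned above went
roughly like this. The vcpu thread IDs come from "info cpus" in the
qemu monitor; the values here are purely illustrative:

    (qemu) info cpus
    * CPU #0: pc=0x... thread_id=12345
      CPU #1: pc=0x... thread_id=12346
      ...

    # on the host: pin vcpu0 to host cpu 0, vcpu1 to cpu 1, and so on
    taskset -pc 0 12345
    taskset -pc 1 12346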
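And here is roughly what I understand the libvirt XML for the vcpu
pinning plus guest NUMA topology to look like. This is a sketch based
on my reading of the libvirt docs, not something I have tested yet;
the cpuset and memory values (in KiB) are illustrative:

    <vcpu>80</vcpu>
    <cputune>
      <vcpupin vcpu='0' cpuset='0'/>
      <vcpupin vcpu='1' cpuset='1'/>
      <!-- ... one vcpupin entry per vcpu ... -->
    </cputune>
    <cpu>
      <numa>
        <cell cpus='0-9' memory='67108864'/>
        <cell cpus='10-19' memory='67108864'/>
        <!-- ... one cell per guest NUMA node ... -->
      </numa>
    </cpu>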