Re: KVM PMU virtualization

On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> * Joerg Roedel <joro@xxxxxxxxxx> wrote:
> 
> > On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> > > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > > > 
> > > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > > >    host.
> > > > 
> > > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > > > configured to count only when in guest mode. Perf needs to be aware of
> > > > that and fetch the rip from a different place when monitoring a guest.
> > 
> > > The idea is we want to measure both host and guest at the same time, and
> > > compare all the hot functions fairly.
> > 
> > So you want to measure while the guest vcpu is running and the vmexit
> > path of that vcpu (including qemu userspace part) together? The
> > challenge here is to find out if a performance event originated in guest
> > mode or in host mode.
> > But we can check for that in the nmi-protected part of the vmexit path.
> 
> As far as instrumentation goes, virtualization is simply another 'PID 
> dimension' of measurement.
> 
> Today we can isolate system performance measurements/events to the following 
> domains:
> 
>  - per system
>  - per cpu
>  - per task
> 
> ( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user' 
>   domain separation, and we have some ABI details for all that but it's by no 
>   means complete. Anton is using the PowerPC bits AFAIK, so it already works 
>   to a certain degree. )
> 
> When extending measurements to KVM, we want two things:
> 
>  - user friendliness: instead of having to check 'ps' and figure out which 
>    Qemu thread is the KVM thread we want to profile, just give a convenience
>    namespace to access guest profiling info. -G ought to map to the first
>    currently running KVM guest it can find. (which would match like 90% of the
>    cases) - etc. No ifs and when. If 'perf kvm top' doesnt show something 
>    useful by default the whole effort is for naught.
> 
>  - Extend core facilities and enable the following measurement dimensions:
> 
>      host-kernel-space
>      host-user-space
>      guest-kernel-space
>      guest-user-space
> 
>    on a per guest basis. We want to be able to measure just what the guest 
>    does, and we want to be able to measure just what the host does.
> 
>    Some of this the hardware helps us with (say only measuring host kernel 
>    events is possible), some has to be done by fiddling with event 
>    enable/disable at vm-exit / vm-entry time.
> 
> My suggestion, as always, would be to start very simple and very minimal:
> 
> Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image 
> both as a host and as guest (for testing), to not have to deal with the symbol 
> space transport problem initially. Enable 'perf kvm record' to only record 
> guest events by default. Etc.
> 
> This alone will be a quite useful result already - and gives a basis for 
> further work. No need to spend months to do the big grand design straight 
> away, all of this can be done gradually and in the order of usefulness - and 
> you'll always have something that actually works (and helps your other KVM 
> projects) along the way.
It took me a couple of hours to read the emails on this topic.
Based on the above idea, I worked out a prototype. It is ugly, but it works
with top/record when the guest and the host use the same kernel image
and most of the needed modules are compiled directly into the kernel.

The commands are:
perf kvm top
perf kvm record
perf kvm report

They just collect guest kernel hot functions.
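
The "kvm" word is not a separate subcommand table. The perf.c hunk in the patch
strips a leading "kvm" argument and sets a global flag before the normal command
lookup runs. A minimal standalone sketch of that step, assuming a hypothetical
helper name (strip_kvm_prefix does not exist in the patch, which does this
inline in handle_internal_command):

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * If the first word is "kvm" and another word follows it, drop "kvm"
 * from the argument vector and return 1 (guest sampling requested).
 * Otherwise leave the arguments untouched and return 0.
 */
int strip_kvm_prefix(int *argc, const char ***argv)
{
	assert(argc && argv && *argv);

	if (*argc > 1 && !strcmp((*argv)[0], "kvm")) {
		(*argv)++;	/* "top", "record", ... becomes argv[0] */
		(*argc)--;
		return 1;
	}
	return 0;
}
```

So "perf kvm top" ends up running the ordinary "top" command with sample_kvm
set, which is why top and record only need the extra PERF_SAMPLE_KVM bit.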

> 
> [ And, as so often, once you walk that path, that grand scheme you are 
>   thinking about right now might easily become last year's really bad idea ;-) ]
> 
> So please start walking the path and experience the challenges first-hand.
With my patch, I collected dbench data on a Nehalem machine (2*4*2 logical CPUs).
1) Vanilla host kernel (6G memory):
------------------------------------------------------------------------------------------------------------------------
   PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)
------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ _______________________________ ________________________________________

            99376.00 40.5% ext3_test_allocatable           /lib/modules/2.6.33-kvmymz/build/vmlinux
            41239.00 16.8% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
             7019.00  2.9% __ticket_spin_lock              /lib/modules/2.6.33-kvmymz/build/vmlinux
             5350.00  2.2% copy_user_generic_string        /lib/modules/2.6.33-kvmymz/build/vmlinux
             5208.00  2.1% do_get_write_access             /lib/modules/2.6.33-kvmymz/build/vmlinux
             4484.00  1.8% journal_dirty_metadata          /lib/modules/2.6.33-kvmymz/build/vmlinux
             4078.00  1.7% ext3_free_blocks_sb             /lib/modules/2.6.33-kvmymz/build/vmlinux
             3856.00  1.6% ext3_new_blocks                 /lib/modules/2.6.33-kvmymz/build/vmlinux
             3485.00  1.4% journal_get_undo_access         /lib/modules/2.6.33-kvmymz/build/vmlinux
             2803.00  1.1% ext3_try_to_allocate            /lib/modules/2.6.33-kvmymz/build/vmlinux
             2241.00  0.9% __find_get_block                /lib/modules/2.6.33-kvmymz/build/vmlinux
             1957.00  0.8% find_revoke_record              /lib/modules/2.6.33-kvmymz/build/vmlinux

2) Guest OS: one guest started with 4GB memory.
------------------------------------------------------------------------------------------------------------------------
   PerfTop:     827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)
------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ _______________________________ ________________________________________

            41701.00 28.1% __ticket_spin_lock              /lib/modules/2.6.33-kvmymz/build/vmlinux
            33843.00 22.8% ext3_test_allocatable           /lib/modules/2.6.33-kvmymz/build/vmlinux
            16862.00 11.4% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
             3278.00  2.2% native_flush_tlb_others         /lib/modules/2.6.33-kvmymz/build/vmlinux
             3200.00  2.2% copy_user_generic_string        /lib/modules/2.6.33-kvmymz/build/vmlinux
             3009.00  2.0% do_get_write_access             /lib/modules/2.6.33-kvmymz/build/vmlinux
             2834.00  1.9% journal_dirty_metadata          /lib/modules/2.6.33-kvmymz/build/vmlinux
             1965.00  1.3% journal_get_undo_access         /lib/modules/2.6.33-kvmymz/build/vmlinux
             1907.00  1.3% ext3_new_blocks                 /lib/modules/2.6.33-kvmymz/build/vmlinux
             1790.00  1.2% ext3_free_blocks_sb             /lib/modules/2.6.33-kvmymz/build/vmlinux
             1741.00  1.2% find_revoke_record              /lib/modules/2.6.33-kvmymz/build/vmlinux


With the vanilla host kernel, the perf top data is stable and spinlocks don't take much CPU time.
In the guest OS, __ticket_spin_lock consumes 28% of CPU time, and it sometimes fluctuates between 9% and 28%.

Another interesting finding is with aim7. If I start the aim7 tmpfs test in a guest OS with 1GB memory,
the login hangs and the CPU is busy. With the new patch, I could check what happens in the guest OS:
the spinlock is busy and the kernel is mostly shrinking memory from slab.
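
The mechanism in the patch below is small: on a vmexit caused by an NMI, KVM
saves the guest RIP and privilege level, the NMI handler then runs, and perf
uses the saved guest values instead of the host pt_regs when the event asked
for PERF_SAMPLE_KVM. A sample whose chosen ip is zero is dropped. The following
is a rough userspace sketch of that logic, not kernel code. The names are
simplified stand-ins, a single global replaces the per-cpu variable, and the
user-mode test follows only the 64-bit branch of the vmx.c hunk:

```c
#include <assert.h>
#include <stdint.h>

#define SAMPLE_IP	(1u << 0)	/* stands in for PERF_SAMPLE_IP */
#define SAMPLE_KVM	(1u << 11)	/* stands in for the new PERF_SAMPLE_KVM */

struct virt_ip_info {
	int		user_mode;
	uint64_t	ip;
};

/* Per-cpu in the kernel; one global is enough for this sketch. */
struct virt_ip_info virt_ip;

/* The guest is in user mode when the RPL bits of its CS selector are non-zero. */
int guest_user_mode(uint16_t cs_selector)
{
	return (cs_selector & 3) != 0;
}

/* Called on the vmexit path just before the NMI is re-injected on the host. */
void save_virt_ip(int user_mode, uint64_t ip)
{
	virt_ip.user_mode = user_mode;
	virt_ip.ip = ip;
}

/* Called right after the NMI handler returns. */
void reset_virt_ip(void)
{
	virt_ip.user_mode = 0;
	virt_ip.ip = 0;
}

/*
 * Pick the ip for a sample: the saved guest ip for a guest event,
 * the host ip otherwise. Returns -1 when the chosen ip is zero,
 * e.g. a guest event whose NMI did not interrupt guest mode,
 * so the caller drops the sample.
 */
int pick_sample_ip(uint32_t sample_type, uint64_t host_ip, uint64_t *ip)
{
	assert(ip);
	*ip = (sample_type & SAMPLE_KVM) ? virt_ip.ip : host_ip;
	return *ip ? 0 : -1;
}
```

This is also why "perf kvm top" reports far fewer irqs/sec than the host run:
only the NMIs that interrupted guest mode produce samples.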



--- linux-2.6.33/arch/x86/kernel/cpu/perf_event.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kernel/cpu/perf_event.c	2010-03-01 15:57:51.672990615 +0800
@@ -1621,6 +1621,7 @@ static void intel_pmu_drain_bts_buffer(s
 	struct perf_event_header header;
 	struct perf_sample_data data;
 	struct pt_regs regs;
+	int ret;
 
 	if (!event)
 		return;
@@ -1647,7 +1648,9 @@ static void intel_pmu_drain_bts_buffer(s
 	 * We will overwrite the from and to address before we output
 	 * the sample.
 	 */
-	perf_prepare_sample(&header, &data, event, &regs);
+	ret = perf_prepare_sample(&header, &data, event, &regs);
+	if (ret)
+		return;
 
 	if (perf_output_begin(&handle, event,
 			      header.size * (top - at), 1, 1))
--- linux-2.6.33/arch/x86/kvm/vmx.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c	2010-03-02 10:21:57.588586179 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
 
@@ -3553,8 +3554,19 @@ static void vmx_complete_interrupts(stru
 
 	/* We need to handle NMIs before interrupts are enabled */
 	if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
-	    (exit_intr_info & INTR_INFO_VALID_MASK))
+	    (exit_intr_info & INTR_INFO_VALID_MASK)) {
+		u64 rip = vmcs_readl(GUEST_RIP);
+		int user_mode = vmcs_read16(GUEST_CS_SELECTOR);
+
+#ifdef CONFIG_X86_32
+		user_mode = (user_mode & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+		user_mode = !!(user_mode & 3);
+#endif
+		perf_save_virt_ip(user_mode, rip);
 		asm("int $2");
+		perf_reset_virt_ip();
+	}
 
 	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 
--- linux-2.6.33/include/linux/perf_event.h	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/include/linux/perf_event.h	2010-03-02 12:26:15.050947780 +0800
@@ -125,8 +125,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PERIOD			= 1U << 8,
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
+	PERF_SAMPLE_KVM				= 1U << 11,
 
-	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
 };
 
 /*
@@ -798,7 +799,7 @@ extern void perf_output_sample(struct pe
 			       struct perf_event_header *header,
 			       struct perf_sample_data *data,
 			       struct perf_event *event);
-extern void perf_prepare_sample(struct perf_event_header *header,
+extern int perf_prepare_sample(struct perf_event_header *header,
 				struct perf_sample_data *data,
 				struct perf_event *event,
 				struct pt_regs *regs);
@@ -858,7 +859,6 @@ extern void perf_bp_event(struct perf_ev
 #ifndef perf_misc_flags
 #define perf_misc_flags(regs)	(user_mode(regs) ? PERF_RECORD_MISC_USER : \
 				 PERF_RECORD_MISC_KERNEL)
-#define perf_instruction_pointer(regs)	instruction_pointer(regs)
 #endif
 
 extern int perf_output_begin(struct perf_output_handle *handle,
@@ -905,6 +905,34 @@ static inline void perf_event_enable(str
 static inline void perf_event_disable(struct perf_event *event)		{ }
 #endif
 
+//#if defined(CONFIG_PERF_EVENTS && CONFIG_PERF_HAS_VIRT_IP)
+#if defined(CONFIG_PERF_EVENTS)
+struct virt_ip_info {
+	int	user_mode;
+	u64	ip;
+};
+
+DECLARE_PER_CPU(struct virt_ip_info, perf_virt_ip);
+extern void perf_save_virt_ip(int user_mode, u64 ip);
+extern void perf_reset_virt_ip(void);
+extern int perf_get_virt_user_mode(void);
+static inline u64 perf_instruction_pointer(struct perf_event *event, struct pt_regs *regs)
+{
+	u64 ip;
+	if (event->attr.sample_type & PERF_SAMPLE_KVM)
+		ip = percpu_read(perf_virt_ip.ip);
+	else
+		ip = instruction_pointer(regs);
+	return ip;
+}
+#else
+static inline void perf_save_virt_ip(int user_mode, u64 ip)	{ }
+static inline void perf_reset_virt_ip(void)	{ }
+static inline int perf_get_virt_user_mode(void)	{ return -1; }
+#define perf_instruction_pointer(event, regs)	instruction_pointer(regs)
+#endif
+
+
 #define perf_output_put(handle, x) \
 	perf_output_copy((handle), &(x), sizeof(x))
 
--- linux-2.6.33/kernel/perf_event.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/kernel/perf_event.c	2010-03-02 12:30:41.236003180 +0800
@@ -3077,7 +3077,38 @@ void perf_output_sample(struct perf_outp
 	}
 }
 
-void perf_prepare_sample(struct perf_event_header *header,
+//#ifdef CONFIG_PERF_VIRT_IP
+DEFINE_PER_CPU(struct virt_ip_info, perf_virt_ip) = {0,0};
+EXPORT_PER_CPU_SYMBOL(perf_virt_ip);
+
+void perf_save_virt_ip(int user_mode, u64 ip)
+{
+	if (!atomic_read(&nr_events))
+		return;
+	percpu_write(perf_virt_ip.user_mode, user_mode);
+	percpu_write(perf_virt_ip.ip, ip);
+}
+EXPORT_SYMBOL_GPL(perf_save_virt_ip);
+
+void perf_reset_virt_ip(void)
+{
+	if (!percpu_read(perf_virt_ip.ip))
+		return;
+	percpu_write(perf_virt_ip.user_mode, 0);
+	percpu_write(perf_virt_ip.ip, 0);
+}
+EXPORT_SYMBOL_GPL(perf_reset_virt_ip);
+
+int perf_get_virt_user_mode(void)
+{
+	if (!percpu_read(perf_virt_ip.ip))
+		return -1;
+	return percpu_read(perf_virt_ip.user_mode);
+}
+
+//#endif
+
+int perf_prepare_sample(struct perf_event_header *header,
 			 struct perf_sample_data *data,
 			 struct perf_event *event,
 			 struct pt_regs *regs)
@@ -3090,10 +3121,15 @@ void perf_prepare_sample(struct perf_eve
 	header->size = sizeof(*header);
 
 	header->misc = 0;
-	header->misc |= perf_misc_flags(regs);
+	if (event->attr.sample_type & PERF_SAMPLE_KVM)
+		header->misc |= percpu_read(perf_virt_ip.user_mode) ? PERF_RECORD_MISC_USER : PERF_RECORD_MISC_KERNEL;
+	else
+		header->misc |= perf_misc_flags(regs);
 
 	if (sample_type & PERF_SAMPLE_IP) {
-		data->ip = perf_instruction_pointer(regs);
+		data->ip = perf_instruction_pointer(event, regs);
+		if (!data->ip)
+			return -1;
 
 		header->size += sizeof(data->ip);
 	}
@@ -3162,6 +3198,8 @@ void perf_prepare_sample(struct perf_eve
 		WARN_ON_ONCE(size & (sizeof(u64)-1));
 		header->size += size;
 	}
+
+	return 0;
 }
 
 static void perf_event_output(struct perf_event *event, int nmi,
@@ -3170,8 +3208,11 @@ static void perf_event_output(struct per
 {
 	struct perf_output_handle handle;
 	struct perf_event_header header;
+	int ret;
 
-	perf_prepare_sample(&header, data, event, regs);
+	ret = perf_prepare_sample(&header, data, event, regs);
+	if (ret)
+		return;
 
 	if (perf_output_begin(&handle, event, header.size, nmi, 1))
 		return;
--- linux-2.6.33/tools/perf/builtin-record.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/builtin-record.c	2010-03-02 13:19:53.564376291 +0800
@@ -251,6 +251,8 @@ static void create_counter(int counter, 
 				  PERF_FORMAT_ID;
 
 	attr->sample_type	|= PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+	if (sample_kvm)
+		attr->sample_type	|= PERF_SAMPLE_KVM;
 
 	if (freq) {
 		attr->sample_type	|= PERF_SAMPLE_PERIOD;
--- linux-2.6.33/tools/perf/builtin-top.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/builtin-top.c	2010-03-01 16:35:41.972067501 +0800
@@ -1091,6 +1091,8 @@ static void start_counter(int i, int cou
 	attr = attrs + counter;
 
 	attr->sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+	if (sample_kvm)
+		attr->sample_type	|= PERF_SAMPLE_KVM;
 
 	if (freq) {
 		attr->sample_type	|= PERF_SAMPLE_PERIOD;
--- linux-2.6.33/tools/perf/perf.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/perf.c	2010-03-02 09:57:03.164001069 +0800
@@ -28,6 +28,8 @@ struct pager_config {
 	int val;
 };
 
+int sample_kvm = 0;
+
 static char debugfs_mntpt[MAXPATHLEN];
 
 static int pager_command_config(const char *var, const char *value, void *data)
@@ -320,6 +322,13 @@ static void handle_internal_command(int 
 		argv[0] = cmd = "help";
 	}
 
+	if (argc > 1 && !strcmp(argv[0], "kvm")) {
+		sample_kvm = 1;
+		argv++;
+		argc--;
+		cmd = argv[0];
+	}
+
 	for (i = 0; i < ARRAY_SIZE(commands); i++) {
 		struct cmd_struct *p = commands+i;
 		if (strcmp(p->cmd, cmd))
--- linux-2.6.33/tools/perf/perf.h	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/perf.h	2010-03-01 16:12:42.470082418 +0800
@@ -131,4 +131,6 @@ struct ip_callchain {
 	u64 ips[0];
 };
 
+extern int sample_kvm;
+
 #endif

