On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> * Joerg Roedel <joro@xxxxxxxxxx> wrote:
>
> > On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> > > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > > >
> > > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > > > host.
> > > >
> > > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > > > configured to count only when in guest mode. Perf needs to be aware of
> > > > that and fetch the rip from a different place when monitoring a guest.
> >
> > > The idea is we want to measure both host and guest at the same time, and
> > > compare all the hot functions fairly.
> >
> > So you want to measure while the guest vcpu is running and the vmexit
> > path of that vcpu (including qemu userspace part) together? The
> > challenge here is to find out if a performance event originated in guest
> > mode or in host mode.
> > But we can check for that in the nmi-protected part of the vmexit path.
>
> As far as instrumentation goes, virtualization is simply another 'PID
> dimension' of measurement.
>
> Today we can isolate system performance measurements/events to the following
> domains:
>
>  - per system
>  - per cpu
>  - per task
>
> ( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user'
>   domain separation, and we have some ABI details for all that but it's by no
>   means complete. Anton is using the PowerPC bits AFAIK, so it already works
>   to a certain degree. )
>
> When extending measurements to KVM, we want two things:
>
>  - user friendliness: instead of having to check 'ps' and figure out which
>    Qemu thread is the KVM thread we want to profile, just give a convenience
>    namespace to access guest profiling info. -G ought to map to the first
>    currently running KVM guest it can find. (which would match like 90% of the
>    cases) - etc. No ifs and when. If 'perf kvm top' doesnt show something
>    useful by default the whole effort is for naught.
>
>  - Extend core facilities and enable the following measurement dimensions:
>
>      host-kernel-space
>      host-user-space
>      guest-kernel-space
>      guest-user-space
>
>    on a per guest basis. We want to be able to measure just what the guest
>    does, and we want to be able to measure just what the host does.
>
>    Some of this the hardware helps us with (say only measuring host kernel
>    events is possible), some has to be done by fiddling with event
>    enable/disable at vm-exit / vm-entry time.
>
> My suggestion, as always, would be to start very simple and very minimal:
>
> Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image
> both as a host and as guest (for testing), to not have to deal with the symbol
> space transport problem initially. Enable 'perf kvm record' to only record
> guest events by default. Etc.
>
> This alone will be a quite useful result already - and gives a basis for
> further work. No need to spend months to do the big grand design straight
> away, all of this can be done gradually and in the order of usefulness - and
> you'll always have something that actually works (and helps your other KVM
> projects) along the way.

It took me a couple of hours to read the emails on this topic. Based on the
above idea, I worked out a prototype. It is ugly, but it works with top/record
when the guest and the host use the same kernel image and most of the needed
modules are compiled directly into the kernel.

The commands are:

perf kvm top
perf kvm record
perf kvm report

They only collect guest kernel hot functions.

>
> [ And, as so often, once you walk that path, that grand scheme you are
>   thinking about right now might easily become last year's really bad idea ;-) ]
>
> So please start walking the path and experience the challenges first-hand.

With my patch, I collected dbench data on a Nehalem machine (2*4*2 logical CPUs).

1) Vanilla host kernel (6G memory):

------------------------------------------------------------------------------------------------------------------------
   PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)
------------------------------------------------------------------------------------------------------------------------

 samples  pcnt function                        DSO
 _______ _____ _______________________________ ________________________________________

99376.00 40.5% ext3_test_allocatable           /lib/modules/2.6.33-kvmymz/build/vmlinux
41239.00 16.8% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
 7019.00  2.9% __ticket_spin_lock              /lib/modules/2.6.33-kvmymz/build/vmlinux
 5350.00  2.2% copy_user_generic_string        /lib/modules/2.6.33-kvmymz/build/vmlinux
 5208.00  2.1% do_get_write_access             /lib/modules/2.6.33-kvmymz/build/vmlinux
 4484.00  1.8% journal_dirty_metadata          /lib/modules/2.6.33-kvmymz/build/vmlinux
 4078.00  1.7% ext3_free_blocks_sb             /lib/modules/2.6.33-kvmymz/build/vmlinux
 3856.00  1.6% ext3_new_blocks                 /lib/modules/2.6.33-kvmymz/build/vmlinux
 3485.00  1.4% journal_get_undo_access         /lib/modules/2.6.33-kvmymz/build/vmlinux
 2803.00  1.1% ext3_try_to_allocate            /lib/modules/2.6.33-kvmymz/build/vmlinux
 2241.00  0.9% __find_get_block                /lib/modules/2.6.33-kvmymz/build/vmlinux
 1957.00  0.8% find_revoke_record              /lib/modules/2.6.33-kvmymz/build/vmlinux

2) guest os: start one guest os with 4GB memory.

------------------------------------------------------------------------------------------------------------------------
   PerfTop:     827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)
------------------------------------------------------------------------------------------------------------------------

 samples  pcnt function                        DSO
 _______ _____ _______________________________ ________________________________________

41701.00 28.1% __ticket_spin_lock              /lib/modules/2.6.33-kvmymz/build/vmlinux
33843.00 22.8% ext3_test_allocatable           /lib/modules/2.6.33-kvmymz/build/vmlinux
16862.00 11.4% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
 3278.00  2.2% native_flush_tlb_others         /lib/modules/2.6.33-kvmymz/build/vmlinux
 3200.00  2.2% copy_user_generic_string        /lib/modules/2.6.33-kvmymz/build/vmlinux
 3009.00  2.0% do_get_write_access             /lib/modules/2.6.33-kvmymz/build/vmlinux
 2834.00  1.9% journal_dirty_metadata          /lib/modules/2.6.33-kvmymz/build/vmlinux
 1965.00  1.3% journal_get_undo_access         /lib/modules/2.6.33-kvmymz/build/vmlinux
 1907.00  1.3% ext3_new_blocks                 /lib/modules/2.6.33-kvmymz/build/vmlinux
 1790.00  1.2% ext3_free_blocks_sb             /lib/modules/2.6.33-kvmymz/build/vmlinux
 1741.00  1.2% find_revoke_record              /lib/modules/2.6.33-kvmymz/build/vmlinux

With the vanilla host kernel, the perf top data is stable and spinlocks don't
take much CPU time. In the guest OS, __ticket_spin_lock consumes 28% of the CPU
time, and its share fluctuates between 9% and 28%.

Another interesting finding comes from aim7. If I start the aim7 tmpfs test in
a guest OS with 1GB of memory, the login hangs and the CPU is busy. With the
new patch, I could check what happens in the guest OS: the spinlock is busy and
the kernel is mostly shrinking memory from slab.

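For readers who want the idea without the kernel details, here is a minimal
user-space sketch of what the prototype does. It is only an illustration under
stated assumptions, not the kernel code: a single global stands in for the
per-CPU variable, the helper names (save_virt_ip, reset_virt_ip, sample_ip,
guest_user_mode) are hypothetical, and the guest RIP and CS selector are
passed in as plain arguments instead of being read from the VMCS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU slot used by the prototype. */
struct virt_ip_info {
	int user_mode;
	uint64_t ip;	/* 0 means "no guest sample pending" */
};

static struct virt_ip_info virt_ip;

/* On x86 the low two bits of the CS selector hold the privilege level;
 * a nonzero value means the guest was not running in ring 0. */
static int guest_user_mode(uint16_t cs_selector)
{
	return (cs_selector & 3) != 0;
}

/* Called on the vmexit path, before the NMI is re-raised on the host. */
static void save_virt_ip(uint16_t cs_selector, uint64_t rip)
{
	virt_ip.user_mode = guest_user_mode(cs_selector);
	virt_ip.ip = rip;
}

/* Called right after the NMI handler has run. */
static void reset_virt_ip(void)
{
	virt_ip.user_mode = 0;
	virt_ip.ip = 0;
}

/*
 * Choose the IP and mode for one sample.
 * want_guest == 0: use the host values as-is.
 * want_guest != 0: use the saved guest values, or return -1 to drop the
 *                  sample when the NMI did not interrupt guest mode.
 */
static int sample_ip(int want_guest, uint64_t host_ip, int host_user,
		     uint64_t *ip, int *user)
{
	if (!want_guest) {
		*ip = host_ip;
		*user = host_user;
		return 0;
	}
	if (!virt_ip.ip)
		return -1;
	*ip = virt_ip.ip;
	*user = virt_ip.user_mode;
	return 0;
}
```

The point of the sketch is that a guest-only event never reports a host
address: it either takes the saved guest values or drops the sample. Calling
sample_ip() with want_guest set before any save_virt_ip() call returns -1,
which corresponds to an NMI that arrived while the host was running.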
--- linux-2.6.33/arch/x86/kernel/cpu/perf_event.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kernel/cpu/perf_event.c	2010-03-01 15:57:51.672990615 +0800
@@ -1621,6 +1621,7 @@ static void intel_pmu_drain_bts_buffer(s
 	struct perf_event_header header;
 	struct perf_sample_data data;
 	struct pt_regs regs;
+	int ret;
 
 	if (!event)
 		return;
@@ -1647,7 +1648,9 @@ static void intel_pmu_drain_bts_buffer(s
 	 * We will overwrite the from and to address before we output
 	 * the sample.
 	 */
-	perf_prepare_sample(&header, &data, event, &regs);
+	ret = perf_prepare_sample(&header, &data, event, &regs);
+	if (ret)
+		return;
 
 	if (perf_output_begin(&handle, event,
 			      header.size * (top - at), 1, 1))
--- linux-2.6.33/arch/x86/kvm/vmx.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c	2010-03-02 10:21:57.588586179 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3553,8 +3554,19 @@ static void vmx_complete_interrupts(stru
 
 	/* We need to handle NMIs before interrupts are enabled */
 	if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
-	    (exit_intr_info & INTR_INFO_VALID_MASK))
+	    (exit_intr_info & INTR_INFO_VALID_MASK)) {
+		u64 rip = vmcs_readl(GUEST_RIP);
+		int user_mode = vmcs_read16(GUEST_CS_SELECTOR);
+
+#ifdef CONFIG_X86_32
+		user_mode = (user_mode & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+		user_mode = !!(user_mode & 3);
+#endif
+		perf_save_virt_ip(user_mode, rip);
 		asm("int $2");
+		perf_reset_virt_ip();
+	}
 
 	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 
--- linux-2.6.33/include/linux/perf_event.h	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/include/linux/perf_event.h	2010-03-02 12:26:15.050947780 +0800
@@ -125,8 +125,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PERIOD			= 1U << 8,
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
+	PERF_SAMPLE_KVM				= 1U << 11,
 
-	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
 };
 
 /*
@@ -798,7 +799,7 @@ extern void perf_output_sample(struct pe
 				struct perf_event_header *header,
 				struct perf_sample_data *data,
 				struct perf_event *event);
-extern void perf_prepare_sample(struct perf_event_header *header,
+extern int perf_prepare_sample(struct perf_event_header *header,
 				struct perf_sample_data *data,
 				struct perf_event *event,
 				struct pt_regs *regs);
@@ -858,7 +859,6 @@ extern void perf_bp_event(struct perf_ev
 #ifndef perf_misc_flags
 #define perf_misc_flags(regs)	(user_mode(regs) ? PERF_RECORD_MISC_USER : \
 				 PERF_RECORD_MISC_KERNEL)
-#define perf_instruction_pointer(regs)	instruction_pointer(regs)
 #endif
 
 extern int perf_output_begin(struct perf_output_handle *handle,
@@ -905,6 +905,34 @@ static inline void perf_event_enable(str
 static inline void perf_event_disable(struct perf_event *event)		{ }
 #endif
 
+//#if defined(CONFIG_PERF_EVENTS && CONFIG_PERF_HAS_VIRT_IP)
+#if defined(CONFIG_PERF_EVENTS)
+struct virt_ip_info {
+	int user_mode;
+	u64 ip;
+};
+
+DECLARE_PER_CPU(struct virt_ip_info, perf_virt_ip);
+extern void perf_save_virt_ip(int user_mode, u64 ip);
+extern void perf_reset_virt_ip(void);
+extern int perf_get_virt_user_mode(void);
+static inline u64 perf_instruction_pointer(struct perf_event *event, struct pt_regs *regs)
+{
+	u64 ip;
+	if (event->attr.sample_type & PERF_SAMPLE_KVM)
+		ip = percpu_read(perf_virt_ip.ip);
+	else
+		ip = instruction_pointer(regs);
+	return ip;
+}
+#else
+static inline void perf_save_virt_ip(int user_mode, u64 ip) { }
+static inline void perf_reset_virt_ip(void) { }
+static inline int perf_get_virt_user_mode(void) { return -1; }
+#define perf_instruction_pointer(event, regs) instruction_pointer(regs)
+#endif
+
+
 #define perf_output_put(handle, x) \
 	perf_output_copy((handle), &(x), sizeof(x))
 
--- linux-2.6.33/kernel/perf_event.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/kernel/perf_event.c	2010-03-02 12:30:41.236003180 +0800
@@ -3077,7 +3077,38 @@ void perf_output_sample(struct perf_outp
 	}
 }
 
-void perf_prepare_sample(struct perf_event_header *header,
+//#ifdef CONFIG_PERF_VIRT_IP
+DEFINE_PER_CPU(struct virt_ip_info, perf_virt_ip) = {0,0};
+EXPORT_PER_CPU_SYMBOL(perf_virt_ip);
+
+void perf_save_virt_ip(int user_mode, u64 ip)
+{
+	if (!atomic_read(&nr_events))
+		return;
+	percpu_write(perf_virt_ip.user_mode, user_mode);
+	percpu_write(perf_virt_ip.ip, ip);
+}
+EXPORT_SYMBOL_GPL(perf_save_virt_ip);
+
+void perf_reset_virt_ip(void)
+{
+	if (!percpu_read(perf_virt_ip.ip))
+		return;
+	percpu_write(perf_virt_ip.user_mode, 0);
+	percpu_write(perf_virt_ip.ip, 0);
+}
+EXPORT_SYMBOL_GPL(perf_reset_virt_ip);
+
+int perf_get_virt_user_mode(void)
+{
+	if (!percpu_read(perf_virt_ip.ip))
+		return -1;
+	return percpu_read(perf_virt_ip.user_mode);
+}
+
+//#endif
+
+int perf_prepare_sample(struct perf_event_header *header,
 			 struct perf_sample_data *data,
 			 struct perf_event *event,
 			 struct pt_regs *regs)
@@ -3090,10 +3121,15 @@ void perf_prepare_sample(struct perf_eve
 	header->size = sizeof(*header);
 
 	header->misc = 0;
-	header->misc |= perf_misc_flags(regs);
+	if (event->attr.sample_type & PERF_SAMPLE_KVM)
+		header->misc |= percpu_read(perf_virt_ip.user_mode)?PERF_RECORD_MISC_USER:PERF_RECORD_MISC_KERNEL;
+	else
+		header->misc |= perf_misc_flags(regs);
 
 	if (sample_type & PERF_SAMPLE_IP) {
-		data->ip = perf_instruction_pointer(regs);
+		data->ip = perf_instruction_pointer(event, regs);
+		if (!data->ip)
+			return -1;
 		header->size += sizeof(data->ip);
 	}
 
@@ -3162,6 +3198,8 @@ void perf_prepare_sample(struct perf_eve
 		WARN_ON_ONCE(size & (sizeof(u64)-1));
 		header->size += size;
 	}
+
+	return 0;
 }
 
 static void perf_event_output(struct perf_event *event, int nmi,
@@ -3170,8 +3208,11 @@ static void perf_event_output(struct per
 {
 	struct perf_output_handle handle;
 	struct perf_event_header header;
+	int ret;
 
-	perf_prepare_sample(&header, data, event, regs);
+	ret = perf_prepare_sample(&header, data, event, regs);
+	if (ret)
+		return;
 
 	if (perf_output_begin(&handle, event, header.size, nmi, 1))
 		return;
--- linux-2.6.33/tools/perf/builtin-record.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/builtin-record.c	2010-03-02 13:19:53.564376291 +0800
@@ -251,6 +251,8 @@ static void create_counter(int counter,
 				  PERF_FORMAT_ID;
 
 	attr->sample_type	|= PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+	if (sample_kvm)
+		attr->sample_type	|= PERF_SAMPLE_KVM;
 
 	if (freq) {
 		attr->sample_type	|= PERF_SAMPLE_PERIOD;
--- linux-2.6.33/tools/perf/builtin-top.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/builtin-top.c	2010-03-01 16:35:41.972067501 +0800
@@ -1091,6 +1091,8 @@ static void start_counter(int i, int cou
 	attr = attrs + counter;
 
 	attr->sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+	if (sample_kvm)
+		attr->sample_type	|= PERF_SAMPLE_KVM;
 
 	if (freq) {
 		attr->sample_type	|= PERF_SAMPLE_PERIOD;
--- linux-2.6.33/tools/perf/perf.c	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/perf.c	2010-03-02 09:57:03.164001069 +0800
@@ -28,6 +28,8 @@ struct pager_config {
 	int val;
 };
 
+int sample_kvm = 0;
+
 static char debugfs_mntpt[MAXPATHLEN];
 
 static int pager_command_config(const char *var, const char *value, void *data)
@@ -320,6 +322,13 @@ static void handle_internal_command(int
 		argv[0] = cmd = "help";
 	}
 
+	if (argc > 1 && !strcmp(argv[0], "kvm")) {
+		sample_kvm = 1;
+		argv++;
+		argc--;
+		cmd = argv[0];
+	}
+
 	for (i = 0; i < ARRAY_SIZE(commands); i++) {
 		struct cmd_struct *p = commands+i;
 		if (strcmp(p->cmd, cmd))
--- linux-2.6.33/tools/perf/perf.h	2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/perf.h	2010-03-01 16:12:42.470082418 +0800
@@ -131,4 +131,6 @@ struct ip_callchain {
 	u64 ips[0];
 };
 
+extern int sample_kvm;
+
 #endif