Re: KVM PMU virtualization

"Zhang, Yanmin" <yanmin_zhang@xxxxxxxxxxxxxxx> · Wed, 03 Mar 2010 11:32:32 +0800

On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@xxxxxxxxxxxxxxx> wrote:
> 
> > On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> 
> > > My suggestion, as always, would be to start very simple and very minimal:
> > > 
> > > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel 
> > > image both as a host and as guest (for testing), to not have to deal with 
> > > the symbol space transport problem initially. Enable 'perf kvm record' to 
> > > only record guest events by default. Etc.
> > > 
> > > This alone will be a quite useful result already - and gives a basis for 
> > > further work. No need to spend months to do the big grand design straight 
> > > away, all of this can be done gradually and in the order of usefulness - 
> > > and you'll always have something that actually works (and helps your other 
> > > KVM projects) along the way.
> >
> > It took me for a couple of hours to read the emails on the topic. Based on 
> > above idea, I worked out a prototype which is ugly, but does work with 
> > top/record when both guest side and host side use the same kernel image, 
> > while compiling most needed modules into kernel directly..
> > 
> > The commands are:
> > perf kvm top
> > perf kvm record
> > perf kvm report
> > 
> > They just collect guest kernel hot functions.
> 
> Fantastic, and there's some really interesting KVM guest/host comparison 
> profiles you've done with this prototype!
> 
> > With my patch, I collected dbench data on Nehalem machine (2*4*2 logical 
> > cpu).
> >
> > 1) Vanilla host kernel (6G memory):
> > ------------------------------------------------------------------------------------------------------------------------
> >    PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)
> > ------------------------------------------------------------------------------------------------------------------------
> > 
> >              samples  pcnt function                        DSO
> >              _______ _____ _______________________________ ________________________________________
> > 
> >             99376.00 40.5% ext3_test_allocatable           /lib/modules/2.6.33-kvmymz/build/vmlinux
> >             41239.00 16.8% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              7019.00  2.9% __ticket_spin_lock              /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              5350.00  2.2% copy_user_generic_string        /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              5208.00  2.1% do_get_write_access             /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              4484.00  1.8% journal_dirty_metadata          /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              4078.00  1.7% ext3_free_blocks_sb             /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              3856.00  1.6% ext3_new_blocks                 /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              3485.00  1.4% journal_get_undo_access         /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              2803.00  1.1% ext3_try_to_allocate            /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              2241.00  0.9% __find_get_block                /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              1957.00  0.8% find_revoke_record              /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 
> > 2) guest os: start one guest os with 4GB memory.
> > ------------------------------------------------------------------------------------------------------------------------
> >    PerfTop:     827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)
> > ------------------------------------------------------------------------------------------------------------------------
> > 
> >              samples  pcnt function                        DSO
> >              _______ _____ _______________________________ ________________________________________
> > 
> >             41701.00 28.1% __ticket_spin_lock              /lib/modules/2.6.33-kvmymz/build/vmlinux
> >             33843.00 22.8% ext3_test_allocatable           /lib/modules/2.6.33-kvmymz/build/vmlinux
> >             16862.00 11.4% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              3278.00  2.2% native_flush_tlb_others         /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              3200.00  2.2% copy_user_generic_string        /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              3009.00  2.0% do_get_write_access             /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              2834.00  1.9% journal_dirty_metadata          /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              1965.00  1.3% journal_get_undo_access         /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              1907.00  1.3% ext3_new_blocks                 /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              1790.00  1.2% ext3_free_blocks_sb             /lib/modules/2.6.33-kvmymz/build/vmlinux
> >              1741.00  1.2% find_revoke_record              /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 
> > 
> > With vanilla host kernel, perf top data is stable and spinlock doesn't take 
> > too much cpu time. With guest os, __ticket_spin_lock consumes 28% cpu time, 
> > and sometimes it fluctuates between 9%~28%.
> 
> Looks quite convenient to be able to profile guest and host from the same 
> space, right?
> 
> Btw, another, convenient way to compare profiles is 'perf diff':
> 
> $ perf diff
> 
> # Baseline  Delta          Shared Object  Symbol
> # ........ ..........  .................  ......
> #
>      5.45%     +4.31%  [kernel.kallsyms]  [k] _raw_spin_lock
>      3.52%     +3.74%  [kernel.kallsyms]  [k] copy_user_generic_string
>      3.11%     +4.08%  [kernel.kallsyms]  [k] sock_alloc_send_pskb
>      4.32%     +2.62%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>      3.34%     +2.31%  [kernel.kallsyms]  [k] __cache_free
>      1.49%     +3.49%  [kernel.kallsyms]  [k] _raw_read_lock
>      7.44%     -3.06%  [kernel.kallsyms]  [k] avc_has_perm_noaudit
>      0.14%     +2.49%  [kernel.kallsyms]  [k] skb_release_head_state
>      0.22%     +2.29%  [kernel.kallsyms]  [k] vfs_read
>      1.67%     +0.75%  [kernel.kallsyms]  [k] file_has_perm
>      0.09%     +2.31%  [kernel.kallsyms]  [k] rw_verify_area
> 
> By default it compares the last two profiles done in the current directory, if 
> you have two separate data files, say perf.data.host and perf.data.guest, you 
> can do:
> 
>   perf diff perf.data.host perf.data.guest
> 
> To get a host -> guest slowdown comparison.
Thanks for your good pointer. That would be more user friendly.

> 
> Another suggestion: you could add --guest / --host convenience flag to 'perf 
> kvm', to allow for an easy host/guest comparison workflow:
> 

>   perf kvm record --guest     # creates perf.data.guest
>   perf kvm record --host      # creates perf.data.host
So here we would have a new meaning that --guest means just collects guest os
data and --host just collects host data.

>   perf kvm diff               # shortcut for: 'perf diff perf.data.host perf.data.guest'
> 
> > Another interesting finding is aim7. If I start aim7 on tmpfs testing in 
> > guest os with 1GB memory, the login hangs and cpu is busy. With the new 
> > patch, I could check what happens in guest os, where spinlock is busy and 
> > kernel is shrinking memory mostly from slab.
> 
> This is exactly the kind of usage proper perf events integration would allow! 
> Your 'perf kvm' looks very powerful, even in its early prototype.
> 
> Now, regarding the technical details of your patch:
> 
> > +++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c	2010-03-02 10:21:57.588586179 +0800
> 
> > @@ -3553,8 +3554,19 @@ static void vmx_complete_interrupts(stru
> >  
> >  	/* We need to handle NMIs before interrupts are enabled */
> >  	if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
> > -	    (exit_intr_info & INTR_INFO_VALID_MASK))
> > +	    (exit_intr_info & INTR_INFO_VALID_MASK)) {
> > +		u64 rip = vmcs_readl(GUEST_RIP);
> > +		int user_mode = vmcs_read16(GUEST_CS_SELECTOR);
> > +
> > +#ifdef CONFIG_X86_32
> > +		user_mode = (user_mode & SEGMENT_RPL_MASK) == USER_RPL;
> > +#else
> > +		user_mode = !!(user_mode & 3);
> > +#endif
> 
> This test could use a helper i guess, to remove the #ifdef?
Right.

> 
> > +		perf_save_virt_ip(user_mode, rip);
> >  		asm("int $2");
> > +		perf_reset_virt_ip();
> > +	}
> 
> > --- linux-2.6.33/include/linux/perf_event.h	2010-02-25 02:52:17.000000000 +0800
> > +++ linux-2.6.33_perfkvm/include/linux/perf_event.h	2010-03-02 12:26:15.050947780 +0800
> > @@ -125,8 +125,9 @@ enum perf_event_sample_format {
> >  	PERF_SAMPLE_PERIOD			= 1U << 8,
> >  	PERF_SAMPLE_STREAM_ID			= 1U << 9,
> >  	PERF_SAMPLE_RAW				= 1U << 10,
> > +        PERF_SAMPLE_KVM                         = 1U << 11,
> 
> yep, we can extend it like this, but maybe there's another method:
Here PERF_SAMPLE_KVM is just used by tool perf to notify kernel that we
want to collect KVM guest os data instead of whole system. It isn't
used when parsing data. Interface sys_perf_event_open has no flag to notify
that we want to collect guest os event. Another solution is
add new member in perf_event_attr, such like
typedef enum {
	PERF_SAMPLE_HOST,
	PERF_SAMPLE_GUEST
} perf_os;
per_event_attr {
	...
	perf_os os;
	...
};

> 
> > +//#if defined(CONFIG_PERF_EVENTS && CONFIG_PERF_HAS_VIRT_IP)
> > +#if defined(CONFIG_PERF_EVENTS)
> > +struct virt_ip_info {
> > +	int	user_mode;
> > +	u64	ip;
> > +};
> 
> basically what we want is a 'this is guest user mode' differentiator in the 
> call frame stream data, right?
Here user mode is to match PERF_RECORD_MISC_USER/PERF_RECORD_MISC_KERNEL.
If checking current codes of tool perf and kernel, we could
find they uses both perf_callchain_context and
PERF_RECORD_MISC_USER/PERF_RECORD_MISC_KERNEL. The 1st is used for callchain
and the 2nd is used for non-callchain. As for the callchain, we could use
perf_callchain_context=>PERF_CONTEXT_GUEST_KERNEL.

As for non callchain, the doable solution is just like what Peter suggested
that we need add a new PERF_RECORD_MISC_CPUMODE_MASK, such like
PERF_RECORD_MISC_GUEST_KERNEL and PERF_RECORD_MISC_GUEST_USER. The new flag
will be used in perf's function thread__find_addr_location to find the
right map for guest os.

> 
> We already have such separators (it's just not used very much):
> 
> enum perf_callchain_context {
>         PERF_CONTEXT_HV                 = (__u64)-32,
>         PERF_CONTEXT_KERNEL             = (__u64)-128,
>         PERF_CONTEXT_USER               = (__u64)-512,
> 
>         PERF_CONTEXT_GUEST              = (__u64)-2048,
>         PERF_CONTEXT_GUEST_KERNEL       = (__u64)-2176,
>         PERF_CONTEXT_GUEST_USER         = (__u64)-2560,
> 
>         PERF_CONTEXT_MAX                = (__u64)-4095,
> };
> 
> Basically KVM's guest context could be expressed as a PERF_CONTEXT_GUEST 
> separator pushed into the stream (and then recognized by 'perf kvm' - and 
> generally by 'perf report' et al), followed by the virtual RIP as a regular 
> IP.
Right when we use it for callchain.

> 
> That way we dont need PERF_SAMPLE_KVM. Note that the tooling doesnt know about 
> such separators very well yet, so there might be some gotchas along the way. 
> Please let us know if you run into any problems here.
> 
> Peter, what's your preference for this KVM profiling ABI detail?
> 
> > +DECLARE_PER_CPU(struct virt_ip_info, perf_virt_ip);
> > +extern void perf_save_virt_ip(int user_mode, u64 ip);
> > +extern void perf_reset_virt_ip(void);
> > +extern int perf_get_virt_user_mode(void);
> > +static inline u64 perf_instruction_pointer(struct perf_event *event, struct pt_regs *regs)
> > +{
> > +	u64 ip;
> > +	if (event->attr.sample_type & PERF_SAMPLE_KVM)
> > +		ip = percpu_read(perf_virt_ip.ip);
> > +	else
> > +		ip = instruction_pointer(regs);
> > +	return ip;
> 
> And this complication we can perhaps avoid by extending the stack-trace engine 
> (arch/x86/kernel/stacktrace.c) to 'know' about virtual guest RIPs 
> automatically?
> 
> ( Note that this would be a bit of an advantage for regular oops printing as 
>   well: if a KVM thread crashes or generates a stack-dump it could 
>   automatically print the guest virtual RIP as well. )
> 
> Walking into the guest context is more complex, but not impossible either - 
> and that bit definitely has to be done not in an NMI context but in the KVM 
> thread context.
> 
> So maybe we should not extend dump_stack_trace() after all (it cannot really 
> work from NMI or from oops context), but add the KVM variant and let it 
> directly inject into the perf data stream? (like your patch does it pretty 
> much)
> 
> > +++ linux-2.6.33_perfkvm/tools/perf/perf.c	2010-03-02 09:57:03.164001069 +0800
> 
> > +	if (argc > 1 && !strcmp(argv[0], "kvm")) {
> > +		sample_kvm = 1;
> > +		argv++;
> > +		argc--;
> > +		cmd = argv[0];
> > +	}
> 
> this is fine as a quick hack. For the real thing i suspect we want to add 
> 'perf kvm' as a real builtin-kvm.c command - see builtin-sched.c and 
> builtin-lock.c about how to create such 'subsystem commands'. builtin-record.c 
> can be extended with various host/guest recording detail switches (off by 
> default), and builtin-kvm.c can use those. For example builtin-sched.c 
> implements 'perf sched record' the following way:
> 
> static const char *record_args[] = {
>         "record",
>         "-a",
>         "-R",
>         "-M",
>         "-f",
>         "-m", "1024",
>         "-c", "1",
>         "-e", "sched:sched_switch:r",
>         "-e", "sched:sched_stat_wait:r",
>         "-e", "sched:sched_stat_sleep:r",
>         "-e", "sched:sched_stat_iowait:r",
>         "-e", "sched:sched_stat_runtime:r",
>         "-e", "sched:sched_process_exit:r",
>         "-e", "sched:sched_process_fork:r",
>         "-e", "sched:sched_wakeup:r",
>         "-e", "sched:sched_migrate_task:r",
> };
> 
> So it simply passes these arguments to perf record. 'perf kvm record' could do 
> something similar.
That's a good pointer. I will try it late.

> 
> Anyway, your patch already shows great progress and it's the kind of direction 
> for enhanced performance analysis of KVM that i think would be very fruitful 
> to KVM developers to pursue.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html