This is the guest side code of lazy tscdeadline. If cpuid tells us that
lazy tscdeadline is enabled, switch .set_next_event to the lazy
tscdeadline version.

Let's explain the core idea here. Every time the guest starts or modifies
an hrtimer, it needs to write the tsc deadline msr; a vm-exit occurs and
the host arms a hv or sw timer for it. However, in workloads that set up
timers frequently, the tscdeadline msr is usually overwritten many times
before the timer expires.

w: write msr
x: vm-exit
t: hv or sw timer

1. write to msr with t1

Guest   w1
        ----------------------------------------> Time
Host    x1          t1

...

n. write to msr with tn

Guest           wn
        ------------------------------------------> Time
Host            xn      tn-1 -> tn

What this patch wants to do is eliminate the vm-exits x2 ... xn.

First, we have two fields shared between guest and host, as with other
pv features:
 - armed, the tscdeadline value that has a timer on the host side,
   only updated by the HOST side
 - pending, the next tscdeadline value, only updated by the GUEST side

1. write to msr with t1

armed   : t1
pending : t1

Guest   w1
        ----------------------------------------> Time
Host    x1          t1

   A vm-exit occurs and the host arms a timer for t1.

2. write to msr with t2

armed   : t1
pending : t2

Guest       w2
        ------------------------------------------> Time
Host                t1

   The tsc deadline that has already been armed, namely t1, is smaller
   than t2, so there is no need to write the msr; just update pending
   to t2.

...

n. write to msr with tn

armed   : t1
pending : tn

Guest           wn
        ------------------------------------------> Time
Host                t1

   Similar to step 2, just update the pending field with tn, no vm-exit.

n+1. t1 expires, arm tn

armed   : tn
pending : tn

Guest
        ------------------------------------------> Time
Host                t1 ------> tn

When we try to update the tscdeadline, if the 'armed' field is non-zero
and not later than the new deadline, then we know the host already has a
timer that fires no later than the new deadline, so the msr write can be
skipped and only 'pending' is updated.
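The steps above can be modelled in user space as a rough sketch. The
struct and function names here are hypothetical and not part of this
patch, the MSR write is replaced by a counter, and the host expiry
handler is only an assumption about what the host side does at step
n+1. The memory fence between the store to pending and the load of
armed is left out because the model is single-threaded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU area shared by guest and host. */
struct lazy_tscdeadline {
	uint64_t armed;   /* deadline the host has a timer for; 0 means none */
	uint64_t pending; /* latest deadline requested by the guest */
};

/* Counts modelled MSR writes, i.e. vm-exits. */
static unsigned int nr_msr_writes;

/* Model of the vm-exit path: the host arms a timer and publishes it. */
static void model_wrmsr(struct lazy_tscdeadline *s, uint64_t deadline)
{
	nr_msr_writes++;
	s->armed = deadline;
}

/*
 * Guest side: always record the new deadline in pending, but only
 * write the MSR when no timer is armed or the new deadline is earlier
 * than the armed one. Returns true if a (modelled) vm-exit happened.
 */
static bool guest_set_deadline(struct lazy_tscdeadline *s, uint64_t deadline)
{
	s->pending = deadline;
	if (!s->armed || deadline < s->armed) {
		model_wrmsr(s, deadline);
		return true;
	}
	return false;
}

/*
 * Assumed host behaviour when the armed timer fires (step n+1):
 * re-arm for a later pending deadline, otherwise clear armed.
 */
static void host_timer_expired(struct lazy_tscdeadline *s)
{
	if (s->pending > s->armed)
		s->armed = s->pending;
	else
		s->armed = 0;
}
```

In this model, setting deadlines 100, 200 and 300 in a row on a zeroed
struct costs one modelled MSR write instead of three, and
host_timer_expired() then moves armed from 100 to 300.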

Signed-off-by: Li Shujin <arkinjob@xxxxxxxxxxx>
Signed-off-by: Wang Jianchao <jianchwa@xxxxxxxxxxx>
---
 arch/x86/kernel/apic/apic.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index af49e24..5aea74f 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -62,6 +62,9 @@
 #include <asm/intel-family.h>
 #include <asm/irq_regs.h>
 #include <asm/cpu.h>
+#include <linux/kvm_para.h>
+
+DECLARE_PER_CPU_DECRYPTED(struct kvm_lazy_tscdeadline, kvm_lazy_tscdeadline);
 
 unsigned int num_processors;
 
@@ -495,6 +498,26 @@ static int lapic_next_deadline(unsigned long delta,
 	return 0;
 }
 
+static int kvm_lapic_next_deadline(unsigned long delta,
+				   struct clock_event_device *evt)
+{
+	struct kvm_lazy_tscdeadline *lazy_tscddl = this_cpu_ptr(&kvm_lazy_tscdeadline);
+	u64 tsc;
+
+	tsc = rdtsc() + (((u64) delta) * TSC_DIVISOR);
+	lazy_tscddl->pending = tsc;
+	/*
+	 * There fence can have two functions:
+	 * - avoid the wrmsrl is reordered
+	 * - avoid the reorder of writing to pending and reading from armed
+	 */
+	weak_wrmsr_fence();
+	if (!lazy_tscddl->armed || tsc < lazy_tscddl->armed)
+		wrmsrl(MSR_IA32_TSC_DEADLINE, tsc);
+
+	return 0;
+}
+
 static int lapic_timer_shutdown(struct clock_event_device *evt)
 {
 	unsigned int v;
@@ -639,7 +662,12 @@ static void setup_APIC_timer(void)
 		levt->name = "lapic-deadline";
 		levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC |
 				    CLOCK_EVT_FEAT_DUMMY);
-		levt->set_next_event = lapic_next_deadline;
+		if (kvm_para_available() &&
+		    kvm_para_has_feature(KVM_FEATURE_LAZY_TSCDEADLINE)) {
+			levt->set_next_event = kvm_lapic_next_deadline;
+		} else {
+			levt->set_next_event = lapic_next_deadline;
+		}
 		clockevents_config_and_register(levt,
 						tsc_khz * (1000 / TSC_DIVISOR),
 						0xF, ~0UL);
-- 
2.7.4