Hi,

This patchset attempts to introduce a new pv feature, lazy tscdeadline.

Before this patchset, every time the guest starts or modifies an hrtimer,
it needs to write the tsc deadline msr. A vm-exit occurs and the host arms
a hv or sw timer for it.

   w: write msr
   x: vm-exit
   t: hv or sw timer

   Guest    w
          ---------------------------------------> Time
   Host     x    t

However, in workloads that set up timers frequently, the tscdeadline msr is
usually overwritten many times before the timer expires, and every time the
guest modifies the tscdeadline, a vm-exit occurs.

1. write to msr with t1

   Guest    w1
          ----------------------------------------> Time
   Host     x1   t1

2. write to msr with t2

   Guest         w2
          ------------------------------------------> Time
   Host          x2   t1->t2

3. write to msr with t3

   Guest              w3
          ------------------------------------------> Time
   Host               x3   t2->t3

4. write to msr with t4

   Guest                   w4
          ------------------------------------------> Time
   Host                    x4   t3->t4

What this patchset wants to do is eliminate the vm-exits x2, x3 and x4, as
follows.

First, two fields are shared between guest and host, as with other pv
features:

 - armed, the value of tscdeadline that has a timer on the host side. It is
   only updated by the __host__ side. Every time the host arms a timer in
   tscdeadline mode, it updates @armed.
 - pending, the next value of tscdeadline. It is only updated by the
   __guest__ side. Every time the guest invokes kvm_lapic_next_deadline
   (the lazy_tscdeadline version of the set_next_event callback), it
   updates @pending, whether or not it goes on to wrmsrl.

On the guest side, to set tscdeadline to t, we update @pending first, then:

 - if @armed is zero, or t < @armed, go on to wrmsrl and trap into the host
   to arm the timer
 - if t >= @armed, just return

On the host side, when the armed timer fires:

 - if @pending == @armed, inject the local timer interrupt
 - if @pending > @armed, just re-arm the timer
 - the case @pending < @armed should not happen, because the guest traps
   into the host to update @armed in that case

1. write to msr with t1

   armed   : t1
   pending : t1

   Guest    w1
          ----------------------------------------> Time
   Host     x1   t1

   A vm-exit occurs and the host arms a timer for t1.

2. write to msr with t2

   armed   : t1
   pending : t2

   Guest         w2
          ------------------------------------------> Time
   Host               t1

   The tsc deadline that has been armed, namely t1, is smaller than t2, so
   the guest does not need to write the msr and only updates pending.

3. write to msr with t3

   armed   : t1
   pending : t3

   Guest              w3
          ------------------------------------------> Time
   Host               t1

   Similar to step 2, only the pending field is updated with t3, no vm-exit.

4. write to msr with t4

   armed   : t1
   pending : t4

   Guest                   w4
          ------------------------------------------> Time
   Host               t1

   Similar to step 2, only the pending field is updated with t4, no vm-exit.

5. t1 expires, arm t4

   armed   : t4
   pending : t4

   Guest
          ------------------------------------------> Time
   Host               t1  ------> t4

   When t1 fires, the host checks the pending field and re-arms a timer
   based on it.

In this case, the vm-exits caused by writing the tsc deadline msr for t2, t3
and t4 are eliminated. Even though the expiry of t1 causes one extra
preemption-timer vm-exit, we still save 2 vm-exits in this case.

Here are the test results of netperf TCP-RR on loopback
(Close: lazy tscdeadline off, Open: lazy tscdeadline on):

                        Close          Open
--------------------------------------------
VM-Exit
    sum              10485133       6177331
    halt              2082894       2958096
    msr-write         8323993       3140474
    preemption-timer    36036         42064
--------------------------------------------
MSR
    sum               8324075       3140518
    apic-icr          2115802       2969154
    tsc-deadline      6208273        171364
--------------------------------------------
Interrupts
    236                 44003         55059
    251               2081941       2943361

Note:
 - Host kernel is 6.5-rc1
 - Guest kernel is 5.14 + patch

This patchset includes 6 patches.

The 1st patch, KVM: x86: add msr register and data structure for lazy tscdeadline
   Add the msr register, feature flag and data structure for this new
   feature. There are no functional changes in this patch.

The 2nd patch, KVM: x86: exchange info about lazy_tscdeadline with msr
   Exchange the gpa of the kvm_lazy_tscdeadline data structure between
   guest and host.

The 3rd patch, x86/apic: switch set_next_event to lazy tscdeadline version
   If lazy_tscdeadline is enabled, switch the set_next_event callback from
   lapic_next_deadline to kvm_lapic_next_deadline.

The 4th patch, KVM: x86: do lazy_tscdeadline init and exit
   Do the init and exit work for lazy_tscdeadline. The init path pins the
   page in which the gpa of kvm_lazy_tscdeadline is located and maps it
   into kernel space. The exit path releases them.

The 5th patch, KVM: X86: add lazy tscdeadline support to reduce vm-exit of msr-write
   Introduce the update, kick and clear operations that make
   lazy_tscdeadline work on the host side:
   - UPDATE, when the guest updates the tsc deadline msr, we need to update
     the 'armed' field of kvm_lazy_tscdeadline.
   - KICK, when the hv or sw timer fires, we need to check the 'pending'
     field to decide whether to re-arm the timer or inject the local timer
     vector. The sw timer does not run in vcpu context, so a new kvm req is
     added to handle the kick in vcpu context.
   - CLEAR, this is a bit tricky. We need to clear the 'armed' field
     properly, otherwise the guest OS can hang.

The 6th patch, KVM: x86: add debugfs file for lazy tscdeadline per vcpu
   Add a debug entry for this feature.

Changes from V2:
 - Rewrite the comments and charts in the cover letter and patches
 - Move weak_wrmsr_fence after the update of @pending to avoid reordering
   of the @pending update and the @armed read
 - Split the original 3rd patch into 3 to reduce the size of the patches
 - Avoid injecting an interrupt into the guest when the lazy tscdeadline
   timer is kicked
 - Add kvm_vcpu_kick() when writing to the lazy_tscdeadline debugfs
   interface

Changes from V1:
 - In the 3rd patch, rename the variable of kvm_host_lazy_tscdeadline from
   'host' to 'hlt', and add more details to the comment of the patch
 - Add a 4th patch which adds a debugfs file for this feature

Any comment is welcome.
Thanks
Jianchao

Wang Jianchao (6):
  KVM: x86: add debugfs file for lazy tscdeadline per vcpu
  KVM: X86: add lazy tscdeadline support to reduce vm-exit of msr-write
  KVM: x86: do lazy_tscdeadline init and exit
  x86/apic: switch set_next_event to lazy tscdeadline version
  KVM: x86: exchange info about lazy_tscdeadline with msr
  KVM: x86: add msr register and data structure for lazy tscdeadline

 arch/x86/include/asm/kvm_host.h |  10 ++++++++
 arch/x86/kernel/apic/apic.c     |  30 +++++++++++++++++++++-
 arch/x86/kernel/kvm.c           |  13 ++++++++++
 arch/x86/kvm/debugfs.c          |  80 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/lapic.c            | 138 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 arch/x86/kvm/lapic.h            |   4 +++
 arch/x86/kvm/x86.c              |  27 ++++++++++++++++++++
 7 files changed, 291 insertions(+), 11 deletions(-)
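
For illustration only, the guest-side and host-side rules described in this
cover letter can be modelled in plain user-space C roughly as below. This is
a sketch, not code from the patches: the struct layout, the function names
(guest_set_deadline, host_wrmsr, host_timer_fired) and the exit counter are
hypothetical, the memory fence is only marked by a comment, and clearing
@armed to zero on injection is an assumption about how CLEAR might behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the shared fields; names follow the cover letter. */
struct lazy_tscdeadline {
	uint64_t armed;   /* deadline with a host timer; host writes only */
	uint64_t pending; /* latest deadline requested; guest writes only */
};

/* Counts simulated msr-write vm-exits. */
static unsigned int nr_wrmsr_exits;

/* Host side, UPDATE: the guest trapped via wrmsr, arm a timer for t. */
static void host_wrmsr(struct lazy_tscdeadline *s, uint64_t t)
{
	nr_wrmsr_exits++;
	s->armed = t;
}

/*
 * Guest side: always publish the new deadline in @pending, and trap only
 * when no timer is armed or the new deadline is earlier than the armed one.
 * Returns true if a (simulated) vm-exit happened.
 */
static bool guest_set_deadline(struct lazy_tscdeadline *s, uint64_t t)
{
	s->pending = t;
	/* Real code needs a fence here so the @armed read is not reordered. */
	if (s->armed == 0 || t < s->armed) {
		host_wrmsr(s, t);
		return true;
	}
	return false;
}

/*
 * Host side, KICK: the armed timer fired. Returns true if the local timer
 * interrupt should be injected, false if the timer was re-armed instead.
 */
static bool host_timer_fired(struct lazy_tscdeadline *s)
{
	/* @pending < @armed should not happen; the guest traps in that case. */
	assert(s->pending >= s->armed);

	if (s->pending > s->armed) {
		s->armed = s->pending; /* re-arm for the later deadline */
		return false;
	}
	s->armed = 0; /* assumed CLEAR: the next guest write must trap again */
	return true;
}
```

Feeding the t1..t4 sequence from the walk-through into this model gives one
simulated msr-write exit instead of four, and the expiry of t1 re-arms the
timer for t4 instead of injecting an interrupt.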